
Page 1: Machine Learning for Signal Processing: Regression and Prediction

Class 14, 25 Oct 2016
Instructor: Bhiksha Raj

Page 2: A Common Problem

• Can you spot the glitches?

Page 3: How to fix this problem?

• "Glitches" in audio
  – Must be detected. How?
• Then what?
• Glitches must be "fixed"
  – Delete the glitch
    • Results in a "hole"
  – Fill in the hole. How?


Page 4: Interpolation..

• “Extend” the curve on the left to “predict” the values in the “blank” region

– Forward prediction

• Extend the blue curve on the right leftwards to predict the blank region

– Backward prediction

• How?

– Regression analysis..

Page 5: Detecting the Glitch

• Regression-based reconstruction can be done anywhere

• Reconstructed value will not match actual value

• Large error of reconstruction identifies glitches

[Figure: regions where the reconstruction error is small are marked OK; the glitch region, with large error, is marked NOT OK]

Page 6: What is a regression

• Analyzing relationship between variables

• Expressed in many forms

• Wikipedia: Linear regression, Simple regression, Ordinary least squares, Polynomial regression, General linear model, Generalized linear model, Discrete choice, Logistic regression, Multinomial logit, Mixed logit, Probit, Multinomial probit, …

• Generally a tool to predict variables


Page 7: Regressions for prediction

• y = f(x; Θ) + e

• Different possibilities
  – y is a scalar
    • y is real
    • y is categorical (classification)
  – y is a vector
  – x is a vector
    • x is a set of real-valued variables
    • x is a set of categorical variables
    • x is a combination of the two
  – f(.) is a linear or affine function
  – f(.) is a non-linear function
  – f(.) is a time-series model


Page 8: A linear regression

• Assumption: the relationship between the variables is linear
  – A linear trend may be found relating x and y

– y = dependent variable

– x = explanatory variable

– Given x, y can be predicted as an affine function of x

[Figure: scatter plot of Y against X with a linear trend]

Page 9: An imaginary regression..

• http://pages.cs.wisc.edu/~kovar/hall.html

• Check this shit out (Fig. 1). That's bonafide, 100%-real data, my friends. I took it myself over the course of two weeks. And this was not a leisurely two weeks, either; I busted my ass day and night in order to provide you with nothing but the best data possible. Now, let's look a bit more closely at this data, remembering that it is absolutely first-rate. Do you see the exponential dependence? I sure don't. I see a bunch of crap. Christ, this was such a waste of my time. Banking on my hopes that whoever grades this will just look at the pictures, I drew an exponential through my noise. I believe the apparent legitimacy is enhanced by the fact that I used a complicated computer program to make the fit. I understand this is the same process by which the top quark was discovered.


Page 10: Linear Regressions

• y = aᵀx + b + e
  – e = prediction error
• Given a "training" set of {x, y} values: estimate a and b
  – y1 = aᵀx1 + b + e1
  – y2 = aᵀx2 + b + e2
  – y3 = aᵀx3 + b + e3
  – …

• If a and b are well estimated, prediction error will be small


Page 11: Linear Regression to a scalar

y1 = aᵀx1 + b + e1
y2 = aᵀx2 + b + e2
y3 = aᵀx3 + b + e3

• Rewrite. Define:

$$\mathbf{y} = [y_1\ y_2\ y_3\ \ldots], \qquad
\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \ldots \\ 1 & 1 & 1 & \ldots \end{bmatrix}, \qquad
\mathbf{A} = \begin{bmatrix} \mathbf{a} \\ b \end{bmatrix}, \qquad
\mathbf{e} = [e_1\ e_2\ e_3\ \ldots]$$

• Then:

$$\mathbf{y} = \mathbf{A}^T \mathbf{X} + \mathbf{e}$$

Page 12: Learning the parameters

• Given training data: several x,y

• Can define a "divergence" D(ŷ, y)
  – Measures how much ŷ differs from y
  – Ideally, if the model is accurate, this should be small
• Estimate a, b to minimize D(ŷ, y)

$$\hat{\mathbf{y}} = \mathbf{A}^T\mathbf{X} \quad\text{(assuming no error)}, \qquad
\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}$$

Page 13: The prediction error as divergence

• Define divergence as sum of the squared error in predicting y

y1 = aᵀx1 + b + e1
y2 = aᵀx2 + b + e2
y3 = aᵀx3 + b + e3

$$D(\mathbf{y},\hat{\mathbf{y}}) = E = e_1^2 + e_2^2 + e_3^2 + \ldots
= (y_1 - \mathbf{a}^T\mathbf{x}_1 - b)^2 + (y_2 - \mathbf{a}^T\mathbf{x}_2 - b)^2 + (y_3 - \mathbf{a}^T\mathbf{x}_3 - b)^2 + \ldots$$

$$E = \lVert \mathbf{y} - \mathbf{A}^T\mathbf{X} \rVert^2 = (\mathbf{y} - \mathbf{A}^T\mathbf{X})(\mathbf{y} - \mathbf{A}^T\mathbf{X})^T$$

Page 14: Prediction error as divergence

• y = ATx + e

– e = prediction error

– Find the “slope” a such that the total squared length of the error lines is minimized


Page 15: Solving a linear regression

• Minimize squared error

$$\mathbf{y} = \mathbf{A}^T\mathbf{X} + \mathbf{e}, \qquad E = \lVert \mathbf{y} - \mathbf{A}^T\mathbf{X} \rVert^2$$

$$\mathbf{A}^T = \mathbf{y}\,\mathrm{pinv}(\mathbf{X})
\quad\Longleftrightarrow\quad
\mathbf{A} = \mathrm{pinv}(\mathbf{X})^T\,\mathbf{y}^T$$

Page 16: More Explicitly

• Minimize squared error

$$E = \lVert \mathbf{y} - \mathbf{A}^T\mathbf{X} \rVert^2
= (\mathbf{y} - \mathbf{A}^T\mathbf{X})(\mathbf{y} - \mathbf{A}^T\mathbf{X})^T
= \mathbf{y}\mathbf{y}^T - 2\,\mathbf{y}\mathbf{X}^T\mathbf{A} + \mathbf{A}^T\mathbf{X}\mathbf{X}^T\mathbf{A}$$

• Differentiating w.r.t. A and equating to 0

$$\frac{dE}{d\mathbf{A}} = 2\,\mathbf{X}\mathbf{X}^T\mathbf{A} - 2\,\mathbf{X}\mathbf{y}^T = 0
\quad\Rightarrow\quad
\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{y}^T$$
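To make the closed-form solution concrete, here is a minimal NumPy sketch (my own illustration, not from the slides; the data and variable names are assumptions) that fits the affine model y = aᵀx + b by appending a row of ones to X and applying the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 200
a_true, b_true = np.array([1.0, -2.0, 0.5]), 0.7

X = rng.normal(size=(d, N))                      # training inputs, one column per instance
y = a_true @ X + b_true + 0.1 * rng.normal(size=N)

# Append a row of ones so the offset b becomes the last entry of A
X_aug = np.vstack([X, np.ones((1, N))])          # shape (d+1, N)

# Closed-form least squares: A = (X X^T)^{-1} X y^T = pinv(X^T) y^T
A = np.linalg.pinv(X_aug.T) @ y
a_est, b_est = A[:-1], A[-1]

print("estimated a:", a_est)
print("estimated b:", b_est)
print("training MSE:", np.mean((y - A @ X_aug) ** 2))
```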

Page 17: Regression in multiple dimensions

• Also called multiple regression

• Equivalent of saying:

• Fundamentally no different from N separate single regressions

– But we can use the relationship between ys to our benefit


y1 = ATx1 + b + e1

y2 = ATx2 + b + e2

y3 = ATx3 + b + e3

yi is a vector

yi = ATxi + b + ei

yi1 = a1Txi + b1 + ei1

yi2 = a2Txi + b2 + ei2

yi3 = a3Txi + b3 + ei3

yij = jth component of vector yi

ai = ith column of A

bj = jth component of b

Page 18: Multiple Regression

• Define (stacking the offsets into the parameter matrix):

$$\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2\ \mathbf{y}_3\ \ldots], \qquad
\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \ldots \\ 1 & 1 & 1 & \ldots \end{bmatrix}, \qquad
\mathbf{A} \leftarrow \begin{bmatrix} \mathbf{A} \\ \mathbf{b}^T \end{bmatrix}, \qquad
\mathbf{E} = [\mathbf{e}_1\ \mathbf{e}_2\ \mathbf{e}_3\ \ldots]$$

so that \(\mathbf{Y} = \mathbf{A}^T\mathbf{X} + \mathbf{E}\).

• Minimizing

$$DIV = \sum_i \lVert \mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i \rVert^2$$

gives

$$\hat{\mathbf{A}} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{Y}^T$$

Page 19: A Different Perspective

• y is a noisy reading of ATx

• Error e is Gaussian

• Estimate A from

$$\mathbf{y} = \mathbf{A}^T\mathbf{x} + \mathbf{e}, \qquad \mathbf{e} \sim N(0, \sigma^2\mathbf{I})$$

$$\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2 \ldots \mathbf{y}_N], \qquad \mathbf{X} = [\mathbf{x}_1\ \mathbf{x}_2 \ldots \mathbf{x}_N]$$

Page 20: The Likelihood of the data

• Probability of observing a specific y, given x, for a particular matrix A

• Probability of collection:

• Assuming IID for convenience (not necessary)

$$\mathbf{y} = \mathbf{A}^T\mathbf{x} + \mathbf{e}, \quad \mathbf{e} \sim N(0, \sigma^2\mathbf{I})
\quad\Rightarrow\quad
P(\mathbf{y}\mid\mathbf{x}; \mathbf{A}) = N(\mathbf{y};\ \mathbf{A}^T\mathbf{x},\ \sigma^2\mathbf{I})$$

$$\mathbf{Y} = [\mathbf{y}_1 \ldots \mathbf{y}_N], \quad \mathbf{X} = [\mathbf{x}_1 \ldots \mathbf{x}_N]
\quad\Rightarrow\quad
P(\mathbf{Y}\mid\mathbf{X}; \mathbf{A}) = \prod_i N(\mathbf{y}_i;\ \mathbf{A}^T\mathbf{x}_i,\ \sigma^2\mathbf{I})$$

Page 21: A Maximum Likelihood Estimate

• Maximizing the log probability is identical to minimizing the error

– Identical to the least squares solution

$$\mathbf{y} = \mathbf{A}^T\mathbf{x} + \mathbf{e},\quad \mathbf{e}\sim N(0,\sigma^2\mathbf{I}),\quad
\mathbf{Y} = [\mathbf{y}_1 \ldots \mathbf{y}_N],\quad \mathbf{X} = [\mathbf{x}_1 \ldots \mathbf{x}_N]$$

$$P(\mathbf{Y}\mid\mathbf{X}) = \prod_i \frac{1}{(2\pi\sigma^2)^{D/2}}
\exp\!\left( -\frac{1}{2\sigma^2}\lVert \mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i \rVert^2 \right)$$

$$\log P(\mathbf{Y}\mid\mathbf{X}; \mathbf{A}) = C - \frac{1}{2\sigma^2}\sum_i \lVert \mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i \rVert^2$$

$$\hat{\mathbf{A}} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T = \mathrm{pinv}(\mathbf{X}^T)\,\mathbf{Y}^T$$

Page 22: Predicting an output

• From a collection of training data, have learned A

• Given x for a new instance, but not y, what is y?

• Simple solution:

$$\hat{\mathbf{y}} = \mathbf{A}^T\mathbf{x}$$

Page 23: Applying it to our problem

• Prediction by regression
• Forward regression: $x_t = a_1 x_{t-1} + a_2 x_{t-2} + \ldots + a_K x_{t-K} + e_t$
• Backward regression: $x_t = b_1 x_{t+1} + b_2 x_{t+2} + \ldots + b_K x_{t+K} + e_t$

Page 24: Applying it to our problem

• Forward prediction: stack the prediction equations for successive samples

$$\begin{bmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{K+1} \end{bmatrix} =
\begin{bmatrix} x_{t-1} & x_{t-2} & \cdots & x_{t-K} \\
x_{t-2} & x_{t-3} & \cdots & x_{t-K-1} \\
\vdots & & & \vdots \\
x_{K} & x_{K-1} & \cdots & x_{1} \end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_K \end{bmatrix} +
\begin{bmatrix} e_t \\ e_{t-1} \\ \vdots \\ e_{K+1} \end{bmatrix}$$

$$\bar{\mathbf{x}}_t = \mathbf{X}\mathbf{a} + \mathbf{e}
\quad\Rightarrow\quad
\mathbf{a} = \mathrm{pinv}(\mathbf{X})\,\bar{\mathbf{x}}_t$$

Page 25: Applying it to our problem

• Backward prediction: stack the equations the same way, now using the samples that follow each target

$$\begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{t-K} \end{bmatrix} =
\begin{bmatrix} x_{2} & x_{3} & \cdots & x_{K+1} \\
x_{3} & x_{4} & \cdots & x_{K+2} \\
\vdots & & & \vdots \\
x_{t-K+1} & x_{t-K+2} & \cdots & x_{t} \end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_K \end{bmatrix} +
\begin{bmatrix} e_{1} \\ e_{2} \\ \vdots \\ e_{t-K} \end{bmatrix}$$

$$\bar{\mathbf{x}}_t = \mathbf{X}\mathbf{b} + \mathbf{e}
\quad\Rightarrow\quad
\mathbf{b} = \mathrm{pinv}(\mathbf{X})\,\bar{\mathbf{x}}_t$$
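A small sketch of how these stacked systems might be solved in practice (my own illustration; the lag order K, the test signal, and all names are assumptions). It builds the lagged-sample matrix for a segment and estimates the forward and backward predictor coefficients with the pseudoinverse:

```python
import numpy as np

def forward_predictor(x, K):
    """a such that x[t] ~ a[0]*x[t-1] + ... + a[K-1]*x[t-K] (least squares via pinv)."""
    X = np.stack([x[t - 1 - np.arange(K)] for t in range(K, len(x))])  # lagged-sample matrix
    return np.linalg.pinv(X) @ x[K:]                                   # a = pinv(X) x_target

def backward_predictor(x, K):
    """b such that x[t] ~ b[0]*x[t+1] + ... + b[K-1]*x[t+K]: reverse the signal and reuse."""
    return forward_predictor(x[::-1], K)

rng = np.random.default_rng(1)
x = np.sin(0.3 * np.arange(500)) + 0.05 * rng.normal(size=500)   # a predictable test signal
print("forward coefficients:", np.round(forward_predictor(x, 8), 3))
print("backward coefficients:", np.round(backward_predictor(x, 8), 3))
```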

Page 26: Finding the burst

• At each time t
  – Learn a "forward" predictor aₜ
  – Predict the next sample: $x_t^{est} = \sum_k a_{t,k}\, x_{t-k}$
  – Compute the forward error: $ferr_t = |x_t - x_t^{est}|^2$
  – Learn a "backward" predictor and compute the backward error $berr_t$
  – Average the prediction error over a window and threshold it
  – If the average error exceeds the threshold, flag a burst (glitch)
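A sketch of the detection step under simplifying assumptions of my own (a single predictor fitted over the whole segment rather than re-estimated at each time, an assumed window length and threshold, and a synthetic test signal):

```python
import numpy as np

def fit_predictor(seg, K):
    """Least-squares forward predictor: seg[t] ~ sum_k a[k] * seg[t-1-k]."""
    X = np.stack([seg[t - 1 - np.arange(K)] for t in range(K, len(seg))])
    return np.linalg.pinv(X) @ seg[K:]

def glitch_flags(x, K=8, win=32, thresh=5.0):
    """Flag samples whose windowed squared prediction error is unusually large."""
    a = fit_predictor(x, K)
    pred = np.array([x[t - 1 - np.arange(K)] @ a for t in range(K, len(x))])
    err = (x[K:] - pred) ** 2                                   # ferr_t = |x_t - x_t_est|^2
    smooth = np.convolve(err, np.ones(win) / win, mode="same")  # average over a window
    return smooth > thresh * np.median(smooth)                  # threshold the averaged error

rng = np.random.default_rng(2)
x = np.sin(0.3 * np.arange(2000)) + 0.02 * rng.normal(size=2000)
x[900:910] += 3.0 * rng.normal(size=10)                         # inject a synthetic glitch
flags = glitch_flags(x)
print("first flagged sample:", np.flatnonzero(flags)[0] + 8)    # +K maps back to sample index
```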

Page 27: Filling the hole

• Learn a "forward" predictor at the left edge of the "hole"
  – For each missing sample, predict: $x_t^{est} = \sum_k a_{t,k}\, x_{t-k}$
  – Use estimated samples where real samples are not available
• Learn a "backward" predictor at the right edge of the "hole"
  – For each missing sample, predict: $x_t^{est} = \sum_k b_{t,k}\, x_{t+k}$
  – Use estimated samples where real samples are not available
• Average the forward and backward predictions

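And a sketch of the interpolation step, again under my own simplifying assumptions (a fixed lag order, predictors learned once from the context on either side of the hole, and plain averaging of the two predictions):

```python
import numpy as np

def fit_predictor(seg, K):
    """Least-squares forward predictor (same helper as in the previous sketch)."""
    X = np.stack([seg[t - 1 - np.arange(K)] for t in range(K, len(seg))])
    return np.linalg.pinv(X) @ seg[K:]

def fill_hole(x, start, end, K=8, ctx=300):
    """Fill x[start:end] by averaging forward and backward predictions into the hole."""
    a = fit_predictor(x[start - ctx:start], K)        # predictor from the left context
    b = fit_predictor(x[end:end + ctx][::-1], K)      # backward predictor from the right context

    fwd = list(x[start - K:start])                    # seed with the last K real samples
    bwd = list(x[end:end + K][::-1])                  # seed with the K real samples after the hole
    for _ in range(end - start):
        fwd.append(np.dot(a, fwd[-1:-K - 1:-1]))      # forward prediction, feeding back estimates
        bwd.append(np.dot(b, bwd[-1:-K - 1:-1]))      # backward prediction, same idea
    filled = x.copy()
    filled[start:end] = 0.5 * (np.array(fwd[K:]) + np.array(bwd[K:])[::-1])
    return filled

rng = np.random.default_rng(3)
x = np.sin(0.3 * np.arange(2000)) + 0.02 * rng.normal(size=2000)
y = fill_hole(x, 1000, 1040)
print("max interpolation error:", np.abs(y[1000:1040] - x[1000:1040]).max())
```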

Page 28: Reconstruction zoom-in

[Figure: zoom-in on the reconstruction area, comparing the distorted signal, the interpolation result, and the actual data; the recovered signal and the next glitch are marked]

Page 29: Incrementally learning the regression

• Can we learn A incrementally instead, as the data comes in?
• The batch solution requires knowledge of all (x, y) pairs:

$$\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T$$

• The Widrow-Hoff rule (scalar prediction version):

$$\hat{y}_t = \hat{\mathbf{a}}_t^T \mathbf{x}_t, \qquad
\hat{\mathbf{a}}_{t+1} = \hat{\mathbf{a}}_t + \eta\,(y_t - \hat{y}_t)\,\mathbf{x}_t$$

  – The term $(y_t - \hat{y}_t)$ is the prediction error
• Note the structure: can also be done in batch mode!
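A minimal sketch of this incremental (Widrow-Hoff / LMS-style) update for the scalar-prediction case; the learning rate eta and the synthetic data stream are my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 4, 5000
a_true = rng.normal(size=d)

a_hat = np.zeros(d)
eta = 0.01                                    # step size (assumed)
for _ in range(T):
    x_t = rng.normal(size=d)                  # incoming input
    y_t = a_true @ x_t + 0.05 * rng.normal()  # incoming target
    y_pred = a_hat @ x_t                      # current prediction
    a_hat += eta * (y_t - y_pred) * x_t       # Widrow-Hoff update: step along the error

print("true a:     ", np.round(a_true, 3))
print("estimated a:", np.round(a_hat, 3))
```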

Page 30: Predicting a value

• What are we doing exactly?
  – For this explanation we assume no "b" (X is zero mean)
  – The explanation generalizes easily otherwise

$$\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T
\quad\Rightarrow\quad
\hat{\mathbf{y}} = \mathbf{A}^T\mathbf{x} = \mathbf{Y}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{x}$$

• Let \(\mathbf{C} = \mathbf{X}\mathbf{X}^T\) and define the whitened variables
\(\hat{\mathbf{x}} = \mathbf{C}^{-\frac{1}{2}}\mathbf{x}\), \(\hat{\mathbf{X}} = \mathbf{C}^{-\frac{1}{2}}\mathbf{X}\)
  – \(\mathbf{C}^{-\frac{1}{2}}\) (up to a factor \(N^{-\frac{1}{2}}\)) is the whitening matrix for x

$$\hat{\mathbf{y}} = \mathbf{Y}\mathbf{X}^T\mathbf{C}^{-\frac{1}{2}}\mathbf{C}^{-\frac{1}{2}}\mathbf{x}
= \mathbf{Y}\hat{\mathbf{X}}^T\hat{\mathbf{x}}$$

Page 31: Predicting a value

• What are we doing exactly?

$$\hat{\mathbf{y}} = \mathbf{Y}\hat{\mathbf{X}}^T\hat{\mathbf{x}}
= [\mathbf{y}_1\ \ldots\ \mathbf{y}_N]
\begin{bmatrix} \hat{\mathbf{x}}_1^T\hat{\mathbf{x}} \\ \vdots \\ \hat{\mathbf{x}}_N^T\hat{\mathbf{x}} \end{bmatrix}
= \sum_i \mathbf{y}_i\,(\hat{\mathbf{x}}_i^T\hat{\mathbf{x}})$$

Page 32: Predicting a value

• Given training instances (xᵢ, yᵢ) for i = 1..N, estimate y for a new test instance x with unknown y:
• y is simply a weighted sum of the yᵢ instances from the training data
• The weight of any yᵢ is simply the inner product between its corresponding xᵢ and the new x
  – With due whitening and scaling..

$$\hat{\mathbf{y}} = \sum_i \mathbf{y}_i\,(\hat{\mathbf{x}}_i^T\hat{\mathbf{x}})$$
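The two views can be checked numerically. This sketch (my own, with assumed dimensions and data) verifies that the regression prediction Y Xᵀ(X Xᵀ)⁻¹x equals the weighted sum of training targets with whitened inner products as weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 4, 50
X = rng.normal(size=(d, N))          # zero-mean explanatory vectors (columns)
Y = rng.normal(size=(2, N))          # 2-dimensional targets
x = rng.normal(size=d)               # new test instance

# Direct least-squares prediction: y_hat = Y X^T (X X^T)^{-1} x
C = X @ X.T
y_direct = Y @ X.T @ np.linalg.solve(C, x)

# Whitened view: y_hat = sum_i y_i (x_i_hat . x_hat), whitening matrix C^{-1/2}
evals, evecs = np.linalg.eigh(C)
C_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
X_w = C_inv_sqrt @ X                 # whitened training inputs
x_w = C_inv_sqrt @ x                 # whitened test input
y_weighted = Y @ (X_w.T @ x_w)       # weighted sum of the training y_i

print(np.allclose(y_direct, y_weighted))   # True: both views give the same answer
```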

Page 33: What are we doing: A different perspective

• This assumes XXᵀ is invertible
• What if it is not?
  – Dimensionality of X is greater than the number of observations
  – Underdetermined
• In this case XᵀX will generally be invertible

$$\hat{\mathbf{y}} = \mathbf{A}^T\mathbf{x} = \mathbf{Y}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{x}$$

• Instead, use

$$\mathbf{A} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{Y}^T
\quad\Rightarrow\quad
\hat{\mathbf{y}} = \mathbf{Y}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{x}$$

Page 34: High-dimensional regression

• XᵀX is the "Gram matrix"

$$\hat{\mathbf{y}} = \mathbf{Y}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{x}$$

$$\mathbf{G} = \mathbf{X}^T\mathbf{X} =
\begin{bmatrix}
\mathbf{x}_1^T\mathbf{x}_1 & \mathbf{x}_1^T\mathbf{x}_2 & \cdots & \mathbf{x}_1^T\mathbf{x}_N \\
\mathbf{x}_2^T\mathbf{x}_1 & \mathbf{x}_2^T\mathbf{x}_2 & \cdots & \mathbf{x}_2^T\mathbf{x}_N \\
\vdots & & & \vdots \\
\mathbf{x}_N^T\mathbf{x}_1 & \mathbf{x}_N^T\mathbf{x}_2 & \cdots & \mathbf{x}_N^T\mathbf{x}_N
\end{bmatrix}$$

$$\hat{\mathbf{y}} = \mathbf{Y}\mathbf{G}^{-1}\mathbf{X}^T\mathbf{x}$$

Page 35: High-dimensional regression

• Normalize Y by the inverse of the gram matrix

$$\hat{\mathbf{y}} = \mathbf{Y}\mathbf{G}^{-1}\mathbf{X}^T\mathbf{x}$$

$$\bar{\mathbf{Y}} = \mathbf{Y}\mathbf{G}^{-1}
\quad\Rightarrow\quad
\hat{\mathbf{y}} = \bar{\mathbf{Y}}\mathbf{X}^T\mathbf{x} = \sum_i \bar{\mathbf{y}}_i\,(\mathbf{x}_i^T\mathbf{x})$$

• Working our way down..

Page 36: Linear Regression in High-dimensional Spaces

• Given training instances (xᵢ, yᵢ) for i = 1..N, estimate y for a new test instance x with unknown y:
• y is simply a weighted sum of the normalized yᵢ instances from the training data
  – The normalization is done via the Gram matrix
• The weight of any yᵢ is simply the inner product between its corresponding xᵢ and the new x

$$\bar{\mathbf{Y}} = \mathbf{Y}\mathbf{G}^{-1}, \qquad
\hat{\mathbf{y}} = \sum_i \bar{\mathbf{y}}_i\,(\mathbf{x}_i^T\mathbf{x})$$
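A sketch of this dual (Gram-matrix) form for the case where the dimensionality exceeds the number of training points; the data, names, and the tiny diagonal load added for numerical stability are my own choices:

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 100, 20                         # dimension much larger than the number of points
X = rng.normal(size=(d, N))            # training inputs (columns)
Y = rng.normal(size=(3, N))            # training targets
x = rng.normal(size=d)                 # new test point

G = X.T @ X                            # N x N Gram matrix of inner products
G += 1e-8 * np.eye(N)                  # tiny diagonal load for numerical stability (my addition)

Y_bar = Y @ np.linalg.inv(G)           # "normalize" Y by the inverse Gram matrix
y_hat = Y_bar @ (X.T @ x)              # weighted sum: sum_i ybar_i * (x_i . x)

print(y_hat)
```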

Page 37: Relationships are not always linear

• How do we model these?

• Multiple solutions


Page 38: Non-linear regression

• y = A φ(x) + e, where

$$\varphi(\mathbf{x}) = [\phi_1(\mathbf{x})\ \phi_2(\mathbf{x})\ \ldots\ \phi_N(\mathbf{x})], \qquad
\Phi(\mathbf{X}) = [\varphi(\mathbf{x}_1)\ \varphi(\mathbf{x}_2)\ \ldots\ \varphi(\mathbf{x}_K)]$$

• Y = A Φ(X) + e
• Replace X with Φ(X) in the earlier equations for the solution:

$$\mathbf{A} = \left(\Phi(\mathbf{X})\,\Phi(\mathbf{X})^T\right)^{-1}\Phi(\mathbf{X})\,\mathbf{Y}^T$$

Page 39: Problem

• Y = A Φ(X) + e. Replace X with Φ(X) in the earlier equations for the solution:

$$\mathbf{A} = \left(\Phi(\mathbf{X})\,\Phi(\mathbf{X})^T\right)^{-1}\Phi(\mathbf{X})\,\mathbf{Y}^T$$

• Φ(X) may be in a very high-dimensional space
• The high-dimensional space (or the transform Φ(X)) may be unknown..

Page 40: The regression is in high dimensions

• Linear regression:

$$\bar{\mathbf{Y}} = \mathbf{Y}\mathbf{G}^{-1}, \qquad
\hat{\mathbf{y}} = \sum_i \bar{\mathbf{y}}_i\,(\mathbf{x}_i^T\mathbf{x}), \qquad
\mathbf{G} = \begin{bmatrix}
\mathbf{x}_1^T\mathbf{x}_1 & \cdots & \mathbf{x}_1^T\mathbf{x}_N \\
\vdots & & \vdots \\
\mathbf{x}_N^T\mathbf{x}_1 & \cdots & \mathbf{x}_N^T\mathbf{x}_N
\end{bmatrix}$$

• High-dimensional regression: the same equations, with x now the (possibly unknown) high-dimensional φ(x), so that only the inner products are needed:

$$\bar{\mathbf{Y}} = \mathbf{Y}\mathbf{G}^{-1}, \qquad
\hat{\mathbf{y}} = \sum_i \bar{\mathbf{y}}_i\,\varphi(\mathbf{x}_i)^T\varphi(\mathbf{x}), \qquad
G_{ij} = \varphi(\mathbf{x}_i)^T\varphi(\mathbf{x}_j)$$

Page 41: Doing it with Kernels

• High-dimensional regression with kernels: $K(\mathbf{x}, \mathbf{y}) = \varphi(\mathbf{x})^T\varphi(\mathbf{y})$

$$\mathbf{G} = \begin{bmatrix}
K(\mathbf{x}_1,\mathbf{x}_1) & K(\mathbf{x}_1,\mathbf{x}_2) & \cdots & K(\mathbf{x}_1,\mathbf{x}_N) \\
K(\mathbf{x}_2,\mathbf{x}_1) & K(\mathbf{x}_2,\mathbf{x}_2) & \cdots & K(\mathbf{x}_2,\mathbf{x}_N) \\
\vdots & & & \vdots \\
K(\mathbf{x}_N,\mathbf{x}_1) & K(\mathbf{x}_N,\mathbf{x}_2) & \cdots & K(\mathbf{x}_N,\mathbf{x}_N)
\end{bmatrix}$$

$$\bar{\mathbf{Y}} = \mathbf{Y}\mathbf{G}^{-1}, \qquad
\hat{\mathbf{y}} = \sum_i \bar{\mathbf{y}}_i\, K(\mathbf{x}_i, \mathbf{x})$$

• Regression in Kernel Hilbert Space..
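The same computation with a kernel in place of the raw inner products. This sketch uses a Gaussian (RBF) kernel; the kernel width, the data, and the small diagonal load are assumptions on my part:

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||A[:, i] - B[:, j]||^2)."""
    d2 = (A * A).sum(0)[:, None] + (B * B).sum(0)[None, :] - 2 * A.T @ B
    return np.exp(-gamma * d2)

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(1, 40))            # 1-D inputs (as columns)
Y = np.sin(X) + 0.1 * rng.normal(size=X.shape)  # noisy nonlinear targets

G = rbf(X, X) + 1e-6 * np.eye(X.shape[1])       # kernel Gram matrix (+ tiny load, my addition)
Y_bar = Y @ np.linalg.inv(G)                    # normalize targets by G^{-1}

x_test = np.linspace(-3, 3, 200)[None, :]       # query points
y_hat = Y_bar @ rbf(X, x_test)                  # y_hat = sum_i ybar_i K(x_i, x)
print(y_hat.shape)                              # (1, 200): a smooth nonlinear fit
```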

Page 42: A different way of finding nonlinear relationships: Locally linear regression

• Previous discussion: Regression parameters are optimized over the entire training set

• Minimize

• Single global regression is estimated and applied to all future x

• Alternative: Local regression

• Learn a regression that is specific to x

$$E = \sum_{\text{all } i} \lVert \mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i - \mathbf{b} \rVert^2$$

Page 43: Being non-committal: Local Regression

• Estimate the regression to be applied to any x using training instances near x

• The resultant regression has the form

– Note : this regression is specific to x

• A separate regression must be learned for every x

$$\hat{\mathbf{y}}(\mathbf{x}) = \sum_{j\,\in\,\text{neighborhood}(\mathbf{x})} d(\mathbf{x}, \mathbf{x}_j)\, \mathbf{y}_j + \mathbf{e}$$

$$E = \sum_{i\,\in\,\text{neighborhood}(\mathbf{x})} \lVert \mathbf{y}_i - \mathbf{A}^T\mathbf{x}_i - \mathbf{b} \rVert^2$$

Page 44: Local Regression

• But what is d()?
  – For linear regression, d() is an inner product
• More generic form: choose d() as a function of the distance between x and xⱼ
• If d() falls off rapidly with the distance between x and xⱼ, the "neighborhood" requirement can be relaxed

$$\hat{\mathbf{y}}(\mathbf{x}) = \sum_{j\,\in\,\text{neighborhood}(\mathbf{x})} d(\mathbf{x},\mathbf{x}_j)\,\mathbf{y}_j + \mathbf{e}
\quad\longrightarrow\quad
\hat{\mathbf{y}} = \sum_{\text{all } j} d(\mathbf{x},\mathbf{x}_j)\,\mathbf{y}_j + \mathbf{e}$$

Page 45: Kernel Regression: d() = K()

• Typical kernel functions: Gaussian, Laplacian, other density functions
  – Must fall off rapidly with increasing distance between x and xⱼ
• The regression is local to every x: local regression
• Actually a non-parametric MAP estimator of y
  – But first.. MAP estimators..

$$\hat{\mathbf{y}} = \frac{\sum_i K_h(\mathbf{x} - \mathbf{x}_i)\,\mathbf{y}_i}{\sum_i K_h(\mathbf{x} - \mathbf{x}_i)}$$
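A sketch of this kernel-weighted local estimate (the Nadaraya-Watson form); the Gaussian kernel, the bandwidth h, and the data are my own choices:

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, h=0.3):
    """y_hat(x) = sum_i K_h(x - x_i) y_i / sum_i K_h(x - x_i), with a Gaussian K_h."""
    diffs = x_query[:, None] - x_train[None, :]          # pairwise differences
    w = np.exp(-0.5 * (diffs / h) ** 2)                  # Gaussian kernel weights
    return (w @ y_train) / w.sum(axis=1)                 # weighted average of the y_i

rng = np.random.default_rng(6)
x_train = rng.uniform(-3, 3, 100)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=100)
x_query = np.linspace(-3, 3, 200)

y_hat = kernel_regression(x_train, y_train, x_query)
print(y_hat[:5])
```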

Page 46: MAP Estimators

• MAP (Maximum A Posteriori): find a "best guess" for y (statistically), given known x:
  y = argmax_Y P(Y | x)
• ML (Maximum Likelihood): find that value of y for which the statistical best guess of x would have been the observed x:
  y = argmax_Y P(x | Y)
• MAP is simpler to visualize


Page 47: MAP estimation: Gaussian PDF

• Assume X and Y are jointly Gaussian
• The parameters of the Gaussian are learned from training data

[Figure: joint Gaussian density over (X, Y)]

Page 48: Learning the parameters of the Gaussian

$$\mathbf{z} = \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix}, \qquad
\boldsymbol{\mu}_z = \frac{1}{N}\sum_{i=1}^{N} \mathbf{z}_i, \qquad
\mathbf{C}_z = \frac{1}{N}\sum_{i=1}^{N} (\mathbf{z}_i - \boldsymbol{\mu}_z)(\mathbf{z}_i - \boldsymbol{\mu}_z)^T
= \begin{bmatrix} \mathbf{C}_{XX} & \mathbf{C}_{XY} \\ \mathbf{C}_{YX} & \mathbf{C}_{YY} \end{bmatrix}$$

Page 49: Learning the parameters of the Gaussian

$$\mathbf{z} = \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix}, \qquad
\boldsymbol{\mu}_z = \frac{1}{N}\sum_{i=1}^{N}\mathbf{z}_i, \qquad
\mathbf{C}_z = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{z}_i - \boldsymbol{\mu}_z)(\mathbf{z}_i - \boldsymbol{\mu}_z)^T
= \begin{bmatrix} \mathbf{C}_{XX} & \mathbf{C}_{XY} \\ \mathbf{C}_{YX} & \mathbf{C}_{YY} \end{bmatrix}$$

• In particular,

$$\boldsymbol{\mu}_x = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i, \qquad
\mathbf{C}_{XY} = \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i - \boldsymbol{\mu}_x)(\mathbf{y}_i - \boldsymbol{\mu}_y)^T$$

Page 50: MAP estimation: Gaussian PDF

• Assume X and Y are jointly Gaussian; the parameters of the Gaussian are learned from training data

[Figure: joint Gaussian density over (X, Y)]

Page 51: MAP Estimator for Gaussian RV

• Assume X and Y are jointly Gaussian; the parameters of the Gaussian are learned from training data
• Now we are given an X, but no Y. What is Y?

[Figure: level set of the Gaussian, with the observed value x0 marked]

Page 52: MAP estimator for Gaussian RV

[Figure: level set of the Gaussian, with x0 marked]

Page 53: MAP estimation: Gaussian PDF

[Figure: joint Gaussian density over (X, Y)]

Page 54: MAP estimation: The Gaussian at a particular value of X

[Figure: the conditional distribution of Y at X = x0]

Page 55: MAP estimation: The Gaussian at a particular value of X

[Figure: the conditional distribution of Y at X = x0, with its most likely value marked]

Page 56: MAP Estimation of a Gaussian RV

• Y = argmax_y P(y | X) ???

[Figure: conditional density at x0]

Page 57: MAP Estimation of a Gaussian RV

[Figure: conditional density at x0]

Page 58: MAP Estimation of a Gaussian RV

• Y = argmax_y P(y | X)

[Figure: the peak of the conditional density at x0]

Page 59: So what is this value?

• Clearly a line

• Equation of Line:

• Scalar version given; vector version is identical

• Derivation? Later in the program a bit – Note the similarity to regression

$$\hat{y} = \mu_Y + C_{YX}\, C_{XX}^{-1}\,(x - \mu_X)$$
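A sketch of this estimator in code, assuming the joint mean and covariance have been learned from training data as on the previous slides (the synthetic data and names are mine):

```python
import numpy as np

rng = np.random.default_rng(7)
N, dx, dy = 2000, 2, 1

# Synthetic jointly Gaussian training data: y depends linearly on x plus noise
Xs = rng.normal(size=(N, dx))
Ys = Xs @ np.array([[1.5], [-0.7]]) + 2.0 + 0.3 * rng.normal(size=(N, dy))
Z = np.hstack([Xs, Ys])                                  # z = [x; y], one row per instance

mu = Z.mean(axis=0)
C = np.cov(Z, rowvar=False)                              # joint covariance
mu_x, mu_y = mu[:dx], mu[dx:]
C_xx, C_yx = C[:dx, :dx], C[dx:, :dx]

def map_estimate(x):
    """y_hat = mu_Y + C_YX C_XX^{-1} (x - mu_X): MAP (= MMSE) estimate for Gaussians."""
    return mu_y + C_yx @ np.linalg.solve(C_xx, x - mu_x)

print(map_estimate(np.array([1.0, 0.0])))   # close to 1.5*1.0 - 0.7*0.0 + 2.0 = 3.5
```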

Page 60: This is a multiple regression

$$\hat{y} = \mu_Y + C_{YX}\, C_{XX}^{-1}\,(x - \mu_X)$$

• This is the MAP estimate of y
  – y = argmax_Y P(Y | x)
• What about the ML estimate of y?
  – argmax_Y P(x | Y)
• Note: neither of these may be the regression line!
  – MAP estimation of y is the regression on Y for Gaussian RVs
  – But this is not the MAP estimation of the regression parameter

Page 61: A Closer Look

• A familiar equation
• Assuming 0 mean for simplicity:

$$\hat{\mathbf{y}} = \boldsymbol{\mu}_Y + \mathbf{C}_{YX}\mathbf{C}_{XX}^{-1}(\mathbf{x} - \boldsymbol{\mu}_X)
\quad\longrightarrow\quad
\hat{\mathbf{y}} = \mathbf{C}_{YX}\mathbf{C}_{XX}^{-1}\mathbf{x}
= \mathbf{Y}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{x}
= \mathbf{Y}\hat{\mathbf{X}}^T\hat{\mathbf{x}}
= \sum_{i\,\in\,\text{training}} \mathbf{y}_i\,(\hat{\mathbf{x}}_i^T\hat{\mathbf{x}}) \quad\text{(whitened)}$$

• Linear regression actually gives you a MAP estimate under the Gaussian assumption

Page 62: It's also a minimum mean-squared-error estimate

• General principle of MMSE estimation:

– y is unknown, x is known

– Must estimate it such that the expected squared error is minimized

– Minimize above term

$$Err = E\!\left[\, \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert^2 \mid \mathbf{x} \right]$$

Page 63: It's also a minimum mean-squared-error estimate

• Minimize error:

• Differentiating and equating to 0:

$$Err = E\!\left[(\mathbf{y}-\hat{\mathbf{y}})^T(\mathbf{y}-\hat{\mathbf{y}}) \mid \mathbf{x}\right]
= E[\mathbf{y}^T\mathbf{y}\mid\mathbf{x}] - 2\,\hat{\mathbf{y}}^T E[\mathbf{y}\mid\mathbf{x}] + \hat{\mathbf{y}}^T\hat{\mathbf{y}}$$

$$\frac{d\,Err}{d\hat{\mathbf{y}}} = -2\,E[\mathbf{y}\mid\mathbf{x}] + 2\,\hat{\mathbf{y}} = 0
\quad\Rightarrow\quad
\hat{\mathbf{y}} = E[\mathbf{y}\mid\mathbf{x}]$$

• The MMSE estimate is the mean of the distribution

Page 64: For the Gaussian: MAP = MMSE

• The most likely value is also the mean value
• This would be true of any symmetric distribution

[Figure: symmetric conditional density whose peak coincides with its mean]

Page 65: MMSE estimates for mixture distributions

• Let P(y|x) be a mixture density:

$$P(\mathbf{y}\mid\mathbf{x}) = \sum_k P(k)\,P(\mathbf{y}\mid k,\mathbf{x})$$

• The MMSE estimate of y is given by

$$E[\mathbf{y}\mid\mathbf{x}] = \int \mathbf{y}\,P(\mathbf{y}\mid\mathbf{x})\,d\mathbf{y}
= \sum_k P(k)\int \mathbf{y}\,P(\mathbf{y}\mid k,\mathbf{x})\,d\mathbf{y}
= \sum_k P(k)\,E[\mathbf{y}\mid k,\mathbf{x}]$$

• Just a weighted combination of the MMSE estimates from the component distributions

Page 66: MMSE estimates from a Gaussian mixture

• Let P(x, y) be a Gaussian mixture; with z = [x; y]:

$$P(\mathbf{z}) = P(\mathbf{x},\mathbf{y}) = \sum_k P(k)\,N(\mathbf{z};\,\boldsymbol{\mu}_k,\,\boldsymbol{\Sigma}_k)$$

• P(y|x) is also a Gaussian mixture:

$$P(\mathbf{y}\mid\mathbf{x}) = \frac{P(\mathbf{x},\mathbf{y})}{P(\mathbf{x})}
= \sum_k \frac{P(\mathbf{x},\mathbf{y},k)}{P(\mathbf{x})}
= \sum_k \frac{P(\mathbf{x}\mid k)\,P(k)\,P(\mathbf{y}\mid\mathbf{x},k)}{P(\mathbf{x})}
= \sum_k P(k\mid\mathbf{x})\,P(\mathbf{y}\mid k,\mathbf{x})$$

Page 67: MMSE estimates from a Gaussian mixture

• Let P(y|x) be a Gaussian mixture:

$$P(\mathbf{y}\mid\mathbf{x}) = \sum_k P(k\mid\mathbf{x})\,P(\mathbf{y}\mid k,\mathbf{x})$$

$$P(\mathbf{x},\mathbf{y}\mid k) =
N\!\left(\begin{bmatrix}\mathbf{x}\\\mathbf{y}\end{bmatrix};\,
\begin{bmatrix}\boldsymbol{\mu}_{k,x}\\\boldsymbol{\mu}_{k,y}\end{bmatrix},\,
\begin{bmatrix}\mathbf{C}_{k,xx} & \mathbf{C}_{k,xy}\\ \mathbf{C}_{k,yx} & \mathbf{C}_{k,yy}\end{bmatrix}\right)$$

$$P(\mathbf{y}\mid k,\mathbf{x}) =
N\!\left(\mathbf{y};\,
\boldsymbol{\mu}_{k,y} + \mathbf{C}_{k,yx}\mathbf{C}_{k,xx}^{-1}(\mathbf{x}-\boldsymbol{\mu}_{k,x}),\,
\boldsymbol{\Theta}_k\right)$$

$$P(\mathbf{y}\mid\mathbf{x}) = \sum_k P(k\mid\mathbf{x})\,
N\!\left(\mathbf{y};\,
\boldsymbol{\mu}_{k,y} + \mathbf{C}_{k,yx}\mathbf{C}_{k,xx}^{-1}(\mathbf{x}-\boldsymbol{\mu}_{k,x}),\,
\boldsymbol{\Theta}_k\right)$$

Page 68: MMSE estimates from a Gaussian mixture

• P(y|x) is a Gaussian mixture density, so E[y|x] is also a mixture:

$$E[\mathbf{y}\mid\mathbf{x}] = \sum_k P(k\mid\mathbf{x})\,E[\mathbf{y}\mid k,\mathbf{x}]
= \sum_k P(k\mid\mathbf{x})\left(\boldsymbol{\mu}_{k,y} + \mathbf{C}_{k,yx}\mathbf{C}_{k,xx}^{-1}(\mathbf{x}-\boldsymbol{\mu}_{k,x})\right)$$

Page 69: MMSE estimates from a Gaussian mixture

• A mixture of estimates from the individual Gaussians

Page 70: Voice Morphing

• Align training recordings from both speakers

– Cepstral vector sequence

• Learn a GMM on joint vectors

• Given speech from one speaker, find MMSE estimate of the other

• Synthesize from cepstra


Page 71: MMSE with GMM: Voice Transformation

• Festvox GMM transformation suite (Toda)

[Table: audio examples of transformations between speakers awb, bdl, jmk, and slt (rows: source speaker, columns: target speaker)]

Page 72: A problem with regressions

• The ML fit is sensitive
  – The error is squared
  – Small variations in the data cause large variations in the weights
  – Outliers affect it adversely
• Unstable
  – If the dimension of X >= the number of instances, (XXᵀ) is not invertible

$$\mathbf{A} = (\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{X}\mathbf{Y}^T$$

Page 73: MAP estimation of weights

• Assume the weights are drawn from a Gaussian
  – P(a) = N(0, σ²I)
• y = aᵀX + e
• Maximum likelihood estimate:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \log P(\mathbf{y}\mid\mathbf{X};\mathbf{a})$$

• Maximum a posteriori estimate:

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \log P(\mathbf{a}\mid\mathbf{y},\mathbf{X})
= \arg\max_{\mathbf{a}} \log P(\mathbf{y}\mid\mathbf{a},\mathbf{X})\,P(\mathbf{a})$$

Page 74: MAP estimation of weights

• Similar to ML estimate with an additional term

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}} \log P(\mathbf{y}\mid\mathbf{a},\mathbf{X})\,P(\mathbf{a})$$

• P(a) = N(0, σ²I), so log P(a) = C − log σ − 0.5 σ⁻² ‖a‖²

$$\log P(\mathbf{y}\mid\mathbf{X},\mathbf{a}) = C' - \frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{a}^T\mathbf{X})(\mathbf{y}-\mathbf{a}^T\mathbf{X})^T$$

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\left[
C' - \frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{a}^T\mathbf{X})(\mathbf{y}-\mathbf{a}^T\mathbf{X})^T
- 0.5\,\sigma^{-2}\,\mathbf{a}^T\mathbf{a} \right]$$

Page 75: MAP estimate of weights

• Equivalent to diagonal loading of the correlation matrix
  – Improves the condition number of the correlation matrix
    • Can be inverted with greater stability
  – Will not affect the estimation from well-conditioned data
  – Also called Tikhonov regularization
    • Dual form: ridge regression
• This is the MAP estimate of the weights
  – Not to be confused with the MAP estimate of Y

$$\frac{dL}{d\mathbf{a}} = 2\,\mathbf{X}\mathbf{X}^T\mathbf{a} - 2\,\mathbf{X}\mathbf{y}^T + 2\lambda\,\mathbf{a} = 0
\quad\Rightarrow\quad
\mathbf{a} = (\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{Y}^T$$
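A sketch comparing plain least-squares weights with the diagonally loaded (ridge / Tikhonov) weights on poorly conditioned data; the value of lam and the synthetic data are assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(8)
d, N = 5, 30
X = rng.normal(size=(d, N))
X[4] = X[3] + 1e-4 * rng.normal(size=N)      # nearly collinear rows -> ill-conditioned XX^T
y = rng.normal(size=N)

XXT = X @ X.T
lam = 1.0                                     # regularization weight (assumed)

a_ls    = np.linalg.solve(XXT, X @ y)                    # a = (X X^T)^{-1} X y^T
a_ridge = np.linalg.solve(XXT + lam * np.eye(d), X @ y)  # a = (X X^T + lam I)^{-1} X y^T

print("condition number of XX^T:", np.linalg.cond(XXT))
print("least squares:", np.round(a_ls, 2))
print("ridge (diagonally loaded):", np.round(a_ridge, 2))
```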

Page 76: MAP estimate priors

• Left: Gaussian prior on W
• Right: Laplacian prior

Page 77: MAP estimation of weights with Laplacian prior

• Assume the weights are drawn from a Laplacian
  – P(a) = λ⁻¹ exp(−λ⁻¹ ‖a‖₁)
• Maximum a posteriori estimate:
• No closed-form solution
  – A quadratic-programming solution is required
  – Non-trivial

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\left[ C' - (\mathbf{y}-\mathbf{a}^T\mathbf{X})(\mathbf{y}-\mathbf{a}^T\mathbf{X})^T - \lambda^{-1}\lVert\mathbf{a}\rVert_1 \right]$$

Page 78: MAP estimation of weights with Laplacian prior

• Assume the weights are drawn from a Laplacian
  – P(a) = λ⁻¹ exp(−λ⁻¹ ‖a‖₁)
• Maximum a posteriori estimate:
  – …
• Identical to L1-regularized least-squares estimation

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\left[ C' - (\mathbf{y}-\mathbf{a}^T\mathbf{X})(\mathbf{y}-\mathbf{a}^T\mathbf{X})^T - \lambda^{-1}\lVert\mathbf{a}\rVert_1 \right]$$

Page 79: L1-regularized LSE

• No closed-form solution
  – Quadratic-programming solutions required
• Dual formulation
• "LASSO": Least absolute shrinkage and selection operator

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\left[ C' - (\mathbf{y}-\mathbf{a}^T\mathbf{X})(\mathbf{y}-\mathbf{a}^T\mathbf{X})^T - \lambda^{-1}\lVert\mathbf{a}\rVert_1 \right]$$

$$\hat{\mathbf{a}} = \arg\max_{\mathbf{a}}\left[ C' - (\mathbf{y}-\mathbf{a}^T\mathbf{X})(\mathbf{y}-\mathbf{a}^T\mathbf{X})^T \right]
\quad\text{subject to}\quad \lVert\mathbf{a}\rVert_1 \le t$$

Page 80: LASSO Algorithms

• Various convex optimization algorithms

• LARS: Least angle regression

• Pathwise coordinate descent..

• Matlab code available from web

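The slide points to Matlab code; as an alternative sketch, scikit-learn's Lasso (a coordinate-descent implementation) solves the same L1-regularized least-squares problem. The data and the alpha value here are assumptions of mine:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
N, d = 50, 200                        # many more dimensions than observations
X = rng.normal(size=(N, d))           # note: scikit-learn expects rows = instances
a_true = np.zeros(d)
a_true[:5] = [3.0, -2.0, 1.5, 0.5, 4.0]   # only a few non-zero weights
y = X @ a_true + 0.1 * rng.normal(size=N)

model = Lasso(alpha=0.1, max_iter=10000).fit(X, y)   # L1-regularized least squares
print("non-zero weights found at:", np.flatnonzero(model.coef_))
```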

Page 81: Regularized least squares

• Regularization results in selection of a suboptimal (in the least-squares sense) solution
  – One of the loci outside the center
• Tikhonov regularization selects the shortest solution
• L1 regularization selects the sparsest solution

Image Credit: Tibshirani

Page 82: LASSO and Compressive Sensing

Y = X a

• Given Y and X, estimate sparse a
• LASSO:
  – X = explanatory variable
  – Y = dependent variable
  – a = weights of regression
• CS:
  – X = measurement matrix
  – Y = measurement
  – a = data

Page 83: An interesting problem: Predicting War!

• Economists measure a number of social indicators for countries weekly
  – Happiness index

– Hunger index

– Freedom index

– Twitter records

– …

• Question: Will there be a revolution or war next week?


Page 84: An interesting problem: Predicting War!

• Issues:
  – Dissatisfaction builds up; it is not an instantaneous phenomenon (usually)
  – War / rebellion builds up much faster, often in hours
• Important to predict
  – Preparedness for security
  – Economic impact


Page 85: Predicting War

• Given:
  – A sequence of economic indicators for each week
  – A sequence of unrest markers for each week
    • At the end of each week we know whether war happened that week or not
• Predict the probability of unrest next week
  – This could be a new unrest or the persistence of a current one

[Figure: timeline of weeks wk1..wk8, each with per-week variables W and S and observations O1..O8]

Page 86: Predicting Time Series

• Need time-series models

• HMMs – later in the course
