Parameter learning in Markov nets


Page 1: Parameter learning in  Markov nets


Parameter learning in Markov nets

Graphical Models – 10708

Carlos Guestrin

Carnegie Mellon University

November 17th, 2008

Readings: K&F: 19.1, 19.2, 19.3, 19.4


Page 2: Parameter learning in  Markov nets


Learning Parameters of a BN

Log likelihood decomposes:

Learn each CPT independently, using counts (see the sketch after the figure below)

[Figure: the student network BN — Coherence (C), Difficulty (D), Intelligence (I), Grade (G), SAT (S), Letter (L), Job (J), Happy (H)]
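Because the log-likelihood decomposes into one term per CPT, maximum-likelihood estimation reduces to normalized counts, θ(x|u) = Count(x, u) / Count(u). A minimal sketch of that computation; the data and structure formats (dicts of samples, a parent map) are illustrative choices, not something specified on the slides:

```python
from collections import Counter, defaultdict

def mle_cpts(data, parents):
    """Maximum-likelihood CPTs for a BN from fully observed data.

    data    : list of samples, each a dict {variable name: observed value}
    parents : dict {variable name: tuple of its parents' names}
    Returns : dict {(var, parent assignment): {value: probability}}
    """
    counts = defaultdict(Counter)                  # (var, u) -> Counter over x
    for sample in data:
        for var, pa in parents.items():
            u = tuple(sample[p] for p in pa)       # parent assignment u
            counts[(var, u)][sample[var]] += 1     # Count(x, u)

    cpts = {}
    for key, counter in counts.items():
        total = sum(counter.values())              # Count(u)
        cpts[key] = {x: n / total for x, n in counter.items()}
    return cpts

# Illustrative call on a student-network-like structure (edges shown only to
# make the call concrete; they are an assumption, not taken from the slides):
# parents = {"C": (), "D": ("C",), "I": (), "G": ("D", "I"), "S": ("I",),
#            "L": ("G",), "J": ("L", "S"), "H": ("G", "J")}
# cpts = mle_cpts(samples, parents)
```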

Page 3: Parameter learning in  Markov nets


Log Likelihood for MN

Log likelihood of the data:

[Figure: the student network variables — Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Job, Happy — drawn as a Markov network]
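The log-likelihood formula on this slide is an image in the original deck; for a Markov network with clique potentials φ_c and M i.i.d. samples, its standard form is:

```latex
\ell(\theta : \mathcal{D})
  = \sum_{m=1}^{M} \Big[ \sum_{c} \log \phi_c\big(\mathbf{x}^{(m)}[c]\big) \Big]
  - M \log Z(\theta),
\qquad
Z(\theta) = \sum_{\mathbf{x}} \prod_{c} \phi_c\big(\mathbf{x}[c]\big)
```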

Page 4: Parameter learning in  Markov nets


Log Likelihood doesn’t decompose for MNs

Log likelihood:

A concave problem: can find the global optimum!

The log Z term doesn’t decompose!


Page 5: Parameter learning in  Markov nets


Derivative of Log Likelihood for MNs


Page 6: Parameter learning in  Markov nets


Derivative of Log Likelihood for MNs 2


Page 7: Parameter learning in  Markov nets


Derivative of Log Likelihood for MNs

Derivative:
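The derivative itself did not survive the transcription; for the table parameterization of the potentials it has the standard form of a scaled difference between the empirical and model clique marginals:

```latex
\frac{\partial \ell(\theta : \mathcal{D})}{\partial \phi_c(\hat{\mathbf{x}}_c)}
  = \frac{\mathrm{Count}(\hat{\mathbf{x}}_c)}{\phi_c(\hat{\mathbf{x}}_c)}
    - M\,\frac{P_\theta(\hat{\mathbf{x}}_c)}{\phi_c(\hat{\mathbf{x}}_c)}
  = \frac{M}{\phi_c(\hat{\mathbf{x}}_c)}
    \left[ \hat{P}(\hat{\mathbf{x}}_c) - P_\theta(\hat{\mathbf{x}}_c) \right]
```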

Computing the derivative requires inference, to obtain the model's clique marginals under the current parameters.

Can optimize using gradient ascent. Common approaches: conjugate gradient, Newton’s method, …

Let’s also look at a simpler solution

Page 8: Parameter learning in  Markov nets


Iterative Proportional Fitting (IPF)


Setting derivative to zero:

Fixed point equation:
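The fixed-point update itself is an image in the original; in its standard form each potential entry is rescaled by the ratio of the empirical to the current model clique marginal:

```latex
\phi_c^{(t+1)}(\mathbf{x}_c)
  = \phi_c^{(t)}(\mathbf{x}_c)\,
    \frac{\hat{P}(\mathbf{x}_c)}{P_{\theta^{(t)}}(\mathbf{x}_c)}
```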

Iterate and converge to the optimal parameters. Each iteration must compute the model's clique marginals under the current parameters, which requires inference (a code sketch follows).
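A minimal, self-contained sketch of the IPF loop under the assumptions above; inference is done by brute-force enumeration purely for illustration (real implementations would use a junction tree or similar), and all data-structure choices here are mine, not the slides':

```python
import itertools
from collections import Counter

def ipf(potentials, domains, data, n_iters=50):
    """Iterative Proportional Fitting for a small, table-parameterized Markov net.

    potentials : dict mapping a clique (tuple of variable names) to a table
                 {clique assignment (tuple of values): positive potential value}
    domains    : dict {variable name: list of its possible values}
    data       : list of fully observed samples, each a dict {var: value}
    """
    variables = list(domains)
    M = len(data)

    def unnormalized(assignment):
        p = 1.0
        for clique, table in potentials.items():
            p *= table[tuple(assignment[v] for v in clique)]
        return p

    def model_marginal(clique):
        # P_theta(x_c): brute-force sum over all full assignments, then normalize.
        # This is the inference step IPF needs at every iteration.
        marg = Counter()
        for values in itertools.product(*(domains[v] for v in variables)):
            assignment = dict(zip(variables, values))
            marg[tuple(assignment[v] for v in clique)] += unnormalized(assignment)
        Z = sum(marg.values())
        return {xc: w / Z for xc, w in marg.items()}

    def empirical_marginal(clique):
        counts = Counter(tuple(sample[v] for v in clique) for sample in data)
        return {xc: c / M for xc, c in counts.items()}

    for _ in range(n_iters):
        for clique, table in potentials.items():
            p_hat = empirical_marginal(clique)
            p_model = model_marginal(clique)
            for xc in table:
                # phi_c <- phi_c * P_hat(x_c) / P_theta(x_c)
                table[xc] *= p_hat.get(xc, 0.0) / max(p_model[xc], 1e-12)
    return potentials
```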

Page 9: Parameter learning in  Markov nets


Log-linear Markov network (most common representation)

A feature φ[D] is some function of a subset of variables D, e.g., an indicator function.

A log-linear model over a Markov network H has:

a set of features φ1[D1], …, φk[Dk], where each Di is a subset of a clique in H (two φ’s can be over the same variables)

a set of weights w1, …, wk, usually learned from data
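The distribution the slide defines is given by a formula that did not transcribe; in standard log-linear form it is:

```latex
P(\mathbf{x}) = \frac{1}{Z(\mathbf{w})}
   \exp\Big\{ \sum_{i=1}^{k} w_i\, \phi_i\big(\mathbf{x}[D_i]\big) \Big\},
\qquad
Z(\mathbf{w}) = \sum_{\mathbf{x}}
   \exp\Big\{ \sum_{i=1}^{k} w_i\, \phi_i\big(\mathbf{x}[D_i]\big) \Big\}
```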

Page 10: Parameter learning in  Markov nets


Learning parameters for log-linear models – Gradient Ascent

Log-likelihood of data:
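The log-likelihood formula is missing from the transcript; for the log-linear model above and M samples it is:

```latex
\ell(\mathbf{w} : \mathcal{D})
  = \sum_{m=1}^{M} \sum_{i=1}^{k} w_i\, \phi_i\big(\mathbf{x}^{(m)}[D_i]\big)
  \;-\; M \log Z(\mathbf{w})
```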

Compute the derivative and optimize, usually with conjugate gradient ascent.

Page 11: Parameter learning in  Markov nets

Derivative of log-likelihood 1 – log-linear models


Page 12: Parameter learning in  Markov nets

Derivative of log-likelihood 2 – log-linear models


Page 13: Parameter learning in  Markov nets

Learning log-linear models with gradient ascent

Gradient:
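The gradient expression is an image in the original; for the log-linear log-likelihood above it is the difference between empirical and expected feature counts:

```latex
\frac{\partial \ell(\mathbf{w} : \mathcal{D})}{\partial w_i}
  = \sum_{m=1}^{M} \phi_i\big(\mathbf{x}^{(m)}[D_i]\big)
  \;-\; M\, \mathbb{E}_{P_\mathbf{w}}\big[\phi_i\big]
  = M \Big( \mathbb{E}_{\hat{P}}[\phi_i] - \mathbb{E}_{P_\mathbf{w}}[\phi_i] \Big)
```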

Requires one inference computation per gradient step

Theorem: w is the maximum likelihood solution iff the empirical feature expectations match the expected feature counts under P_w, i.e., E_P̂[φi] = E_Pw[φi] for every feature i

Usually, must regularize, e.g., with L2 regularization on the parameters (see the sketch below)
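A minimal sketch of this learning loop, with plain fixed-step gradient ascent standing in for the conjugate-gradient optimizer mentioned above; `expected_features` is an assumed user-supplied routine that performs the inference computation E_Pw[φi] (the slides do not specify how that inference is done):

```python
import numpy as np

def fit_log_linear(sample_features, expected_features, step=0.01, l2=1.0, n_iters=500):
    """Gradient ascent on the L2-regularized log-likelihood of a log-linear MN.

    sample_features   : array of shape (M, k); entry [m, i] is phi_i(x^(m))
    expected_features : callable w -> length-k array of E_{P_w}[phi_i];
                        each call hides one inference computation
    """
    M, k = sample_features.shape
    w = np.zeros(k)
    empirical = sample_features.sum(axis=0)            # sum_m phi_i(x^(m))
    for _ in range(n_iters):
        # gradient of the L2-penalized log-likelihood
        grad = empirical - M * expected_features(w) - l2 * w
        w += step * grad                               # plain gradient ascent step
    return w
```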


Page 14: Parameter learning in  Markov nets


What you need to know about learning MN parameters?

BN parameter learning is easy; MN parameter learning doesn’t decompose!

Learning requires inference!

Apply gradient ascent or IPF iterations to obtain optimal parameters

Page 15: Parameter learning in  Markov nets


Conditional Random Fields

Graphical Models – 10708

Carlos Guestrin

Carnegie Mellon University

November 17th, 2008

Readings: K&F: 19.3.2


Page 16: Parameter learning in  Markov nets

Generative v. Discriminative classifiers – A review

Want to learn h: X → Y, where X are the features and Y the target classes.

Bayes optimal classifier: P(Y|X). Generative classifier, e.g., Naïve Bayes:

Assume some functional form for P(X|Y) and P(Y). Estimate the parameters of P(X|Y) and P(Y) directly from training data. Use Bayes rule to calculate P(Y|X = x). This is a ‘generative’ model.

Indirect computation of P(Y|X) through Bayes rule, but can generate a sample of the data, since P(X) = ∑y P(y) P(X|y).

Discriminative classifiers, e.g., logistic regression: assume some functional form for P(Y|X) and estimate the parameters of P(Y|X) directly from training data. This is the ‘discriminative’ model.

Directly learn P(Y|X), but cannot obtain a sample of the data, because P(X) is not available.


Page 17: Parameter learning in  Markov nets


Log-linear CRFs (most common representation)

Graph H: only over the hidden variables Y1, …, Yn

No assumptions about the dependency on the observed variables X; you must always observe all of X.

A feature φ[D] is some function of a subset of variables D, e.g., an indicator function.

A log-linear model over a CRF H has:

a set of features φ1[D1], …, φk[Dk], where each Di is a subset of a clique in H (two φ’s can be over the same variables)

a set of weights w1, …, wk, usually learned from data
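The slide’s defining formula is an image in the original; the usual log-linear CRF form, whose partition function depends on the observed x, is:

```latex
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x}, \mathbf{w})}
   \exp\Big\{ \sum_{i=1}^{k} w_i\, \phi_i\big(\mathbf{x}, \mathbf{y}[D_i]\big) \Big\},
\qquad
Z(\mathbf{x}, \mathbf{w}) = \sum_{\mathbf{y}}
   \exp\Big\{ \sum_{i=1}^{k} w_i\, \phi_i\big(\mathbf{x}, \mathbf{y}[D_i]\big) \Big\}
```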

Page 18: Parameter learning in  Markov nets


Learning parameters for log-linear CRFs – Gradient Ascent

Log-likelihood of data:
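The conditional log-likelihood is likewise missing from the transcript; for M labeled pairs (x^(m), y^(m)), note the per-example partition function:

```latex
\ell(\mathbf{w} : \mathcal{D})
  = \sum_{m=1}^{M} \Big[ \sum_{i=1}^{k} w_i\, \phi_i\big(\mathbf{x}^{(m)}, \mathbf{y}^{(m)}[D_i]\big)
  - \log Z\big(\mathbf{x}^{(m)}, \mathbf{w}\big) \Big]
```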

Compute the derivative and optimize, usually with conjugate gradient ascent.

Page 19: Parameter learning in  Markov nets

Learning log-linear CRFs with gradient ascent

Gradient:
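The gradient is an image in the original; for the conditional log-likelihood above it is the per-example difference between observed and expected feature values, which is why inference is needed for each data point:

```latex
\frac{\partial \ell(\mathbf{w} : \mathcal{D})}{\partial w_i}
  = \sum_{m=1}^{M} \Big[ \phi_i\big(\mathbf{x}^{(m)}, \mathbf{y}^{(m)}\big)
  - \mathbb{E}_{P_\mathbf{w}(\mathbf{Y} \mid \mathbf{x}^{(m)})}
      \big[ \phi_i\big(\mathbf{x}^{(m)}, \mathbf{Y}\big) \big] \Big]
```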

Requires one inference computation per data point, per gradient step

Usually, must regularize, e.g., with L2 regularization on the parameters.


Page 20: Parameter learning in  Markov nets


What you need to know about CRFs

Discriminative learning of graphical models: fewer assumptions about the distribution, and often performs better than a “similar” MN.

Gradient computation requires inference per data point; this can be really slow!

Page 21: Parameter learning in  Markov nets


EM for BNs

Graphical Models – 10708

Carlos Guestrin

Carnegie Mellon University

November 17th, 2008

Readings: 18.1, 18.2


Page 22: Parameter learning in  Markov nets


Thus far, fully supervised learning

We have assumed fully supervised learning:

Many real problems have missing data:

Page 23: Parameter learning in  Markov nets


The general learning problem with missing data

Marginal likelihood – x is observed, z is missing:
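The marginal-likelihood expression is missing from the transcript; for m data points it is the standard:

```latex
\ell(\theta : \mathcal{D})
  = \log \prod_{j=1}^{m} P\big(\mathbf{x}^{(j)} \mid \theta\big)
  = \sum_{j=1}^{m} \log \sum_{\mathbf{z}} P\big(\mathbf{x}^{(j)}, \mathbf{z} \mid \theta\big)
```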

Page 24: Parameter learning in  Markov nets


E-step

x is observed, z is missing. Compute the probability of the missing data given the current choice of θ(t):

Q(t+1)(z|x(j)) for each x(j); e.g., the probability computed here corresponds to the “classification step” in k-means.
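The E-step formula itself did not transcribe; in its standard form it sets Q to the posterior over the missing variables under the current parameters:

```latex
Q^{(t+1)}\big(\mathbf{z} \mid \mathbf{x}^{(j)}\big)
  = P\big(\mathbf{z} \mid \mathbf{x}^{(j)}, \theta^{(t)}\big)
```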

Page 25: Parameter learning in  Markov nets


Jensen’s inequality

Theorem: log ∑z P(z) f(z) ≥ ∑z P(z) log f(z)

Page 26: Parameter learning in  Markov nets


Applying Jensen’s inequality

Use: log ∑z P(z) f(z) ≥ ∑z P(z) log f(z)
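The worked application on this slide is missing; multiplying and dividing by Q(z|x(j)) inside the sum, and then applying the inequality with f(z) = P(x(j), z|θ) / Q(z|x(j)), gives the standard bound:

```latex
\log P\big(\mathbf{x}^{(j)} \mid \theta\big)
  = \log \sum_{\mathbf{z}} Q\big(\mathbf{z} \mid \mathbf{x}^{(j)}\big)
      \frac{P\big(\mathbf{x}^{(j)}, \mathbf{z} \mid \theta\big)}{Q\big(\mathbf{z} \mid \mathbf{x}^{(j)}\big)}
  \;\ge\; \sum_{\mathbf{z}} Q\big(\mathbf{z} \mid \mathbf{x}^{(j)}\big)
      \log \frac{P\big(\mathbf{x}^{(j)}, \mathbf{z} \mid \theta\big)}{Q\big(\mathbf{z} \mid \mathbf{x}^{(j)}\big)}
```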

Page 27: Parameter learning in  Markov nets


The M-step maximizes lower bound on weighted data

Lower bound from Jensen’s:
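The bound on this slide is again an image; it is the Jensen bound from the previous slide, summed over data points, with the entropy of Q separated out (the entropy term does not depend on θ, which is why maximizing the bound reduces to maximizing a weighted log-likelihood):

```latex
\ell(\theta : \mathcal{D})
  \;\ge\; \sum_{j} \sum_{\mathbf{z}}
     Q^{(t+1)}\big(\mathbf{z} \mid \mathbf{x}^{(j)}\big)
     \log P\big(\mathbf{x}^{(j)}, \mathbf{z} \mid \theta\big)
  \;+\; \sum_{j} H\big(Q^{(t+1)}(\cdot \mid \mathbf{x}^{(j)})\big)
```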

Corresponds to a weighted dataset:

<x(1), z=1> with weight Q(t+1)(z=1|x(1))
<x(1), z=2> with weight Q(t+1)(z=2|x(1))
<x(1), z=3> with weight Q(t+1)(z=3|x(1))
<x(2), z=1> with weight Q(t+1)(z=1|x(2))
<x(2), z=2> with weight Q(t+1)(z=2|x(2))
<x(2), z=3> with weight Q(t+1)(z=3|x(2))
…

Page 28: Parameter learning in  Markov nets


The M-step

Maximization step:
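The maximization itself did not transcribe; it optimizes the weighted log-likelihood from the previous slide:

```latex
\theta^{(t+1)}
  = \arg\max_{\theta} \sum_{j} \sum_{\mathbf{z}}
      Q^{(t+1)}\big(\mathbf{z} \mid \mathbf{x}^{(j)}\big)
      \log P\big(\mathbf{x}^{(j)}, \mathbf{z} \mid \theta\big)
```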

Use expected counts instead of counts: if learning requires Count(x,z), use EQ(t+1)[Count(x,z)] (a sketch follows).
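A minimal sketch of the expected-count bookkeeping for a single discrete hidden variable z, assuming the E-step has already produced `q[j][z] = Q(t+1)(z|x(j))`; the names and data layout are illustrative, not from the slides:

```python
from collections import defaultdict

def expected_counts(data, q, z_values):
    """Soft (expected) counts E_Q[Count(x, z)] for the EM M-step.

    data     : list of observed samples x^(j) (hashable, e.g. tuples)
    q        : list of dicts, q[j][z] = Q(z | x^(j)) from the E-step
    z_values : possible values of the hidden variable z
    Each sample contributes fractionally to every (x, z) cell, weighted by Q.
    """
    counts = defaultdict(float)
    for x_j, q_j in zip(data, q):
        for z in z_values:
            counts[(x_j, z)] += q_j.get(z, 0.0)   # soft count instead of a hard 0/1
    return counts
```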

Page 29: Parameter learning in  Markov nets


Convergence of EM

Define the potential function F(θ, Q):
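The definition is missing from the transcript; the standard choice is the same Jensen lower bound, viewed as a function of both θ and Q:

```latex
F(\theta, Q)
  = \sum_{j} \sum_{\mathbf{z}}
      Q\big(\mathbf{z} \mid \mathbf{x}^{(j)}\big)
      \log \frac{P\big(\mathbf{x}^{(j)}, \mathbf{z} \mid \theta\big)}
                {Q\big(\mathbf{z} \mid \mathbf{x}^{(j)}\big)}
```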

EM corresponds to coordinate ascent on F; thus, it maximizes a lower bound on the marginal log likelihood (as seen in the Machine Learning class last semester).