
Scoring Functions Minimum Description Length (MDL) Bayesian Dirichlet Score Family Wrap-up

Scoring Functions for Learning Bayesian Networks

Brandon Malone

Much of this material is adapted from Suzuki 1993, Lam and Bacchus 1994, and Heckerman 1998

Many of the images were taken from the Internet

February 13, 2014

Brandon Malone Scoring Functions for Learning Bayesian Networks


Scoring Functions for Learning Bayesian Networks

Suppose we have two Bayesian network structures, N1 and N2.

[Figure: two candidate network structures over the same five variables: Winter? (A), Sprinkler? (B), Rain? (C), Wet Grass? (D), Slippery Road? (E).]

Which structure best explains a dataset D?

We will use scoring functions to rate each network. The one with the best score is “better.”


1 Scoring Functions

2 Minimum Description Length (MDL)

3 Bayesian Dirichlet Score Family

4 Wrap-up


Why do we want to learn structures?

Knowledge discovery (“interpretation”)

Density estimation (“prediction”)


Assumptions (generally)

Multinomial samples

Complete data

Parameter independence

Global

Local

Parameter modularity


Under and overfitting

Underfitting, too simple

Overfitting, too complex

Tradeoff, “just right”

What does it mean in Bayesian networks?


Minimum description length (MDL)

MDL views learning as data compression.

Traditionally, MDL consists of two components.

Model encoding

Data encoding, using the model

A few properties

Formalizes Occam’s Razor

Works regardless of a “true” model


Avoiding overfitting with MDL

Underfitting: short model encoding, long data encoding

Overfitting: long model encoding, short data encoding

Tradeoff: medium model encoding, medium data encoding

We will favor models which do not use too many bits to encode either the model or the data.


Encoding a Bayesian network

We must encode:

Parents of each node

We need $\log_2 n$ bits for each parent.

Conditional probability parameters

We need $(r_i - 1) \cdot q_i$ parameters for $X_i$.

We need $\frac{\log_2 N}{2}$ bits per parameter.

The total complexity is as follows.

$$\sum_{i=1}^{n} \left[ \log n \cdot |PA_i| + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right]$$

Other encodings are possible.
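The model-encoding cost above can be sketched in a few lines. This is only an illustration: the function name `model_encoding_bits`, the list-based representation (`arities[i]` for r_i, `parents[i]` for the parent indices of X_i), the choice of base-2 logarithms, and the four-variable example network are all assumptions, not part of the original slides.

```python
import math

def model_encoding_bits(n_samples, arities, parents):
    """Bits needed for the parent lists plus the conditional probability parameters.

    arities[i] is r_i; parents[i] lists the parent indices of X_i.
    Base-2 logs are an assumption made for this sketch.
    """
    n = len(arities)
    total = 0.0
    for i, pa in enumerate(parents):
        # q_i: number of parent configurations (1 when X_i has no parents)
        q_i = math.prod(arities[p] for p in pa)
        structure_bits = math.log2(n) * len(pa)
        parameter_bits = 0.5 * math.log2(n_samples) * (arities[i] - 1) * q_i
        total += structure_bits + parameter_bits
    return total

# Hypothetical network: four binary variables, X0 -> X1, X0 -> X2, {X1, X2} -> X3.
bits = model_encoding_bits(1024, [2, 2, 2, 2], [[], [0], [0], [1, 2]])
```

With N = 1024 samples each parameter costs 5 bits and each parent costs 2 bits, so the four families cost 5 + 12 + 12 + 24 = 53 bits in total.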


Encoding data with a Bayesian network

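Each complete instantiation $D_l$ is assigned a binary string (codeword) with length $\approx -\log p_l$. We can approximate this value using the counts from the data.

$$
\begin{aligned}
p_l &= P(D_l \mid D, \mathcal{N}) \\
&= \prod_{i=1}^{n} \theta_{ijk:l} && \text{chain rule of BNs} \\
&= \prod_{i=1}^{n} \frac{N_{ijk:l}}{N_{ij:l}} && \text{using MLE parameters}
\end{aligned}
$$

Here the subscript $ijk{:}l$ picks out the value $k$ of $X_i$ and the configuration $j$ of its parents that appear in $D_l$.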


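$$
\begin{aligned}
\mathrm{len}(D : \mathcal{N}) &= \sum_{l=1}^{N} \mathrm{len}(D_l : \mathcal{N}) \\
&= \sum_{l=1}^{N} -\log \prod_{i=1}^{n} \frac{N_{ijk:l}}{N_{ij:l}} \\
&= -\sum_{l=1}^{N} \log \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \frac{N_{ijk:l}}{N_{ij:l}} \\
&= -\sum_{l=1}^{N} \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \log \frac{N_{ijk:l}}{N_{ij:l}} \\
&= -\sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \cdot \log \frac{N_{ijk}}{N_{ij}}
\end{aligned}
$$

In the products and sums over $j$ and $k$, only the pair that matches $D_l$ contributes for each $i$, so summing over the $N$ instances counts each $(i, j, k)$ term $N_{ijk}$ times.

This is the negative log-likelihood, $-\ell$, of the data using the MLE parameters.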
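As a rough illustration of the final line of this derivation, the data-encoding length can be computed directly from counts. The function name `data_encoding_length`, the row/parent-list representation, the use of natural logarithms, and the toy dataset are assumptions made for this sketch.

```python
import math
from collections import Counter

def data_encoding_length(data, parents):
    """len(D : N) = -sum_ijk N_ijk * log(N_ijk / N_ij), in nats.

    data is a list of complete rows; parents[i] lists the parent indices of X_i.
    """
    total = 0.0
    for i, pa in enumerate(parents):
        # N_ijk: count of (parent configuration j, value k); N_ij: count of j alone
        n_ijk = Counter((tuple(row[p] for p in pa), row[i]) for row in data)
        n_ij = Counter(tuple(row[p] for p in pa) for row in data)
        # Only observed cells appear in the Counter, so log(0) never occurs.
        for (j, _k), count in n_ijk.items():
            total -= count * math.log(count / n_ij[j])
    return total

data = [(0, 0), (0, 1), (1, 1), (1, 1)]               # hypothetical rows of (X0, X1)
with_edge = data_encoding_length(data, [[], [0]])     # X0 -> X1
no_edge = data_encoding_length(data, [[], []])        # no edges
```

On this toy data the structure with the edge needs 6 log 2 nats, fewer than the empty structure, which is the "short data encoding" side of the tradeoff: adding parents never lengthens the data encoding under MLE parameters.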


MDL as a scoring function

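As derived here, the MDL score for a network $\mathcal{N}$ given a dataset $D$ is as follows.

$$\mathrm{MDL}(\mathcal{N} : D) = \sum_{i=1}^{n} \left[ -\sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} + \log n \cdot |PA_i| + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right]$$

As the dataset size $N$ grows, the $\log n \cdot |PA_i|$ term, which does not depend on $N$, becomes negligible relative to the other terms, so the most commonly used version of MDL drops it.

$$\mathrm{MDL}(\mathcal{N} : D) = \sum_{i=1}^{n} \left[ -\sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right]$$

Writing $\ell(X_i \mid PA_i) = \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}}$ for the log-likelihood of one variable given its parents, this becomes

$$\mathrm{MDL}(\mathcal{N} : D) = \sum_{i=1}^{n} \left[ -\ell(X_i \mid PA_i) + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right]$$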
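A minimal sketch of both versions of the score follows. The function name `mdl_score`, the `include_structure_bits` flag, the natural logarithms, and the toy data are assumptions for illustration. Lower scores are better here, since the score is a description length.

```python
import math
from collections import Counter

def mdl_score(data, arities, parents, include_structure_bits=False):
    """MDL(N : D) summed over families; natural logs are an assumption."""
    n_vars, n_samples = len(arities), len(data)
    score = 0.0
    for i, pa in enumerate(parents):
        q_i = math.prod(arities[p] for p in pa)
        n_ijk = Counter((tuple(row[p] for p in pa), row[i]) for row in data)
        n_ij = Counter(tuple(row[p] for p in pa) for row in data)
        # l(X_i | PA_i): log-likelihood of this family under MLE parameters
        log_lik = sum(c * math.log(c / n_ij[j]) for (j, _k), c in n_ijk.items())
        score += -log_lik + 0.5 * math.log(n_samples) * (arities[i] - 1) * q_i
        if include_structure_bits:
            # the log n * |PA_i| term that the common version drops
            score += math.log(n_vars) * len(pa)
    return score

data = [(0, 0), (0, 1), (1, 1), (1, 1)]   # hypothetical rows of (X0, X1)
arities = [2, 2]
common = mdl_score(data, arities, [[], [0]])
full = mdl_score(data, arities, [[], [0]], include_structure_bits=True)
```

The two versions differ only by the parent-encoding term, which here is a single log 2 for the one edge.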


Bayesian Dirichlet (BD) Score Family

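Suppose we would like to maximize the joint probability of the data $D$ and network $\mathcal{N}$.

$$
\begin{aligned}
\mathcal{N}^* &= \arg\max_{\mathcal{N}} P(D, \mathcal{N}) \\
&= \arg\max_{\mathcal{N}} P(D \mid \mathcal{N}) P(\mathcal{N})
\end{aligned}
$$

We again have two parts.

Evaluation of the model, $P(\mathcal{N})$

Evaluation of data given the model, $P(D \mid \mathcal{N})$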


Data given the model

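We need to evaluate $P(D \mid \mathcal{N})$. We derived $P(x_l \mid D, \mathcal{N})$ for parameter estimation.

$$P(x_l \mid D, \mathcal{N}) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \frac{\alpha_{ijk:l} + n_{ijk:l}}{\sum_{k'} (\alpha_{ijk':l} + n_{ijk':l})}$$

Because the samples are iid, we can evaluate $P(D \mid \mathcal{N})$ by taking the product over instances, with the counts $n_{ijk:l}$ accumulated from the instances that precede $x_l$.

$$
\begin{aligned}
P(D \mid \mathcal{N}) &= \prod_{l=1}^{N} \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \frac{\alpha_{ijk:l} + n_{ijk:l}}{\sum_{k'} (\alpha_{ijk':l} + n_{ijk':l})} \\
&= \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})}
\end{aligned}
$$

Here $\alpha_{ij} = \sum_k \alpha_{ijk}$ and $n_{ij} = \sum_k n_{ijk}$.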
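The step from the per-instance product to the Gamma-function form can be checked numerically for a single variable and a single parent configuration (one fixed i and j). The function names, the hyperparameter values, and the sample sequence below are assumptions for illustration only.

```python
import math

def sequential_log_prob(sequence, alpha):
    """Sum of log predictive probabilities, updating the counts after each sample."""
    counts = [0] * len(alpha)
    log_p = 0.0
    for k in sequence:
        log_p += math.log((alpha[k] + counts[k]) / (sum(alpha) + sum(counts)))
        counts[k] += 1
    return log_p

def closed_form_log_prob(sequence, alpha):
    """log of Gamma(a_ij)/Gamma(a_ij + n_ij) * prod_k Gamma(a_ijk + n_ijk)/Gamma(a_ijk)."""
    counts = [sequence.count(k) for k in range(len(alpha))]
    log_p = math.lgamma(sum(alpha)) - math.lgamma(sum(alpha) + sum(counts))
    for a, c in zip(alpha, counts):
        log_p += math.lgamma(a + c) - math.lgamma(a)
    return log_p

sequence = [0, 1, 1, 2, 1, 0]      # hypothetical observed values k of one variable
alpha = [0.5, 1.0, 2.0]            # hypothetical hyperparameters alpha_ijk
sequential = sequential_log_prob(sequence, alpha)
closed = closed_form_log_prob(sequence, alpha)
```

The two values agree up to floating-point error, and the sequential value does not depend on the order of the samples, which is why the closed form needs only the final counts.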


Score probability, P(D,N )

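We are interested in the joint probability of $D$ and $\mathcal{N}$.

$$
\begin{aligned}
P(D, \mathcal{N}) &= P(\mathcal{N}) P(D \mid \mathcal{N}) \\
&= P(\mathcal{N}) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})}
\end{aligned}
$$

This is called the BD scoring function. If we set $\alpha_{ijk} = 1$, then it is called the K2 metric.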


Some desirable equivalences

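Say $\mathcal{N}_1$ and $\mathcal{N}_2$ are Markov equivalent.

Prior probabilities and equivalence

$P(\mathcal{N}_1) = P(\mathcal{N}_2)$

Likelihood probabilities and equivalence (not guaranteed by BD)

$P(D \mid \mathcal{N}_1) = P(D \mid \mathcal{N}_2)$

Score probabilities and equivalence

$P(D, \mathcal{N}_1) = P(D, \mathcal{N}_2)$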
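A small numeric check of the "not guaranteed by BD" remark, using the K2 choice of all hyperparameters equal to 1. The function name `log_k2`, the data representation, and the three-row toy dataset are assumptions for this sketch; the two structures X → Y and Y → X are Markov equivalent.

```python
import math
from collections import Counter

def log_k2(data, arities, parents):
    """log P(D | N) in the BD closed form with every alpha_ijk set to 1 (K2)."""
    score = 0.0
    for i, pa in enumerate(parents):
        a_ijk = 1.0
        a_ij = arities[i] * a_ijk  # alpha_ij = sum_k alpha_ijk
        n_ijk = Counter((tuple(row[p] for p in pa), row[i]) for row in data)
        n_ij = Counter(tuple(row[p] for p in pa) for row in data)
        # Unobserved (j, k) cells contribute log(1) = 0, so only observed ones are visited.
        score += sum(math.lgamma(a_ij) - math.lgamma(a_ij + n) for n in n_ij.values())
        score += sum(math.lgamma(a_ijk + n) - math.lgamma(a_ijk) for n in n_ijk.values())
    return score

data = [(0, 0), (0, 0), (0, 1)]   # hypothetical rows of (X, Y)
x_to_y = [[], [0]]                # X -> Y
y_to_x = [[1], []]                # Y -> X

score_xy = log_k2(data, [2, 2], x_to_y)
score_yx = log_k2(data, [2, 2], y_to_x)
```

On this data the likelihood is 1/48 for X → Y and 1/72 for Y → X, so the two equivalent structures receive different K2 scores.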


BDe and BDeu

We can restrict the hyperparameters to ensure likelihood equivalence. This is BDe.

$$\alpha_{ijk} = \alpha \cdot P(X_i = k, PA_i = j \mid \mathcal{N})$$

Typically, uninformative hyperparameters are used. This is BDeu.

$$\alpha_{ijk} = \frac{\alpha}{r_i \cdot q_i}$$


The BDeu scoring function

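We can incorporate our assumptions to derive the BDeu scoring function.

$$
\begin{aligned}
P(D, \mathcal{N}) &= P(\mathcal{N}) P(D \mid \mathcal{N}) && \text{rewrite using chain rule} \\
&= P(\mathcal{N}) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} && \text{substitute probability of data} \\
&\propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} && \text{assume a uniform structure prior} \\
&\propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\frac{\alpha}{q_i})}{\Gamma(\frac{\alpha}{q_i} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\frac{\alpha}{r_i \cdot q_i} + n_{ijk})}{\Gamma(\frac{\alpha}{r_i \cdot q_i})} && \text{replace the } \alpha\text{s} \\
\mathrm{BDeu}(\mathcal{N} : D, \alpha) &= \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left[ \log \frac{\Gamma(\frac{\alpha}{q_i})}{\Gamma(\frac{\alpha}{q_i} + n_{ij})} + \sum_{k=1}^{r_i} \log \frac{\Gamma(\frac{\alpha}{r_i \cdot q_i} + n_{ijk})}{\Gamma(\frac{\alpha}{r_i \cdot q_i})} \right] && \text{work in log-space} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left[ \log \Gamma\!\left(\frac{\alpha}{q_i}\right) - \log \Gamma\!\left(\frac{\alpha}{q_i} + n_{ij}\right) + \sum_{k=1}^{r_i} \left( \log \Gamma\!\left(\frac{\alpha}{r_i \cdot q_i} + n_{ijk}\right) - \log \Gamma\!\left(\frac{\alpha}{r_i \cdot q_i}\right) \right) \right] && \text{remove divisions}
\end{aligned}
$$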
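The last line of the derivation translates almost directly into code. The sketch below is one possible transcription, not a reference implementation: the function name `bdeu_score`, the assumption that values are coded as integers 0 to r_i − 1, the default α = 1, and the toy data are all choices made here for illustration.

```python
import math
from collections import Counter
from itertools import product

def bdeu_score(data, arities, parents, alpha=1.0):
    """BDeu(N : D, alpha) in log space; values are assumed coded 0..r_i-1."""
    score = 0.0
    for i, pa in enumerate(parents):
        r_i = arities[i]
        q_i = math.prod(arities[p] for p in pa)
        a_ij, a_ijk = alpha / q_i, alpha / (r_i * q_i)
        n_ijk = Counter((tuple(row[p] for p in pa), row[i]) for row in data)
        # Loop over all q_i parent configurations (a single empty tuple if no parents).
        for j in product(*(range(arities[p]) for p in pa)):
            n_ij = sum(n_ijk[(j, k)] for k in range(r_i))
            score += math.lgamma(a_ij) - math.lgamma(a_ij + n_ij)
            for k in range(r_i):
                score += math.lgamma(a_ijk + n_ijk[(j, k)]) - math.lgamma(a_ijk)
    return score

data = [(0, 0), (0, 0), (0, 1)]                    # hypothetical rows of (X, Y)
score_xy = bdeu_score(data, [2, 2], [[], [0]])     # X -> Y
score_yx = bdeu_score(data, [2, 2], [[1], []])     # Y -> X
```

With α = 1 both Markov-equivalent structures score log(5/384) on this data, which illustrates the likelihood equivalence that the BDeu hyperparameters are meant to provide.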


Decomposability

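Both MDL and BD are decomposable: a sum over terms which involve only a variable and its parents.

$$\mathrm{MDL}(\mathcal{N} : D) = \sum_{i=1}^{n} \left\{ -\ell(X_i \mid PA_i) + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right\}$$

$$\mathrm{BDeu}(\mathcal{N} : D, \alpha) = \sum_{i=1}^{n} \left\{ \sum_{j=1}^{q_i} \left[ \log \Gamma\!\left(\frac{\alpha}{q_i}\right) - \log \Gamma\!\left(\frac{\alpha}{q_i} + n_{ij}\right) + \sum_{k=1}^{r_i} \left( \log \Gamma\!\left(\frac{\alpha}{r_i \cdot q_i} + n_{ijk}\right) - \log \Gamma\!\left(\frac{\alpha}{r_i \cdot q_i}\right) \right) \right] \right\}$$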

What does it mean when we evaluate different structures?

[Figure: the two candidate network structures over Winter? (A), Sprinkler? (B), Rain? (C), Wet Grass? (D), Slippery Road? (E).]
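One practical consequence of decomposability can be sketched as follows: when two structures differ in the parent set of a single variable, their score difference is the difference of that one local term. The function names `local_mdl` and `network_mdl`, the three-variable structures, and the toy data are hypothetical and used only for illustration.

```python
import math
from collections import Counter

def local_mdl(data, arities, child, pa):
    """One term of the decomposed MDL sum: -l(X_i | PA_i) + (log N / 2)(r_i - 1) q_i."""
    q_i = math.prod(arities[p] for p in pa)
    n_ijk = Counter((tuple(row[p] for p in pa), row[child]) for row in data)
    n_ij = Counter(tuple(row[p] for p in pa) for row in data)
    log_lik = sum(c * math.log(c / n_ij[j]) for (j, _k), c in n_ijk.items())
    return -log_lik + 0.5 * math.log(len(data)) * (arities[child] - 1) * q_i

def network_mdl(data, arities, parents):
    """Total score: the sum of the local terms, one per variable."""
    return sum(local_mdl(data, arities, i, pa) for i, pa in enumerate(parents))

data = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0)]  # hypothetical rows
arities = [2, 2, 2]
n1 = [(), (0,), (1,)]      # X0 -> X1 -> X2
n2 = [(), (0,), (0, 1)]    # same, plus X0 -> X2

# Only the family of X2 differs, so only its local term needs to be recomputed.
delta = local_mdl(data, arities, 2, n2[2]) - local_mdl(data, arities, 2, n1[2])
```

A search procedure can therefore cache local scores per (variable, parent set) pair and reuse them across candidate structures rather than rescoring whole networks.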


Limitations of scoring functions

Parameter independence is violated if data is missing.

Experimental data is different from observational data.

(MDL) When do we use asymptotics?

(BD) How do we specify α and P(N )?


Recap

During this part of the course, we have discussed:

Overfitting

Minimum description length scoring function for BNs

BD family of scores for BNs


Next in probabilistic models

We will discuss two strategies for learning Bayesian network structures.

A greedy hill climbing algorithm which finds local optima

A dynamic programming algorithm which is guaranteed to find an optimal network
