

Scoring Functions for Learning Bayesian Networks

Brandon Malone

Much of this material is adapted from Suzuki 1993, Lam and Bacchus 1994, and Heckerman 1998

Many of the images were taken from the Internet

February 13, 2014



Scoring Functions for Learning Bayesian Networks

Suppose we have two Bayesian network structures, N1 and N2.

[Figure: two candidate network structures over the same five variables: Winter? (A), Sprinkler? (B), Rain? (C), Wet Grass? (D), Slippery Road? (E).]

Which structure best explains a dataset D?

We will use scoring functions to rate each network. The one with the best score is “better.”



1 Scoring Functions

2 Minimum Description Length (MDL)

3 Bayesian Dirichlet Score Family

4 Wrap-up



Why do we want to learn structures?

Knowledge discovery (“interpretation”)

Density estimation (“prediction”)



Assumptions (generally)

Multinomial samples

Complete data

Parameter independence (global and local)

Parameter modularity



Under and overfitting

Underfitting: too simple

Overfitting: too complex

Tradeoff: “just right”

What does it mean in Bayesian networks?



Minimum description length (MDL)

MDL views learning as data compression.

Traditionally, MDL consists of two components.

Model encoding

Data encoding, using the model

A few properties

Formalizes Occam’s Razor

Works regardless of a “true” model



Avoiding overfitting with MDL

Short model encoding, long data encoding

Long model encoding, short data encoding

Medium model encoding, medium data encoding

We will favor models which do not use too many bits to encode either the model or the data.



Encoding a Bayesian network

We must encode:

Parents of each node

We need log2 n bits for each parent.

Conditional probability parameters

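We need $(r_i - 1) \cdot q_i$ parameters for $X_i$, where $r_i$ is the number of values of $X_i$ and $q_i$ is the number of configurations of its parents.

We need $\frac{\log_2 N}{2}$ bits per parameter, where $N$ is the number of records in the dataset.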

The total complexity is as follows.

$$\sum_{i=1}^{n} \left[ \log n \cdot |PA_i| + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right]$$

Other encodings are possible.
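The sum above can be evaluated directly once the structure is known. The following is a minimal sketch, assuming base-2 logarithms throughout; the function name, its arguments, and the three-variable chain are hypothetical and only serve as an illustration.

```python
import math

def model_encoding_bits(n_vars, num_parents, arities, parent_configs, n_records):
    """Bits needed to encode the structure and parameters of a network.

    num_parents[i]    : |PA_i|, the number of parents of X_i
    arities[i]        : r_i, the number of values of X_i
    parent_configs[i] : q_i, the number of joint parent configurations
    """
    total = 0.0
    for pa, r, q in zip(num_parents, arities, parent_configs):
        structure_bits = math.log2(n_vars) * pa                   # log n per parent
        parameter_bits = 0.5 * math.log2(n_records) * (r - 1) * q # (log N)/2 per parameter
        total += structure_bits + parameter_bits
    return total

# Hypothetical chain A -> B -> C over binary variables, N = 1024 records.
bits = model_encoding_bits(
    n_vars=3,
    num_parents=[0, 1, 1],
    arities=[2, 2, 2],
    parent_configs=[1, 2, 2],
    n_records=1024,
)
```

Adding a parent to a binary variable doubles its q_i, so the parameter part of the cost grows exponentially in the number of parents while the structure part grows only linearly.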



Encoding data with a Bayesian network

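Each complete instantiation $D_l$ is assigned a binary string (codeword) with length $\approx -\log p_l$. We can approximate this value using the counts from the data.

$$\begin{aligned}
p_l &= P(D_l \mid D, \mathcal{N}) \\
&= \prod_{i=1}^{n} \theta_{ijk:l} && \text{chain rule of BNs} \\
&= \prod_{i=1}^{n} \frac{N_{ijk:l}}{N_{ij:l}} && \text{using MLE parameters}
\end{aligned}$$

Here $N_{ijk}$ counts the records with $X_i = k$ and $PA_i = j$, $N_{ij} = \sum_k N_{ijk}$, and the subscript $:l$ picks out the $j$ and $k$ that record $D_l$ assigns to $X_i$ and its parents.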



Encoding data with a Bayesian network

Each complete instantiation $D_l$ is assigned a binary string (codeword) with length $\approx -\log p_l$.

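$$\begin{aligned}
\mathrm{len}(D : \mathcal{N}) &= \sum_{l=1}^{N} \mathrm{len}(D_l : \mathcal{N}) \\
&= \sum_{l=1}^{N} -\log \prod_{i=1}^{n} \frac{N_{ijk:l}}{N_{ij:l}} \\
&= -\sum_{l=1}^{N} \log \prod_{i=1}^{n} \prod_{k=1}^{r_i} \prod_{j=1}^{q_i} \frac{N_{ijk:l}}{N_{ij:l}} \\
&= -\sum_{l=1}^{N} \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \log \frac{N_{ijk:l}}{N_{ij:l}} \\
&= -\sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \cdot \log \frac{N_{ijk}}{N_{ij}}
\end{aligned}$$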

This is the negative log-likelihood of the data using the MLE parameters, denoted ℓ.
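The last step of the derivation regroups the per-record sum by counts. A small numerical check may make this concrete. The sketch below assumes a hypothetical two-variable dataset with structure X0 -> X1 and base-2 logarithms; none of the names come from the slides.

```python
import math
from collections import Counter

# Hypothetical toy data over two binary variables (X0, X1), with X0 -> X1.
records = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
parents = {0: (), 1: (0,)}

# N_ijk: how often X_i = k occurs together with parent configuration j.
n_ijk = Counter()
n_ij = Counter()
for rec in records:
    for i, pa in parents.items():
        j = tuple(rec[p] for p in pa)
        n_ijk[(i, j, rec[i])] += 1
        n_ij[(i, j)] += 1

# Route 1: sum the codeword lengths -log2 p_l record by record.
per_record = 0.0
for rec in records:
    for i, pa in parents.items():
        j = tuple(rec[p] for p in pa)
        per_record -= math.log2(n_ijk[(i, j, rec[i])] / n_ij[(i, j)])

# Route 2: the closed form over counts from the last line of the derivation.
from_counts = -sum(
    count * math.log2(count / n_ij[(i, j)])
    for (i, j, k), count in n_ijk.items()
)
```

Both routes give the same number of bits, because each distinct (i, j, k) term appears in the per-record sum exactly N_ijk times.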



MDL as a scoring function

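As derived here, the MDL score for a network $\mathcal{N}$ given a dataset $D$ is as follows.

$$MDL(\mathcal{N} : D) = \sum_{i=1}^{n} \left[ -\sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} + \log n \cdot |PA_i| + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right]$$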

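As the dataset size $N$ grows, the $\log n \cdot |PA_i|$ term, which does not depend on $N$, becomes negligible next to the other terms, so the most commonly used version of MDL drops it.

$$\begin{aligned}
MDL(\mathcal{N} : D) &= \sum_{i=1}^{n} \left[ -\sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right] \\
&= \sum_{i=1}^{n} \left[ \ell(X_i \mid PA_i) + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right]
\end{aligned}$$

Here $\ell(X_i \mid PA_i) = -\sum_{j} \sum_{k} N_{ijk} \log \frac{N_{ijk}}{N_{ij}}$ is the negative log-likelihood term for $X_i$.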
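To see the tradeoff between fit and complexity, the term for a single variable can be computed from a table of counts. This is a minimal sketch, assuming natural logarithms, counts laid out as `counts[j][k]`, and the convention that a lower score is better; the function name and the toy counts are hypothetical.

```python
import math

def mdl_local(counts, n_records):
    """MDL term for one variable; counts[j][k] = N_ijk. Lower is better."""
    q = len(counts)       # number of parent configurations, q_i
    r = len(counts[0])    # number of values of the variable, r_i
    neg_loglik = 0.0
    for row in counts:
        n_ij = sum(row)
        for n_ijk in row:
            if n_ijk > 0:  # treat 0 * log 0 as 0
                neg_loglik -= n_ijk * math.log(n_ijk / n_ij)
    penalty = 0.5 * math.log(n_records) * (r - 1) * q
    return neg_loglik + penalty

# Eight hypothetical records; X1 either copies X0 or ignores it.
no_parent = mdl_local([[4, 4]], 8)
informative_parent = mdl_local([[4, 0], [0, 4]], 8)
useless_parent = mdl_local([[2, 2], [2, 2]], 8)
```

With these counts the informative parent lowers the score, because the saving in data encoding outweighs the extra parameters, while the useless parent only adds to the penalty.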



Bayesian Dirichlet (BD) Score Family

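Suppose we would like to maximize the joint probability of the data $D$ and network $\mathcal{N}$.

$$\begin{aligned}
\mathcal{N}^* &= \arg\max_{\mathcal{N}} P(D, \mathcal{N}) \\
&= \arg\max_{\mathcal{N}} P(D \mid \mathcal{N}) P(\mathcal{N})
\end{aligned}$$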

We again have two parts.

Evaluation of the model, P(N )

Evaluation of data given the model, P(D|N )



Data given the model

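We need to evaluate $P(D \mid \mathcal{N})$. We derived $P(x_l \mid D, \mathcal{N})$ for parameter estimation.

$$P(x_l \mid D, \mathcal{N}) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \frac{\alpha_{ijk:l} + n_{ijk:l}}{\sum_{k} (\alpha_{ijk:l} + n_{ijk:l})}$$

Because the samples are iid, we can evaluate $P(D \mid \mathcal{N})$ by taking the product.

$$\begin{aligned}
P(D \mid \mathcal{N}) &= \prod_{l=1}^{N} \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \frac{\alpha_{ijk:l} + n_{ijk:l}}{\sum_{k} (\alpha_{ijk:l} + n_{ijk:l})} \\
&= \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})}
\end{aligned}$$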
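The step from the product over records to the Gamma functions relies on the identity Γ(x + 1) = x · Γ(x). The sketch below checks it numerically for one variable and one parent configuration, assuming that the counts in each per-record factor are those accumulated from the records seen earlier; the outcome sequence and hyperparameters are hypothetical.

```python
import math

def sequential_product(outcomes, alphas):
    """Multiply the predictive probability of each outcome in turn."""
    seen = [0] * len(alphas)
    prob = 1.0
    for k in outcomes:
        prob *= (alphas[k] + seen[k]) / (sum(alphas) + sum(seen))
        seen[k] += 1  # update the counts after predicting this record
    return prob

def gamma_closed_form(outcomes, alphas):
    """Gamma(a_ij)/Gamma(a_ij + n_ij) * prod_k Gamma(a_ijk + n_ijk)/Gamma(a_ijk)."""
    counts = [outcomes.count(k) for k in range(len(alphas))]
    log_p = math.lgamma(sum(alphas)) - math.lgamma(sum(alphas) + sum(counts))
    for a, c in zip(alphas, counts):
        log_p += math.lgamma(a + c) - math.lgamma(a)
    return math.exp(log_p)

outcomes = [0, 1, 0, 0, 2, 1]   # hypothetical values of one variable
alphas = [0.5, 0.5, 0.5]        # hypothetical Dirichlet hyperparameters
step_by_step = sequential_product(outcomes, alphas)
closed_form = gamma_closed_form(outcomes, alphas)
```

The two values agree up to floating-point error, and the closed form depends only on the final counts, not on the order of the records.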



Score probability, P(D,N )

We are interested in the joint probability of D and N .

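$$\begin{aligned}
P(D, \mathcal{N}) &= P(\mathcal{N}) P(D \mid \mathcal{N}) \\
&= P(\mathcal{N}) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})}
\end{aligned}$$

This is called the BD scoring function. If we set $\alpha_{ijk} = 1$, then it is called the K2 metric.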






Some desirable equivalences

Say N1 and N2 are Markov equivalent.

Prior probabilities and equivalence

P(N1) = P(N2)

Likelihood probabilities and equivalence (not guaranteed by BD)

P(D|N1) = P(D|N2)

Score probabilities and equivalence

P(D,N1) = P(D,N2)



BDe and BDeu

We can restrict the hyperparameters to ensure likelihood equivalence. This is BDe.

$$\alpha_{ijk} = \alpha \cdot P(X_i = k, PA_i = j \mid \mathcal{N})$$

Typically, uninformative hyperparameters are used. This is BDeu.

$$\alpha_{ijk} = \frac{\alpha}{r_i \cdot q_i}$$
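A two-variable example can illustrate why the choice of hyperparameters matters for likelihood equivalence. The structures X -> Y and Y -> X are Markov equivalent, so an equivalent score should rate them identically. The sketch below compares K2 hyperparameters (all equal to 1) with BDeu hyperparameters for α = 1; the count table and all function names are hypothetical.

```python
import math

def bd_local(counts, alpha_ijk):
    """Log of the BD term for one variable; counts[j][k] = n_ijk."""
    log_p = 0.0
    for row in counts:
        a_ij = alpha_ijk * len(row)  # alpha_ij is the sum of alpha_ijk over k
        log_p += math.lgamma(a_ij) - math.lgamma(a_ij + sum(row))
        for n in row:
            log_p += math.lgamma(alpha_ijk + n) - math.lgamma(alpha_ijk)
    return log_p

def log_likelihood(joint, alpha_for):
    """log P(D | X -> Y) for a 2x2 table joint[x][y] of counts.

    alpha_for(r, q) returns alpha_ijk for a variable with r values
    and q parent configurations.
    """
    x_counts = [[sum(row) for row in joint]]  # X has no parents: q = 1
    y_given_x = joint                         # one row per value of X: q = 2
    return (bd_local(x_counts, alpha_for(2, 1))
            + bd_local(y_given_x, alpha_for(2, 2)))

def k2(r, q):
    return 1.0

def bdeu(r, q):
    return 1.0 / (r * q)  # equivalent sample size alpha = 1

joint = [[3, 1], [0, 2]]                         # hypothetical counts n(x, y)
transposed = [list(col) for col in zip(*joint)]  # same data read as Y -> X

k2_gap = log_likelihood(joint, k2) - log_likelihood(transposed, k2)
bdeu_gap = log_likelihood(joint, bdeu) - log_likelihood(transposed, bdeu)
```

For this table the BDeu gap is zero up to rounding, while the K2 gap is not, so K2 would prefer one of two structures that encode the same independence statements.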



The BDeu scoring function

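We can incorporate our assumptions to derive the BDeu scoring function.

$$\begin{aligned}
P(D, \mathcal{N}) &= P(\mathcal{N}) P(D \mid \mathcal{N}) && \text{rewrite using chain rule} \\
&= P(\mathcal{N}) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} && \text{substitute probability of data} \\
&\propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + n_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + n_{ijk})}{\Gamma(\alpha_{ijk})} && \text{assume a uniform structure prior} \\
&\propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma\left(\frac{\alpha}{q_i}\right)}{\Gamma\left(\frac{\alpha}{q_i} + n_{ij}\right)} \prod_{k=1}^{r_i} \frac{\Gamma\left(\frac{\alpha}{r_i \cdot q_i} + n_{ijk}\right)}{\Gamma\left(\frac{\alpha}{r_i \cdot q_i}\right)} && \text{replace the } \alpha\text{s}
\end{aligned}$$

$$\begin{aligned}
BDeu(\mathcal{N} : D, \alpha) &= \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left[ \log \frac{\Gamma\left(\frac{\alpha}{q_i}\right)}{\Gamma\left(\frac{\alpha}{q_i} + n_{ij}\right)} + \sum_{k=1}^{r_i} \log \frac{\Gamma\left(\frac{\alpha}{r_i \cdot q_i} + n_{ijk}\right)}{\Gamma\left(\frac{\alpha}{r_i \cdot q_i}\right)} \right] && \text{work in log-space} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left[ \log \Gamma\left(\frac{\alpha}{q_i}\right) - \log \Gamma\left(\frac{\alpha}{q_i} + n_{ij}\right) + \sum_{k=1}^{r_i} \left( \log \Gamma\left(\frac{\alpha}{r_i \cdot q_i} + n_{ijk}\right) - \log \Gamma\left(\frac{\alpha}{r_i \cdot q_i}\right) \right) \right] && \text{remove divisions}
\end{aligned}$$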
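The final log-space form maps directly onto the log-Gamma function available in most numerical libraries. This is a minimal sketch of the term for one variable using Python's `math.lgamma`, assuming counts laid out as `counts[j][k]` and the convention that a higher score is better; the function name and the toy counts are hypothetical.

```python
import math

def bdeu_local(counts, alpha=1.0):
    """BDeu term for one variable; counts[j][k] = n_ijk. Higher is better."""
    q = len(counts)        # number of parent configurations, q_i
    r = len(counts[0])     # number of values of the variable, r_i
    a_ij = alpha / q
    a_ijk = alpha / (r * q)
    score = 0.0
    for row in counts:
        score += math.lgamma(a_ij) - math.lgamma(a_ij + sum(row))
        for n_ijk in row:
            score += math.lgamma(a_ijk + n_ijk) - math.lgamma(a_ijk)
    return score

# Eight hypothetical records; X1 either copies X0 or ignores it.
no_parent = bdeu_local([[4, 4]])
informative_parent = bdeu_local([[4, 0], [0, 4]])
useless_parent = bdeu_local([[2, 2], [2, 2]])
```

Working with `lgamma` rather than the Gamma function itself avoids overflow, since Γ grows factorially with the counts.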



Decomposability

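Both MDL and BD are decomposable: each is a sum over terms which involve only a variable and its parents.

$$MDL(\mathcal{N} : D) = \sum_{i=1}^{n} \left\{ \ell(X_i \mid PA_i) + \frac{\log N}{2} \cdot (r_i - 1) \cdot q_i \right\}$$

$$BDeu(\mathcal{N} : D, \alpha) = \sum_{i=1}^{n} \left\{ \sum_{j=1}^{q_i} \left[ \log \Gamma\left(\frac{\alpha}{q_i}\right) - \log \Gamma\left(\frac{\alpha}{q_i} + n_{ij}\right) + \sum_{k=1}^{r_i} \left( \log \Gamma\left(\frac{\alpha}{r_i \cdot q_i} + n_{ijk}\right) - \log \Gamma\left(\frac{\alpha}{r_i \cdot q_i}\right) \right) \right] \right\}$$

Here $\ell(X_i \mid PA_i) = -\sum_{j} \sum_{k} N_{ijk} \log \frac{N_{ijk}}{N_{ij}}$ is the negative log-likelihood term for $X_i$.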

What does it mean when we evaluate different structures?

[Figure: two candidate network structures over the same five variables: Winter? (A), Sprinkler? (B), Rain? (C), Wet Grass? (D), Slippery Road? (E).]
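One practical reading of decomposability is that two structures which differ in the parents of a single variable differ in only one term of the sum, so the other terms can be reused. The sketch below caches MDL terms by (variable, parent set), assuming a hypothetical three-variable binary dataset and natural logarithms; the names are illustrative only.

```python
import math
from collections import Counter
from functools import lru_cache

# Hypothetical toy data over three binary variables (columns 0, 1, 2).
DATA = ((0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1))
ARITY = 2
CALLS = Counter()  # how many times each local term is actually computed

@lru_cache(maxsize=None)
def local_mdl(var, parents):
    """MDL term of one variable given a tuple of parent columns."""
    CALLS[(var, parents)] += 1
    n_ijk = Counter((tuple(rec[p] for p in parents), rec[var]) for rec in DATA)
    n_ij = Counter(tuple(rec[p] for p in parents) for rec in DATA)
    neg_loglik = -sum(c * math.log(c / n_ij[j]) for (j, _), c in n_ijk.items())
    q = ARITY ** len(parents)  # number of parent configurations
    return neg_loglik + 0.5 * math.log(len(DATA)) * (ARITY - 1) * q

def network_mdl(structure):
    """structure maps each variable to a tuple of its parents."""
    return sum(local_mdl(v, pa) for v, pa in structure.items())

before = network_mdl({0: (), 1: (0,), 2: ()})
after = network_mdl({0: (), 1: (0,), 2: (1,)})  # only the parents of X2 changed
```

Scoring the second structure computes just one new local term; the terms for X0 and X1 come from the cache. Search procedures over structures can exploit this when they change one edge at a time.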



Limitations of scoring functions

Parameter independence is violated if data is missing.

Experimental data is different from observational data.

(MDL) When do we use asymptotics?

(BD) How do we specify α and P(N )?



Recap

During this part of the course, we have discussed:

Overfitting

Minimum description length scoring function for BNs

BD family of scores for BNs



Next in probabilistic models

We will discuss two strategies for learning Bayesian network structures.

A greedy hill climbing algorithm which finds local optima

A dynamic programming algorithm which is guaranteed to find an optimal network

