Structure Learning in Bayesian Networks
Eran Segal Weizmann Institute
Structure Learning: Motivation
Network structure is often unknown
Purposes of structure learning:
Discover the dependency structure of the domain: goes beyond statistical correlations between individual variables and detects direct vs. indirect correlations. Set expectations: at best, we can recover the structure up to its I-equivalence class.
Density estimation: estimate a statistical model of the underlying distribution and use it to reason with and predict new instances.
Advantages of Accurate Structure
[Figure: three networks over X1, Y, X2: the true structure, one with a spurious edge, and one with a missing edge]
Spurious edge: increases the number of fitted parameters; leads to wrong causality and domain structure assumptions
Missing edge: cannot be compensated for by parameter estimation; leads to wrong causality and domain structure assumptions
Structure Learning Approaches
Constraint-based methods: view the Bayesian network as representing dependencies, and find a network that best explains these dependencies. Limitation: sensitive to errors in individual independence tests.
Score-based approaches: view learning as a model selection problem; define a scoring function specifying how well the model fits the data, and search for a high-scoring network structure. Limitation: super-exponential search space.
Bayesian model averaging methods: average predictions across all possible structures. Can be done exactly (in some cases) or approximately.
Constraint-Based Approaches
Goal: find the best minimal I-Map for the domain
G is an I-Map for P if I(G) ⊆ I(P)
G is a minimal I-Map if deleting any edge from G renders it no longer an I-Map
G is a P-Map for P if I(G) = I(P)
Strategy: query the distribution for independence relationships that hold between sets of variables, and construct a network which is the best minimal I-Map for P
Constructing Minimal I-Maps
Reverse factorization theorem:
If P(X1,…,Xn) = ∏_{i=1}^n P(Xi | Pa(Xi)), then G is an I-Map of P
Algorithm for constructing a minimal I-Map:
Fix an ordering of the nodes X1,…,Xn
Select the parents of Xi as a minimal subset of {X1,…,Xi−1} such that Ind(Xi ; {X1,…,Xi−1} − Pa(Xi) | Pa(Xi))
(Outline of) proof: the result is an I-Map, since the factorization above holds by construction; it is minimal, since by construction removing any edge destroys the factorization
Limitations
Independence queries involve a large number of variables
Construction involves a large number of queries (2^(i−1) subsets for node Xi)
We do not know the ordering, and the resulting network is sensitive to it
Constructing P-Maps
Simplifying assumptions:
The network has bounded in-degree d per node
An oracle can answer independence queries over up to 2d+2 variables
The distribution P has a P-Map
Algorithm:
Step I: find the skeleton
Step II: identify immoralities (v-structures)
Step III: direct the constrained edges

Step I: Identifying the Skeleton
For each pair X,Y, query all Z for Ind(X;Y | Z)
X–Y is in the skeleton if no such Z is found
If the graph in-degree is bounded by d, the running time is O(n^(2d+2)), since if no direct edge exists then Ind(X;Y | Pa(X), Pa(Y))
Reminder: if there is no Z for which Ind(X;Y | Z) holds, then X→Y or Y→X in G*
Proof: assume no such Z exists, yet G* has neither X→Y nor Y→X. Then we can find a set Z such that every path from X to Y is blocked, so G* implies Ind(X;Y | Z), and since G* is a P-Map, this is a contradiction
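Step I can be sketched in Python; this is an illustrative sketch, not code from the lecture, where `ind(X, Y, Z)` is an assumed independence oracle (in practice implemented by a statistical test, as discussed later in these slides):

```python
from itertools import combinations

def find_skeleton(variables, ind, d):
    """Step I sketch: X-Y is a skeleton edge iff no witness set Z
    separates X and Y. With in-degree bounded by d, it suffices to
    check witness sets of size at most 2d (both parent sets).
    `ind(X, Y, Z)` is an assumed independence oracle."""
    skeleton = set()
    for X, Y in combinations(variables, 2):
        others = [V for V in variables if V not in (X, Y)]
        # Search all candidate witness sets up to the size bound
        found_witness = any(
            ind(X, Y, Z)
            for size in range(min(2 * d, len(others)) + 1)
            for Z in combinations(others, size)
        )
        if not found_witness:
            skeleton.add(frozenset((X, Y)))
    return skeleton
```

On a chain X→Y→Z, an oracle that certifies only Ind(X;Z | {Y}) yields exactly the two skeleton edges X–Y and Y–Z.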
Step II: Identifying Immoralities
For each pair X,Y, query candidate triplets X–Z–Y
Direct as X→Z←Y if no W containing Z is found such that Ind(X;Y | W)
If the graph in-degree is bounded by d, the running time is O(n^(2d+3))
This works since if such a W exists with Ind(X;Y | W) and X–Z–Y is not an immorality, then Z ∈ W
Reminder: if there is no W such that Z ∈ W and Ind(X;Y | W), then X→Z←Y is an immorality
Proof: assume no such W exists, but X–Z–Y is not an immorality. Then either X→Z→Y or X←Z←Y or X←Z→Y exists, so we can block the path X–Z–Y by conditioning on Z. Then, since X and Y are not adjacent, we can find a W that includes Z such that Ind(X;Y | W), a contradiction
Answering Independence Queries
Basic query: determine whether two variables are independent; a well-studied question in statistics
It is common to frame the query as hypothesis testing, with null hypothesis
H0: the data was sampled from P*(X,Y) = P*(X)·P*(Y)
We need a procedure that will accept or reject the hypothesis
χ² test to assess the deviance of the data from the hypothesis
Alternatively, use the mutual information between X and Y:
d_I(D) = 2M ∑_{x,y} P̂(x,y) log [ P̂(x,y) / (P̂(x)·P̂(y)) ]
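As a concrete illustration (a sketch assuming fully observed discrete samples; names are illustrative), the statistic 2M·Î(X;Y) can be computed directly from empirical counts:

```python
import math
from collections import Counter

def mi_deviance(pairs):
    """Compute d(D) = 2M * I_hat(X;Y), the mutual-information form of
    the deviance of the data from the independence hypothesis H0.
    `pairs` is a list of (x, y) samples."""
    M = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = sum(
        (m / M) * math.log((m / M) / ((px[x] / M) * (py[y] / M)))
        for (x, y), m in joint.items()
    )
    return 2 * M * mi
```

Under H0 this statistic is asymptotically χ²-distributed, so large values lead us to reject independence; perfectly dependent binary samples give 2M·log 2, while uniform independent samples give 0.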
Score-Based Approaches
Strategy:
Define a scoring function for each candidate structure
Search for a high-scoring structure
Key: the choice of scoring function
Likelihood-based scores
Bayesian scores

Likelihood Scores
Goal: find (G,θ) that maximize the likelihood
Score_L(G:D) = log P(D | G, θ̂_G), where θ̂_G is the MLE for G
Find G that maximizes Score_L(G:D)

Example: G0 is the empty network over X and Y; G1 is X→Y
Score_L(G0:D) = ∑_m [ log θ̂_{x[m]} + log θ̂_{y[m]} ]
Score_L(G1:D) = ∑_m [ log θ̂_{x[m]} + log θ̂_{y[m]|x[m]} ]
Score_L(G1:D) − Score_L(G0:D) = ∑_{x,y} M[x,y] log θ̂_{y|x} − ∑_y M[y] log θ̂_y
  = M ∑_{x,y} P̂(x,y) log P̂(y|x) − M ∑_y P̂(y) log P̂(y)
  = M ∑_{x,y} P̂(x,y) log [ P̂(x,y) / (P̂(x)·P̂(y)) ]
  = M · I_P̂(X;Y)
General Decomposition
The likelihood score decomposes as:
Score_L(G:D) = ∑_{i=1}^n ∑_{u∈Val(Pa_{Xi}^G)} ∑_{xi} M[xi,u] log θ̂_{xi|u}
  = M ∑_{i=1}^n I_P̂(Xi ; Pa_{Xi}^G) − M ∑_{i=1}^n H_P̂(Xi)
Proof: for each family,
(1/M) ∑_u ∑_{xi} M[xi,u] log θ̂_{xi|u} = ∑_{u,xi} P̂(xi,u) log P̂(xi|u)
  = ∑_{u,xi} P̂(xi,u) log [ P̂(xi,u) / (P̂(u)·P̂(xi)) ] + ∑_{xi} P̂(xi) log P̂(xi)
  = I_P̂(Xi ; Pa_{Xi}^G) − H_P̂(Xi)
General Decomposition
The likelihood score decomposes as:
Score_L(G:D) = M ∑_{i=1}^n I_P̂(Xi ; Pa_{Xi}^G) − M ∑_{i=1}^n H_P̂(Xi)
The second term does not depend on the network structure, and is thus irrelevant for selecting between two structures
The score increases as the mutual information, i.e., the strength of dependence between connected variables, increases
After some manipulation one can show:
Score_L(G:D) = −M ∑_{i=1}^n I_P̂(Xi ; {X1,…,Xi−1} − Pa_{Xi}^G | Pa_{Xi}^G) − M · H_P̂(X1,…,Xn)
i.e., the score measures to what extent the implied Markov assumptions are valid
Limitations of the Likelihood Score
For G0 the empty network over X and Y, and G1 the network X→Y:
Score_L(G1:D) − Score_L(G0:D) = M · I_P̂(X;Y)
Since I_P̂(X;Y) ≥ 0, we have Score_L(G1:D) ≥ Score_L(G0:D)
Adding arcs always helps
Maximal scores are attained for the fully connected network
Such networks overfit the data (i.e., fit the noise in the data)
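This effect is easy to reproduce; below is a toy sketch (illustrative, not from the lecture) that computes the MLE likelihood score of a structure from counts, where adding a parent never decreases the score:

```python
import math
from collections import Counter

def loglik_score(data, parents):
    """Likelihood score of a structure with MLE parameters:
    sum over families of M[x,u] * log theta_hat[x|u].
    `data` is a list of tuples over variables 0..n-1;
    `parents[i]` is a tuple of parent indices for variable i."""
    score = 0.0
    for i, pa in enumerate(parents):
        # Joint counts of (parent assignment, child value) and
        # marginal counts of the parent assignment
        fam = Counter((tuple(row[j] for j in pa), row[i]) for row in data)
        ctx = Counter(tuple(row[j] for j in pa) for row in data)
        score += sum(m * math.log(m / ctx[u]) for (u, x), m in fam.items())
    return score
```

On any dataset where X and Y are even slightly correlated, the score of X→Y strictly exceeds that of the empty network, which is why the raw likelihood score favors the fully connected graph.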
Avoiding Overfitting
A classical problem in machine learning. Solutions:
Restricting the hypothesis space: limits the overfitting capability of the learner. Example: restrict the number of parents or the number of parameters.
Minimum description length: description length measures complexity; prefer models that compactly describe the training data.
Bayesian methods: average over all possible parameter values; use prior knowledge.
Bayesian Score
P(G|D) = P(D|G) · P(G) / P(D)
P(D|G): marginal likelihood
P(G): prior over structures
P(D): marginal probability of the data; does not depend on the network
Bayesian score:
Score_B(G:D) = log P(D|G) + log P(G)

Marginal Likelihood of Data Given G
P(D|G) = ∫ P(D | G, θ_G) · P(θ_G | G) dθ_G
(likelihood times prior over parameters)
Note the similarity to the maximum likelihood score, but with the key difference that ML finds the maximum of the likelihood, whereas here we average over the parameter space
Marginal Likelihood: Binomial Case
Assume a sequence of M coin tosses. By the chain rule for probabilities:
P(x[1],…,x[M]) = P(x[1]) · P(x[2] | x[1]) ⋯ P(x[M] | x[1],…,x[M−1])
Recall that for Dirichlet priors:
P(x[m+1] = H | x[1],…,x[m]) = (α_H + M_H^m) / (α + m)
where M_H^m is the number of heads in the first m examples and α = α_H + α_T
Thus:
P(x[1],…,x[M]) = [α_H(α_H+1)⋯(α_H+M_H−1)] · [α_T(α_T+1)⋯(α_T+M_T−1)] / [α(α+1)⋯(α+M−1)]
Simplifying using Γ(x+1) = x·Γ(x):
P(x[1],…,x[M]) = [Γ(α) / Γ(α+M)] · [Γ(α_H+M_H) / Γ(α_H)] · [Γ(α_T+M_T) / Γ(α_T)]
For multinomials with a Dirichlet prior:
P(x[1],…,x[M]) = [Γ(α) / Γ(α+M)] · ∏_{i=1}^k [Γ(α_i + M[x_i]) / Γ(α_i)]
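The Gamma-function form above translates directly into code; this is a minimal sketch using log-Gamma for numerical stability:

```python
from math import lgamma

def log_marginal_dirichlet(counts, alphas):
    """Log marginal likelihood of multinomial counts under a
    Dirichlet(alphas) prior, following the Gamma-function form:
    log G(a) - log G(a+M) + sum_i [log G(a_i+M_i) - log G(a_i)],
    where a = sum of alphas and M = sum of counts."""
    a = sum(alphas)
    M = sum(counts)
    return (lgamma(a) - lgamma(a + M)
            + sum(lgamma(ai + mi) - lgamma(ai)
                  for ai, mi in zip(alphas, counts)))
```

For example, a single head under a uniform Dirichlet(1,1) prior gives probability 1/2, and the sequence H,T,H gives 1/12, matching the chain-rule product (1/2)(1/3)(2/4).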
Actual experiment with P(H) = 0.25
[Figure: normalized log marginal likelihood (log P(D))/M as a function of the number of samples M, for priors Dirichlet(0.5, 0.5), Dirichlet(1, 1), and Dirichlet(5, 5)]
Marginal Likelihood Example
Seven paired tosses of two coins X and Y:
X: H T T H T H H
Y: H T H H T T H
The network structure determines the form of the marginal likelihood
Network 1 (X and Y independent): two Dirichlet marginal likelihoods
P(X[1],…,X[7]) · P(Y[1],…,Y[7])
Network 2 (X→Y): three Dirichlet marginal likelihoods, with the Y values grouped by the value of their parent X
P(X[1],…,X[7]) · P(Y[1],Y[4],Y[6],Y[7]) · P(Y[2],Y[3],Y[5])
Idealized Experiment
P(X = H) = 0.5, P(Y = H | X = H) = 0.5 + p, P(Y = H | X = T) = 0.5 − p
[Figure: normalized log marginal likelihood (log P(D))/M as a function of M, for the independent model and for p = 0.05, 0.10, 0.15, 0.20]
Marginal Likelihood: Bayes Nets
In general, the marginal likelihood has the form:
P(D|G) = ∏_i ∏_{pa_i^G} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + M[pa_i^G]) ] · ∏_{x_i} [ Γ(α(x_i, pa_i^G) + M[x_i, pa_i^G]) / Γ(α(x_i, pa_i^G)) ]
where the M(·) are the counts from the data and the α(·) are the hyperparameters for each family; each inner factor is a Dirichlet marginal likelihood for the sequence of values of Xi observed when Xi's parents take a particular value
Priors
Structure prior P(G):
Uniform prior: P(G) ∝ constant
Prior penalizing the number of edges: P(G) ∝ c^|G| (0 < c < 1)
The normalizing constant across networks is similar and can thus be ignored

Parameter prior P(θ|G): the BDe prior
M0: equivalent sample size
B0: a network representing the prior probability of events
Set α(x_i, pa_i^G) = M0 · P(x_i, pa_i^G | B0)
Note: pa_i^G are not necessarily the same as the parents of Xi in B0
Compute P(x_i, pa_i^G | B0) using standard inference in B0
BDe has the desirable property that I-equivalent networks have the same Bayesian score when using the BDe prior for some M' and P'
Bayesian Score: Asymptotic Behavior
As M → ∞, a network G with Dirichlet priors satisfies
log P(D|G) = ℓ(θ̂_G : D) − (log M / 2) · Dim(G) + O(1)
where Dim(G) is the number of independent parameters in G
The approximation is called the BIC score:
Score_BIC(G:D) = ℓ(θ̂_G : D) − (log M / 2) · Dim(G)
  = M ∑_{i=1}^n I_P̂(Xi ; Pa_{Xi}^G) − M ∑_{i=1}^n H_P̂(Xi) − (log M / 2) · Dim(G)
The score exhibits a tradeoff between fit to data and complexity:
mutual information grows linearly with M, while complexity grows logarithmically with M; as M grows, more emphasis is given to the fit to the data
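As an illustrative sketch (assuming fully observed discrete data; names are not from the lecture), the BIC tradeoff can be computed directly from counts:

```python
import math
from collections import Counter

def bic_score(data, parents, card):
    """BIC sketch: MLE log-likelihood minus (log M / 2) * Dim(G).
    `data` is a list of tuples, `parents[i]` a tuple of parent
    indices, and `card[i]` the number of values of variable i."""
    M = len(data)
    loglik = 0.0
    dim = 0
    for i, pa in enumerate(parents):
        fam = Counter((tuple(r[j] for j in pa), r[i]) for r in data)
        ctx = Counter(tuple(r[j] for j in pa) for r in data)
        loglik += sum(m * math.log(m / ctx[u]) for (u, x), m in fam.items())
        # Each parent configuration contributes card[i]-1 free parameters
        n_pa_configs = 1
        for j in pa:
            n_pa_configs *= card[j]
        dim += n_pa_configs * (card[i] - 1)
    return loglik - (math.log(M) / 2) * dim
```

On data where X and Y are independent, the complexity penalty makes the empty network beat X→Y once M is moderately large, unlike the raw likelihood score.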
Bayesian Score: Asymptotic Behavior
As M → ∞, a network G with Dirichlet priors satisfies
log P(D|G) = ℓ(θ̂_G : D) − (log M / 2) · Dim(G) + O(1)
The Bayesian score is consistent: as M → ∞, the true structure G* maximizes the score
Spurious edges will not contribute to the likelihood and will be penalized
Required edges will be added, due to the linear growth in M of the likelihood term compared to the logarithmic growth of the model complexity term
Summary: Network Scores
Likelihood, MDL, and (log) BDe all have the form
Score(G:D) = ∑_i Score(Xi | Pa_{Xi}^G : D)
BDe requires assessing a prior network; it can naturally incorporate prior knowledge
BDe is consistent and asymptotically equivalent (up to a constant) to BIC/MDL
All are score-equivalent: if G is I-equivalent to G', then Score(G) = Score(G')
Optimization Problem
Input: training data; a scoring function (including priors, if needed); a set of possible structures, including prior knowledge about structure
Output: a network (or networks) that maximizes the score
Key property, decomposability: the score of a network is a sum of per-family terms:
Score(G:D) = ∑_i Score(Xi | Pa_{Xi}^G : D)
Trees
At most one parent per variable
Why trees? Elegant math: we can solve the optimization problem efficiently (with a greedy algorithm); the sparse parameterization avoids overfitting while adapting to the data

Learning Trees
Let p(i) denote the parent of Xi, or 0 if Xi has no parent
We can write the score as
Score(G:D) = ∑_i Score(Xi : Pa_{Xi})
  = ∑_{i: p(i)>0} Score(Xi | X_{p(i)}) + ∑_{i: p(i)=0} Score(Xi)
  = ∑_{i: p(i)>0} [ Score(Xi | X_{p(i)}) − Score(Xi) ] + ∑_i Score(Xi)
Score = sum of edge scores + constant: the last sum is the score of the "empty" network, and each bracketed term is the improvement over the "empty" network contributed by the edge X_{p(i)} → Xi
Learning Trees
Algorithm:
Construct a graph with vertices 1,…,n
Set w(i→j) = Score(Xj | Xi) − Score(Xj)
Find the tree (or forest) with maximal weight
This can be done with standard algorithms in low-order polynomial time, building the tree greedily (e.g., Kruskal's maximum spanning tree algorithm)
Theorem: the procedure finds the tree with maximal score
When the score is the likelihood, w(i→j) is proportional to I(Xi; Xj); this is known as the Chow-Liu method
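A minimal Chow-Liu sketch (illustrative, not the lecture's code): weight each pair by empirical mutual information and keep a maximum-weight spanning tree via Kruskal's algorithm with union-find:

```python
import math
from collections import Counter
from itertools import combinations

def chow_liu_edges(data, n):
    """Chow-Liu sketch: returns the undirected edges of a maximum
    mutual-information spanning tree over variables 0..n-1.
    `data` is a list of tuples of discrete values."""
    M = len(data)

    def mi(i, j):
        joint = Counter((r[i], r[j]) for r in data)
        pi = Counter(r[i] for r in data)
        pj = Counter(r[j] for r in data)
        return sum((m / M) * math.log(m * M / (pi[x] * pj[y]))
                   for (x, y), m in joint.items())

    parent = list(range(n))           # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for w, i, j in sorted(((mi(i, j), i, j)
                           for i, j in combinations(range(n), 2)),
                          reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:                  # the edge keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

On data where X0 and X1 are perfectly correlated and X2 is independent, the high-weight edge (0, 1) is always selected first.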
Learning Trees: Example
Tree learned from data generated by the Alarm network
Not every edge in the tree is in the original network, and the tree edge directions are arbitrary: we cannot learn arc directions
[Figure: the original Alarm network alongside the learned tree, with correct and spurious edges marked]
Beyond Trees
The problem is not easy for more complex networks
Example: if two parents are allowed, the greedy algorithm is no longer guaranteed to find the optimal network
In fact, no efficient algorithm exists
Theorem: finding the maximal scoring network structure with at most k parents per variable is NP-hard for k > 1

Fixed Ordering
For any decomposable scoring function Score(G:D) and ordering ≺, the maximal scoring network has:
Pa_{Xi}^G = argmax_{U ⊆ {Xj : Xj ≺ Xi}} Score(Xi | U : D)
For a fixed ordering we have independent problems (since the choice at Xi does not constrain the other choices)
If we bound the in-degree per variable by d, the complexity is exponential in d
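Given a fixed ordering, the independent per-family optimization can be sketched as follows (an illustrative sketch; `fam_score` is an assumed decomposable family score, e.g. a BIC family term):

```python
from itertools import combinations

def best_parents_given_order(order, d, fam_score):
    """For a fixed ordering, pick each parent set independently:
    Pa(Xi) = argmax over subsets U of Xi's predecessors with
    |U| <= d of fam_score(Xi, U)."""
    parents = {}
    for pos, X in enumerate(order):
        preds = order[:pos]
        # Enumerate all predecessor subsets up to the in-degree bound
        candidates = (U for k in range(min(d, len(preds)) + 1)
                      for U in combinations(preds, k))
        parents[X] = max(candidates, key=lambda U: fam_score(X, U))
    return parents
```

Because each variable's choice is independent of the others, the loop body never needs to revisit earlier decisions; the cost is one subset enumeration per family.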
Heuristic Search
We address the problem by using heuristic search
Define a search space: nodes are possible structures; edges denote adjacency of structures
Traverse this space looking for high-scoring structures
Search techniques: greedy hill-climbing, best-first search, simulated annealing, ...
Typical operations: add an edge, delete an edge, reverse an edge
[Figure: a four-node network over A, B, C, D and the neighboring structures obtained by adding B→D, deleting B→C, and reversing C→B]
Exploiting Decomposability
Caching: to update the score after a local change, we only need to rescore the families that were changed in the last move
[Figure: the same four-node network and its neighbors, showing that each local move changes only one or two families]
Greedy Hill Climbing
The simplest heuristic local search
Start with a given network: the empty network, the best tree, or a random network
At each iteration: evaluate all possible changes, apply the change that leads to the best improvement in score, and reiterate
Stop when no modification improves the score
Each step requires evaluating O(n) new changes
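The loop above can be sketched as follows (illustrative only: acyclicity checks and the edge-reversal operator are omitted for brevity, and `score` is an assumed decomposable scoring function on edge sets):

```python
def greedy_hill_climb(n, score, start=None):
    """Greedy hill-climbing sketch over structures represented as
    sets of directed edges (i, j) among variables 0..n-1."""
    edges = set(start or [])
    current = score(edges)
    while True:
        best_delta, best_move = 0.0, None
        # Evaluate all single-edge additions and deletions
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                move = (edges - {(i, j)} if (i, j) in edges
                        else edges | {(i, j)})
                delta = score(move) - current
                if delta > best_delta:
                    best_delta, best_move = delta, move
        if best_move is None:         # local maximum (or plateau)
            return edges
        edges, current = best_move, current + best_delta
```

A real implementation would cache family scores so each move is rescored incrementally, as described under decomposability above.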
Greedy Hill Climbing: Pitfalls
Greedy hill-climbing can get stuck in:
Local maxima: all one-edge changes reduce the score
Plateaus: some one-edge changes leave the score unchanged; this happens because equivalent networks receive the same score and are neighbors in the search space
Both occur during structure search
Standard heuristics can escape both: random restarts, TABU search
Equivalence Class Search
Idea: search the space of equivalence classes; equivalence classes can be represented by PDAGs (partially directed acyclic graphs)
Advantages: the PDAG space has fewer local maxima and plateaus, and there are fewer PDAGs than DAGs
Disadvantages: evaluating changes is more expensive, since in addition to the search we need to score a consistent network, and these algorithms are more complex to implement
[Figure: adding the edge Y–Z to the original PDAG X→Y, extracting a consistent DAG X→Y→Z, scoring it, and forming the new PDAG]
Learning Example: Alarm Network
[Figure: KL divergence from the true distribution as a function of the number of samples M, for the true structure with BDe prior M' = 10 and for an unknown structure with BDe prior M' = 10]
Model Selection
So far, we focused on a single model: find the best scoring model and use it to predict the next example
Implicit assumption: the best scoring model dominates the weighted sum; valid with many data instances
Pros: we get a single structure, which allows for efficient use in our tasks
Cons: we are committing to the independencies of a particular structure, and other structures might be as probable given the data
Density estimation: a single structure may suffice, if its joint distribution is similar to those of other high-scoring structures
Structure discovery: define features f(G) (e.g., an edge, a sub-structure, a d-separation query) and compute
P(f|D) = ∑_G f(G) · P(G|D)
This still requires summing over exponentially many structures
Model Averaging Given an Order
Assumptions: a known total order ≺ over the variables; a maximum in-degree d for every variable
Marginal likelihood:
P(D|≺) = ∑_{G∈G_d} P(D|G) · P(G|≺)
  = ∑_{G∈G_d} ∏_i exp( FamScore(Xi | Pa_{Xi}^G : D) )
  = ∏_i ∑_{U : U ⊆ {X : X ≺ Xi}, |U| ≤ d} exp( FamScore(Xi | U : D) )
using a decomposability assumption on the prior P(G|≺), and since, given the ordering ≺, the parent choices are independent
Cost per family: O(n^d); total cost: O(n^(d+1))
Model Averaging Given an Order
Posterior probability of a general feature:
P(f|D,≺) = ∑_{G∈G_d} f(G) · P(D|G) · P(G|≺) / P(D|≺)
Posterior probability of a particular choice of parents (all other terms cancel out):
P(Pa(Xi) = U | D,≺) = exp( FamScore(Xi | U : D) ) / ∑_{U' : U' ⊆ {X : X ≺ Xi}, |U'| ≤ d} exp( FamScore(Xi | U' : D) )
Posterior probability of a particular edge:
P(Xj → Xi | D,≺) = [ ∑_{U : Xj ∈ U, U ⊆ {X : X ≺ Xi}, |U| ≤ d} exp( FamScore(Xi | U : D) ) ] / [ ∑_{U : U ⊆ {X : X ≺ Xi}, |U| ≤ d} exp( FamScore(Xi | U : D) ) ]
Model Averaging
We cannot assume that the order is known
Solution: sample from the posterior distribution P(G|D), and estimate the feature probability by
P(f|D) ≈ (1/K) ∑_{k=1}^K f(G_k)
Sampling can be done by MCMC; sampling can also be done over orderings
Notes on Learning Local Structures
Define the score with local structures. Example: with tree CPDs, the score decomposes by leaves.
The prior may need to be extended. Example: with tree CPDs, a penalty for the tree structure of each CPD.
Extend the search operators to local structure. Example: with tree CPDs, we need to search for the tree structure; this can be done by a local encapsulated search or by defining new global operations.
Search: Summary
A discrete optimization problem; NP-hard in general
We need to resort to heuristic search
In practice, search is relatively fast (~100 vars in ~10 min), thanks to decomposability and sufficient statistics
In some cases, we can reduce the search problem to an easy optimization problem. Example: learning trees.