Structure Learning in Bayesian Networks
Eran Segal Weizmann Institute
Structure Learning: Motivation
Network structure is often unknown
Purposes of structure learning:
Discover the dependency structure of the domain: goes beyond statistical correlations between individual variables and detects direct vs. indirect correlations. Set expectations: at best, we can recover the structure up to its I-equivalence class.
Density estimation: estimate a statistical model of the underlying distribution and use it to reason with and predict new instances.
Advantages of Accurate Structure
[Figure: three networks over X1, Y, X2: the true structure, one with a spurious edge, and one with a missing edge]
Spurious edge: increases the number of fitted parameters; leads to wrong causality and domain structure assumptions
Missing edge: cannot be compensated for by parameter estimation; leads to wrong causality and domain structure assumptions
Structure Learning Approaches
Constraint-based methods: view the Bayesian network as representing dependencies, and find a network that best explains these dependencies. Limitation: sensitive to errors in individual independence tests.
Score-based approaches: view learning as a model selection problem; define a scoring function specifying how well the model fits the data, and search for a high-scoring network structure. Limitation: super-exponential search space.
Bayesian model averaging methods: average predictions across all possible structures. Can be done exactly (in some cases) or approximately.
Constraint-Based Approaches
Goal: find the best minimal I-Map for the domain
G is an I-Map for P if I(G) ⊆ I(P)
G is a minimal I-Map if deleting any edge from G renders it no longer an I-Map
G is a P-Map for P if I(G) = I(P)
Strategy: query the distribution for independence relationships that hold between sets of variables, and construct a network which is the best minimal I-Map for P
Constructing Minimal I-Maps
Reverse factorization theorem:
If P(X1,…,Xn) = ∏_{i=1}^n P(Xi | Pa(Xi)), then G is an I-Map of P
Algorithm for constructing a minimal I-Map:
Fix an ordering of the nodes X1,…,Xn
Select the parents of Xi as a minimal subset of {X1,…,Xi−1} such that Ind(Xi ; {X1,…,Xi−1} − Pa(Xi) | Pa(Xi))
(Outline of) proof: the result is an I-Map, since the factorization above holds by construction; it is minimal, since by construction removing any edge destroys the factorization
Limitations
Independence queries involve a large number of variables
Construction involves a large number of queries (2^(i−1) subsets for node Xi)
We do not know the ordering, and the resulting network is sensitive to it
Constructing P-Maps
Simplifying assumptions:
The network has bounded in-degree d per node
An oracle can answer independence queries over up to 2d+2 variables
The distribution P has a P-Map
Algorithm:
Step I: find the skeleton
Step II: identify immoralities (v-structures)
Step III: direct the constrained edges

Step I: Identifying the Skeleton
For each pair X,Y, query all Z for Ind(X;Y | Z)
X–Y is in the skeleton if no such Z is found
If the graph in-degree is bounded by d, the running time is O(n^(2d+2)), since if no direct edge exists then Ind(X;Y | Pa(X), Pa(Y))
Reminder: if there is no Z for which Ind(X;Y | Z) holds, then X→Y or Y→X in G*
Proof: assume no such Z exists, yet G* has neither X→Y nor Y→X. Then we can find a set Z such that every path from X to Y is blocked, so G* implies Ind(X;Y | Z), and since G* is a P-Map, this is a contradiction
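Step I can be sketched in Python; this is an illustrative sketch, not code from the lecture, where `ind(X, Y, Z)` is an assumed independence oracle (in practice implemented by a statistical test, as discussed later in these slides):

```python
from itertools import combinations

def find_skeleton(variables, ind, d):
    """Step I sketch: X-Y is a skeleton edge iff no witness set Z
    separates X and Y. With in-degree bounded by d, it suffices to
    check witness sets of size at most 2d (both parent sets).
    `ind(X, Y, Z)` is an assumed independence oracle."""
    skeleton = set()
    for X, Y in combinations(variables, 2):
        others = [V for V in variables if V not in (X, Y)]
        # Search all candidate witness sets up to the size bound
        found_witness = any(
            ind(X, Y, Z)
            for size in range(min(2 * d, len(others)) + 1)
            for Z in combinations(others, size)
        )
        if not found_witness:
            skeleton.add(frozenset((X, Y)))
    return skeleton
```

On a chain X→Y→Z, an oracle that certifies only Ind(X;Z | {Y}) yields exactly the two skeleton edges X–Y and Y–Z.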
Step II: Identifying Immoralities
For each pair X,Y, query candidate triplets X–Z–Y
Direct as X→Z←Y if no W containing Z is found such that Ind(X;Y | W)
If the graph in-degree is bounded by d, the running time is O(n^(2d+3))
This works since if such a W exists with Ind(X;Y | W) and X–Z–Y is not an immorality, then Z ∈ W
Reminder: if there is no W such that Z ∈ W and Ind(X;Y | W), then X→Z←Y is an immorality
Proof: assume no such W exists, but X–Z–Y is not an immorality. Then either X→Z→Y or X←Z←Y or X←Z→Y exists, so we can block the path X–Z–Y by conditioning on Z. Then, since X and Y are not adjacent, we can find a W that includes Z such that Ind(X;Y | W), a contradiction
Answering Independence Queries
Basic query: determine whether two variables are independent; a well-studied question in statistics
It is common to frame the query as hypothesis testing, with null hypothesis
H0: the data was sampled from P*(X,Y) = P*(X)·P*(Y)
We need a procedure that will accept or reject the hypothesis
χ² test to assess the deviance of the data from the hypothesis
Alternatively, use the mutual information between X and Y:
d_I(D) = 2M ∑_{x,y} P̂(x,y) log [ P̂(x,y) / (P̂(x)·P̂(y)) ]
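As a concrete illustration (a sketch assuming fully observed discrete samples; names are illustrative), the statistic 2M·Î(X;Y) can be computed directly from empirical counts:

```python
import math
from collections import Counter

def mi_deviance(pairs):
    """Compute d(D) = 2M * I_hat(X;Y), the mutual-information form of
    the deviance of the data from the independence hypothesis H0.
    `pairs` is a list of (x, y) samples."""
    M = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = sum(
        (m / M) * math.log((m / M) / ((px[x] / M) * (py[y] / M)))
        for (x, y), m in joint.items()
    )
    return 2 * M * mi
```

Under H0 this statistic is asymptotically χ²-distributed, so large values lead us to reject independence; perfectly dependent binary samples give 2M·log 2, while uniform independent samples give 0.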
Score-Based Approaches
Strategy:
Define a scoring function for each candidate structure
Search for a high-scoring structure
Key: the choice of scoring function
Likelihood-based scores
Bayesian scores

Likelihood Scores
Goal: find (G,θ) that maximize the likelihood
Score_L(G:D) = log P(D | G, θ̂_G), where θ̂_G is the MLE for G
Find G that maximizes Score_L(G:D)

Example: G0 is the empty network over X and Y; G1 is X→Y
Score_L(G0:D) = ∑_m [ log θ̂_{x[m]} + log θ̂_{y[m]} ]
Score_L(G1:D) = ∑_m [ log θ̂_{x[m]} + log θ̂_{y[m]|x[m]} ]
Score_L(G1:D) − Score_L(G0:D) = ∑_{x,y} M[x,y] log θ̂_{y|x} − ∑_y M[y] log θ̂_y
  = M ∑_{x,y} P̂(x,y) log P̂(y|x) − M ∑_y P̂(y) log P̂(y)
  = M ∑_{x,y} P̂(x,y) log [ P̂(x,y) / (P̂(x)·P̂(y)) ]
  = M · I_P̂(X;Y)
General Decomposition
The likelihood score decomposes as:
Score_L(G:D) = ∑_{i=1}^n ∑_{u∈Val(Pa_{Xi}^G)} ∑_{xi} M[xi,u] log θ̂_{xi|u}
  = M ∑_{i=1}^n I_P̂(Xi ; Pa_{Xi}^G) − M ∑_{i=1}^n H_P̂(Xi)
Proof: for each family,
(1/M) ∑_u ∑_{xi} M[xi,u] log θ̂_{xi|u} = ∑_{u,xi} P̂(xi,u) log P̂(xi|u)
  = ∑_{u,xi} P̂(xi,u) log [ P̂(xi,u) / (P̂(u)·P̂(xi)) ] + ∑_{xi} P̂(xi) log P̂(xi)
  = I_P̂(Xi ; Pa_{Xi}^G) − H_P̂(Xi)
General Decomposition
The likelihood score decomposes as:
Score_L(G:D) = M ∑_{i=1}^n I_P̂(Xi ; Pa_{Xi}^G) − M ∑_{i=1}^n H_P̂(Xi)
The second term does not depend on the network structure, and is thus irrelevant for selecting between two structures
The score increases as the mutual information, i.e., the strength of dependence between connected variables, increases
After some manipulation one can show:
Score_L(G:D) = −M ∑_{i=1}^n I_P̂(Xi ; {X1,…,Xi−1} − Pa_{Xi}^G | Pa_{Xi}^G) − M · H_P̂(X1,…,Xn)
i.e., the score measures to what extent the implied Markov assumptions are valid
Limitations of the Likelihood Score
For G0 the empty network over X and Y, and G1 the network X→Y:
Score_L(G1:D) − Score_L(G0:D) = M · I_P̂(X;Y)
Since I_P̂(X;Y) ≥ 0, we have Score_L(G1:D) ≥ Score_L(G0:D)
Adding arcs always helps
Maximal scores are attained for the fully connected network
Such networks overfit the data (i.e., fit the noise in the data)
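This effect is easy to reproduce; below is a toy sketch (illustrative, not from the lecture) that computes the MLE likelihood score of a structure from counts, where adding a parent never decreases the score:

```python
import math
from collections import Counter

def loglik_score(data, parents):
    """Likelihood score of a structure with MLE parameters:
    sum over families of M[x,u] * log theta_hat[x|u].
    `data` is a list of tuples over variables 0..n-1;
    `parents[i]` is a tuple of parent indices for variable i."""
    score = 0.0
    for i, pa in enumerate(parents):
        # Joint counts of (parent assignment, child value) and
        # marginal counts of the parent assignment
        fam = Counter((tuple(row[j] for j in pa), row[i]) for row in data)
        ctx = Counter(tuple(row[j] for j in pa) for row in data)
        score += sum(m * math.log(m / ctx[u]) for (u, x), m in fam.items())
    return score
```

On any dataset where X and Y are even slightly correlated, the score of X→Y strictly exceeds that of the empty network, which is why the raw likelihood score favors the fully connected graph.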
Avoiding Overfitting
A classical problem in machine learning. Solutions:
Restricting the hypothesis space: limits the overfitting capability of the learner. Example: restrict the number of parents or the number of parameters.
Minimum description length: description length measures complexity; prefer models that compactly describe the training data.
Bayesian methods: average over all possible parameter values; use prior knowledge.
Bayesian Score
P(G|D) = P(D|G) · P(G) / P(D)
P(D|G): marginal likelihood
P(G): prior over structures
P(D): marginal probability of the data; does not depend on the network
Bayesian score:
Score_B(G:D) = log P(D|G) + log P(G)

Marginal Likelihood of Data Given G
P(D|G) = ∫ P(D | G, θ_G) · P(θ_G | G) dθ_G
(likelihood times prior over parameters)
Note the similarity to the maximum likelihood score, but with the key difference that ML finds the maximum of the likelihood, whereas here we average over the parameter space
Marginal Likelihood: Binomial Case
Assume a sequence of M coin tosses. By the chain rule for probabilities:
P(x[1],…,x[M]) = P(x[1]) · P(x[2] | x[1]) ⋯ P(x[M] | x[1],…,x[M−1])
Recall that for Dirichlet priors:
P(x[m+1] = H | x[1],…,x[m]) = (α_H + M_H^m) / (α + m)
where M_H^m is the number of heads in the first m examples and α = α_H + α_T
Thus:
P(x[1],…,x[M]) = [α_H(α_H+1)⋯(α_H+M_H−1)] · [α_T(α_T+1)⋯(α_T+M_T−1)] / [α(α+1)⋯(α+M−1)]
Simplifying using Γ(x+1) = x·Γ(x):
P(x[1],…,x[M]) = [Γ(α) / Γ(α+M)] · [Γ(α_H+M_H) / Γ(α_H)] · [Γ(α_T+M_T) / Γ(α_T)]
For multinomials with a Dirichlet prior:
P(x[1],…,x[M]) = [Γ(α) / Γ(α+M)] · ∏_{i=1}^k [Γ(α_i + M[x_i]) / Γ(α_i)]
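The Gamma-function form above translates directly into code; this is a minimal sketch using log-Gamma for numerical stability:

```python
from math import lgamma

def log_marginal_dirichlet(counts, alphas):
    """Log marginal likelihood of multinomial counts under a
    Dirichlet(alphas) prior, following the Gamma-function form:
    log G(a) - log G(a+M) + sum_i [log G(a_i+M_i) - log G(a_i)],
    where a = sum of alphas and M = sum of counts."""
    a = sum(alphas)
    M = sum(counts)
    return (lgamma(a) - lgamma(a + M)
            + sum(lgamma(ai + mi) - lgamma(ai)
                  for ai, mi in zip(alphas, counts)))
```

For example, a single head under a uniform Dirichlet(1,1) prior gives probability 1/2, and the sequence H,T,H gives 1/12, matching the chain-rule product (1/2)(1/3)(2/4).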
Actual experiment with P(H) = 0.25
[Figure: normalized log marginal likelihood (log P(D))/M as a function of the number of samples M, for priors Dirichlet(0.5, 0.5), Dirichlet(1, 1), and Dirichlet(5, 5)]
Marginal Likelihood Example
Seven paired tosses of two coins X and Y:
X: H T T H T H H
Y: H T H H T T H
The network structure determines the form of the marginal likelihood
Network 1 (X and Y independent): two Dirichlet marginal likelihoods
P(X[1],…,X[7]) · P(Y[1],…,Y[7])
Network 2 (X→Y): three Dirichlet marginal likelihoods, with the Y values grouped by the value of their parent X
P(X[1],…,X[7]) · P(Y[1],Y[4],Y[6],Y[7]) · P(Y[2],Y[3],Y[5])
Idealized Experiment
P(X = H) = 0.5, P(Y = H | X = H) = 0.5 + p, P(Y = H | X = T) = 0.5 − p
[Figure: normalized log marginal likelihood (log P(D))/M as a function of M, for the independent model and for p = 0.05, 0.10, 0.15, 0.20]
Marginal Likelihood: Bayes Nets
In general, the marginal likelihood has the form:
P(D|G) = ∏_i ∏_{pa_i^G} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + M[pa_i^G]) ] · ∏_{x_i} [ Γ(α(x_i, pa_i^G) + M[x_i, pa_i^G]) / Γ(α(x_i, pa_i^G)) ]
where the M(·) are the counts from the data and the α(·) are the hyperparameters for each family; each inner factor is a Dirichlet marginal likelihood for the sequence of values of Xi observed when Xi's parents take a particular value
Priors
Structure prior P(G):
Uniform prior: P(G) ∝ constant
Prior penalizing the number of edges: P(G) ∝ c^|G| (0 < c < 1)
The normalizing constant across networks is similar and can thus be ignored

Parameter prior P(θ|G): the BDe prior
M0: equivalent sample size
B0: a network representing the prior probability of events
Set α(x_i, pa_i^G) = M0 · P(x_i, pa_i^G | B0)
Note: pa_i^G are not necessarily the same as the parents of Xi in B0
Compute P(x_i, pa_i^G | B0) using standard inference in B0
BDe has the desirable property that I-equivalent networks have the same Bayesian score when using the BDe prior for some M' and P'
Bayesian Score: Asymptotic Behavior
As M → ∞, a network G with Dirichlet priors satisfies
log P(D|G) = ℓ(θ̂_G : D) − (log M / 2) · Dim(G) + O(1)
where Dim(G) is the number of independent parameters in G
The approximation is called the BIC score:
Score_BIC(G:D) = ℓ(θ̂_G : D) − (log M / 2) · Dim(G)
  = M ∑_{i=1}^n I_P̂(Xi ; Pa_{Xi}^G) − M ∑_{i=1}^n H_P̂(Xi) − (log M / 2) · Dim(G)
The score exhibits a tradeoff between fit to data and complexity:
mutual information grows linearly with M, while complexity grows logarithmically with M; as M grows, more emphasis is given to the fit to the data
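As an illustrative sketch (assuming fully observed discrete data; names are not from the lecture), the BIC tradeoff can be computed directly from counts:

```python
import math
from collections import Counter

def bic_score(data, parents, card):
    """BIC sketch: MLE log-likelihood minus (log M / 2) * Dim(G).
    `data` is a list of tuples, `parents[i]` a tuple of parent
    indices, and `card[i]` the number of values of variable i."""
    M = len(data)
    loglik = 0.0
    dim = 0
    for i, pa in enumerate(parents):
        fam = Counter((tuple(r[j] for j in pa), r[i]) for r in data)
        ctx = Counter(tuple(r[j] for j in pa) for r in data)
        loglik += sum(m * math.log(m / ctx[u]) for (u, x), m in fam.items())
        # Each parent configuration contributes card[i]-1 free parameters
        n_pa_configs = 1
        for j in pa:
            n_pa_configs *= card[j]
        dim += n_pa_configs * (card[i] - 1)
    return loglik - (math.log(M) / 2) * dim
```

On data where X and Y are independent, the complexity penalty makes the empty network beat X→Y once M is moderately large, unlike the raw likelihood score.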
Bayesian Score: Asymptotic Behavior
As M → ∞, a network G with Dirichlet priors satisfies
log P(D|G) = ℓ(θ̂_G : D) − (log M / 2) · Dim(G) + O(1)
The Bayesian score is consistent: as M → ∞, the true structure G* maximizes the score
Spurious edges will not contribute to the likelihood and will be penalized
Required edges will be added, due to the linear growth in M of the likelihood term compared to the logarithmic growth of the model complexity term
Summary: Network Scores
Likelihood, MDL, and (log) BDe all have the form
Score(G:D) = ∑_i Score(Xi | Pa_{Xi}^G : D)
BDe requires assessing a prior network; it can naturally incorporate prior knowledge
BDe is consistent and asymptotically equivalent (up to a constant) to BIC/MDL
All are score-equivalent: if G is I-equivalent to G', then Score(G) = Score(G')
Optimization Problem
Input: training data; a scoring function (including priors, if needed); a set of possible structures, including prior knowledge about structure
Output: a network (or networks) that maximizes the score
Key property, decomposability: the score of a network is a sum of per-family terms:
Score(G:D) = ∑_i Score(Xi | Pa_{Xi}^G : D)
Trees
At most one parent per variable
Why trees? Elegant math: we can solve the optimization problem efficiently (with a greedy algorithm); the sparse parameterization avoids overfitting while adapting to the data

Learning Trees
Let p(i) denote the parent of Xi, or 0 if Xi has no parent
We can write the score as
Score(G:D) = ∑_i Score(Xi : Pa_{Xi})
  = ∑_{i: p(i)>0} Score(Xi | X_{p(i)}) + ∑_{i: p(i)=0} Score(Xi)
  = ∑_{i: p(i)>0} [ Score(Xi | X_{p(i)}) − Score(Xi) ] + ∑_i Score(Xi)
Score = sum of edge scores + constant: the last sum is the score of the "empty" network, and each bracketed term is the improvement over the "empty" network contributed by the edge X_{p(i)} → Xi
Learning Trees
Algorithm:
Construct a graph with vertices 1,…,n
Set w(i→j) = Score(Xj | Xi) − Score(Xj)
Find the tree (or forest) with maximal weight
This can be done with standard algorithms in low-order polynomial time, building the tree greedily (e.g., Kruskal's maximum spanning tree algorithm)
Theorem: the procedure finds the tree with maximal score
When the score is the likelihood, w(i→j) is proportional to I(Xi; Xj); this is known as the Chow-Liu method
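A minimal Chow-Liu sketch (illustrative, not the lecture's code): weight each pair by empirical mutual information and keep a maximum-weight spanning tree via Kruskal's algorithm with union-find:

```python
import math
from collections import Counter
from itertools import combinations

def chow_liu_edges(data, n):
    """Chow-Liu sketch: returns the undirected edges of a maximum
    mutual-information spanning tree over variables 0..n-1.
    `data` is a list of tuples of discrete values."""
    M = len(data)

    def mi(i, j):
        joint = Counter((r[i], r[j]) for r in data)
        pi = Counter(r[i] for r in data)
        pj = Counter(r[j] for r in data)
        return sum((m / M) * math.log(m * M / (pi[x] * pj[y]))
                   for (x, y), m in joint.items())

    parent = list(range(n))           # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for w, i, j in sorted(((mi(i, j), i, j)
                           for i, j in combinations(range(n), 2)),
                          reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:                  # the edge keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree
```

On data where X0 and X1 are perfectly correlated and X2 is independent, the high-weight edge (0, 1) is always selected first.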
Learning Trees: Example
Tree learned from data generated by the Alarm network
Not every edge in the tree is in the original network, and the tree edge directions are arbitrary: we cannot learn arc directions
[Figure: the original Alarm network alongside the learned tree, with correct and spurious edges marked]
Beyond Trees
The problem is not easy for more complex networks
Example: if two parents are allowed, the greedy algorithm is no longer guaranteed to find the optimal network
In fact, no efficient algorithm exists
Theorem: finding the maximal scoring network structure with at most k parents per variable is NP-hard for k > 1

Fixed Ordering
For any decomposable scoring function Score(G:D) and ordering ≺, the maximal scoring network has:
Pa_{Xi}^G = argmax_{U ⊆ {Xj : Xj ≺ Xi}} Score(Xi | U : D)
For a fixed ordering we have independent problems (since the choice at Xi does not constrain the other choices)
If we bound the in-degree per variable by d, the complexity is exponential in d
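Given a fixed ordering, the independent per-family optimization can be sketched as follows (an illustrative sketch; `fam_score` is an assumed decomposable family score, e.g. a BIC family term):

```python
from itertools import combinations

def best_parents_given_order(order, d, fam_score):
    """For a fixed ordering, pick each parent set independently:
    Pa(Xi) = argmax over subsets U of Xi's predecessors with
    |U| <= d of fam_score(Xi, U)."""
    parents = {}
    for pos, X in enumerate(order):
        preds = order[:pos]
        # Enumerate all predecessor subsets up to the in-degree bound
        candidates = (U for k in range(min(d, len(preds)) + 1)
                      for U in combinations(preds, k))
        parents[X] = max(candidates, key=lambda U: fam_score(X, U))
    return parents
```

Because each variable's choice is independent of the others, the loop body never needs to revisit earlier decisions; the cost is one subset enumeration per family.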
Heuristic Search
We address the problem by using heuristic search
Define a search space: nodes are possible structures; edges denote adjacency of structures
Traverse this space looking for high-scoring structures
Search techniques: greedy hill-climbing, best-first search, simulated annealing, ...
Typical operations: add an edge, delete an edge, reverse an edge
[Figure: a four-node network over A, B, C, D and the neighboring structures obtained by adding B→D, deleting B→C, and reversing C→B]
Exploiting Decomposability
Caching: to update the score after a local change, we only need to rescore the families that were changed in the last move
[Figure: the same four-node network and its neighbors, showing that each local move changes only one or two families]
Greedy Hill Climbing
The simplest heuristic local search
Start with a given network: the empty network, the best tree, or a random network
At each iteration: evaluate all possible changes, apply the change that leads to the best improvement in score, and reiterate
Stop when no modification improves the score
Each step requires evaluating O(n) new changes
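The loop above can be sketched as follows (illustrative only: acyclicity checks and the edge-reversal operator are omitted for brevity, and `score` is an assumed decomposable scoring function on edge sets):

```python
def greedy_hill_climb(n, score, start=None):
    """Greedy hill-climbing sketch over structures represented as
    sets of directed edges (i, j) among variables 0..n-1."""
    edges = set(start or [])
    current = score(edges)
    while True:
        best_delta, best_move = 0.0, None
        # Evaluate all single-edge additions and deletions
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                move = (edges - {(i, j)} if (i, j) in edges
                        else edges | {(i, j)})
                delta = score(move) - current
                if delta > best_delta:
                    best_delta, best_move = delta, move
        if best_move is None:         # local maximum (or plateau)
            return edges
        edges, current = best_move, current + best_delta
```

A real implementation would cache family scores so each move is rescored incrementally, as described under decomposability above.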
Greedy Hill Climbing: Pitfalls
Greedy hill-climbing can get stuck in:
Local maxima: all one-edge changes reduce the score
Plateaus: some one-edge changes leave the score unchanged; this happens because equivalent networks receive the same score and are neighbors in the search space
Both occur during structure search
Standard heuristics can escape both: random restarts, TABU search
Equivalence Class Search
Idea: search the space of equivalence classes; equivalence classes can be represented by PDAGs (partially directed acyclic graphs)
Advantages: the PDAG space has fewer local maxima and plateaus, and there are fewer PDAGs than DAGs
Disadvantages: evaluating changes is more expensive, since in addition to the search we need to score a consistent network, and these algorithms are more complex to implement
[Figure: adding the edge Y–Z to the original PDAG X→Y, extracting a consistent DAG X→Y→Z, scoring it, and forming the new PDAG]
Learning Example: Alarm Network
[Figure: KL divergence from the true distribution as a function of the number of samples M, for the true structure with BDe prior M' = 10 and for an unknown structure with BDe prior M' = 10]
Model Selection
So far, we focused on a single model: find the best scoring model and use it to predict the next example
Implicit assumption: the best scoring model dominates the weighted sum; valid with many data instances
Pros: we get a single structure, which allows for efficient use in our tasks
Cons: we are committing to the independencies of a particular structure, and other structures might be as probable given the data
Density estimation: a single structure may suffice, if its joint distribution is similar to those of other high-scoring structures
Structure discovery: define features f(G) (e.g., an edge, a sub-structure, a d-separation query) and compute
P(f|D) = ∑_G f(G) · P(G|D)
This still requires summing over exponentially many structures
Model Averaging Given an Order
Assumptions: a known total order ≺ over the variables; a maximum in-degree d for every variable
Marginal likelihood:
P(D|≺) = ∑_{G∈G_d} P(D|G) · P(G|≺)
  = ∑_{G∈G_d} ∏_i exp( FamScore(Xi | Pa_{Xi}^G : D) )
  = ∏_i ∑_{U : U ⊆ {X : X ≺ Xi}, |U| ≤ d} exp( FamScore(Xi | U : D) )
using a decomposability assumption on the prior P(G|≺), and since, given the ordering ≺, the parent choices are independent
Cost per family: O(n^d); total cost: O(n^(d+1))
Model Averaging Given an Order
Posterior probability of a general feature:
P(f|D,≺) = ∑_{G∈G_d} f(G) · P(D|G) · P(G|≺) / P(D|≺)
Posterior probability of a particular choice of parents (all other terms cancel out):
P(Pa(Xi) = U | D,≺) = exp( FamScore(Xi | U : D) ) / ∑_{U' : U' ⊆ {X : X ≺ Xi}, |U'| ≤ d} exp( FamScore(Xi | U' : D) )
Posterior probability of a particular edge:
P(Xj → Xi | D,≺) = [ ∑_{U : Xj ∈ U, U ⊆ {X : X ≺ Xi}, |U| ≤ d} exp( FamScore(Xi | U : D) ) ] / [ ∑_{U : U ⊆ {X : X ≺ Xi}, |U| ≤ d} exp( FamScore(Xi | U : D) ) ]
Model Averaging
We cannot assume that the order is known
Solution: sample from the posterior distribution P(G|D), and estimate the feature probability by
P(f|D) ≈ (1/K) ∑_{k=1}^K f(G_k)
Sampling can be done by MCMC; sampling can also be done over orderings
Notes on Learning Local Structures
Define the score with local structures. Example: with tree CPDs, the score decomposes by leaves.
The prior may need to be extended. Example: with tree CPDs, a penalty for the tree structure of each CPD.
Extend the search operators to local structure. Example: with tree CPDs, we need to search for the tree structure; this can be done by a local encapsulated search or by defining new global operations.
Search: Summary
A discrete optimization problem; NP-hard in general
We need to resort to heuristic search
In practice, search is relatively fast (~100 vars in ~10 min), thanks to decomposability and sufficient statistics
In some cases, we can reduce the search problem to an easy optimization problem. Example: learning trees.