Learning Bayesian Networks from Data
Nir Friedman (Hebrew U.) and Daphne Koller (Stanford)

Page 1: Learning Bayesian Networks from Data Nir Friedman Daphne Koller Hebrew U. Stanford


Learning Bayesian Networks from Data

Nir Friedman Daphne Koller

Hebrew U. Stanford

Page 2:

Overview

Introduction
Parameter Estimation
Model Selection
Structure Discovery
Incomplete Data
Learning from Structured Data

Page 3:

Bayesian Networks

Qualitative part: a directed acyclic graph (DAG)
  Nodes: random variables
  Edges: direct influence

Quantitative part: a set of conditional probability distributions, e.g. a table for P(A | E,B) with one row per configuration of B and E (entries such as 0.9/0.1, 0.2/0.8, 0.01/0.99)

[Figure: the "family of Alarm" network with nodes Earthquake, Burglary, Alarm, Radio, Call]

Compact representation of probability distributions via conditional independence.

Together they define a unique distribution in factored form:

P(B,E,A,C,R) = P(B) P(E) P(A | B,E) P(R | E) P(C | A)
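The factored form can be evaluated directly from the local CPTs; a minimal sketch (the CPT numbers below are illustrative assumptions, not the slide's values):

```python
# Factored Alarm distribution: P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A).
# All probabilities here are made-up placeholders for illustration.
from itertools import product

p_b = {True: 0.01, False: 0.99}            # P(Burglary)
p_e = {True: 0.02, False: 0.98}            # P(Earthquake)
p_a = {                                     # P(Alarm=True | B, E)
    (True, True): 0.9, (True, False): 0.8,
    (False, True): 0.2, (False, False): 0.01,
}
p_r = {True: 0.9, False: 0.05}             # P(Radio=True | E)
p_c = {True: 0.7, False: 0.01}             # P(Call=True | A)

def joint(b, e, a, c, r):
    """Joint probability as the product of the five local factors."""
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pr = p_r[e] if r else 1 - p_r[e]
    pc = p_c[a] if c else 1 - p_c[a]
    return p_b[b] * p_e[e] * pa * pr * pc

# The factored form defines a proper joint distribution: it sums to 1.
total = sum(joint(*vals) for vals in product([True, False], repeat=5))
```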

Page 4:

Example: the "ICU Alarm" network

Domain: monitoring intensive-care patients
  37 variables
  509 parameters …instead of 2^54

[Figure: the ICU Alarm network, with nodes including MINVOLSET, VENTMACH, DISCONNECT, VENTTUBE, KINKEDTUBE, INTUBATION, PULMEMBOLUS, PAP, SHUNT, ANAPHYLAXIS, MINVOL, VENTLUNG, VENTALV, ARTCO2, EXPCO2, SAO2, PVSAT, FIO2, CATECHOL, HR, HRBP, HREKG, HRSAT, ERRCAUTER, HRHISTORY, CO, STROKEVOLUME, LVFAILURE, LVEDVOLUME, HYPOVOLEMIA, CVP, PCWP, BP, TPR, PRESS, INSUFFANESTH, ERRLOWOUTPUT]

Page 5:

Inference

Posterior probabilities: probability of any event given any evidence
Most likely explanation: scenario that explains the evidence
Rational decision making: maximize expected utility
Value of information
Effect of intervention

[Figure: the Alarm network (Earthquake, Burglary, Alarm, Radio, Call), with Radio and Call observed as evidence]

Page 6:

Why learning?

Knowledge acquisition bottleneck:
  Knowledge acquisition is an expensive process
  Often we don't have an expert

Data is cheap:
  The amount of available information is growing rapidly
  Learning allows us to construct models from raw data

Page 7:

Why Learn Bayesian Networks?

Conditional independencies & graphical language capture the structure of many real-world distributions

Graph structure provides much insight into the domain
  Allows "knowledge discovery"

The learned model can be used for many tasks

Supports all the features of probabilistic learning
  Model selection criteria
  Dealing with missing data & hidden variables

Page 8:

Learning Bayesian networks

[Figure: Data + Prior Information are fed to a Learner, which outputs a network over E, R, B, A, C together with its CPTs, e.g. P(A | E,B) with entries such as .9/.1, .7/.3, .8/.2, .99/.01]

Page 9:

Known Structure, Complete Data

Network structure is specified; the inducer needs to estimate parameters
Data does not contain missing values

[Figure: the Learner maps a dataset of complete records over E, B, A, e.g. <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>, plus the known structure with unknown CPT entries ("?"), to the structure with estimated CPTs for P(A | E,B)]

Page 10:

Unknown Structure, Complete Data

Network structure is not specified; the inducer needs to select arcs & estimate parameters
Data does not contain missing values

[Figure: the Learner maps complete records over E, B, A, e.g. <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>, to a learned structure with estimated CPTs]

Page 11:

Known Structure, Incomplete Data

Network structure is specified
Data contains missing values; need to consider assignments to the missing values

[Figure: the Learner maps incomplete records over E, B, A, e.g. <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y>, plus the known structure, to estimated CPTs]

Page 12:

Unknown Structure, Incomplete Data

Network structure is not specified
Data contains missing values; need to consider assignments to the missing values

[Figure: the Learner maps incomplete records over E, B, A, e.g. <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y>, to a learned structure with estimated CPTs]

Page 13:

Overview

Introduction
Parameter Estimation
  Likelihood function
  Bayesian estimation
Model Selection
Structure Discovery
Incomplete Data
Learning from Structured Data

Page 14:

Learning Parameters

[Figure: network with E and B as parents of A, and A as parent of C]

Training data has the form:

D = { <E[1], B[1], A[1], C[1]>, …, <E[M], B[M], A[M], C[M]> }

Page 15:

Likelihood Function

Assume i.i.d. samples. The likelihood function is

L(Θ : D) = Π_m P(E[m], B[m], A[m], C[m] : Θ)

Page 16:

Likelihood Function

By the definition of the network, we get

L(Θ : D) = Π_m P(E[m], B[m], A[m], C[m] : Θ)
         = Π_m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | B[m], E[m] : Θ) P(C[m] | A[m] : Θ)

Page 17:

Likelihood Function

Rewriting terms, we get

L(Θ : D) = Π_m P(E[m], B[m], A[m], C[m] : Θ)
         = [Π_m P(E[m] : Θ)] · [Π_m P(B[m] : Θ)] · [Π_m P(A[m] | B[m], E[m] : Θ)] · [Π_m P(C[m] | A[m] : Θ)]

Page 18:

General Bayesian Networks

Generalizing to any Bayesian network:

L(Θ : D) = Π_m P(x_1[m], …, x_n[m] : Θ)
         = Π_i Π_m P(x_i[m] | Pa_i[m] : Θ_i)
         = Π_i L_i(Θ_i : D)

Decomposition: independent estimation problems, one per family.
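The decomposition can be checked numerically; a sketch with a made-up two-variable network X -> Y (all parameters and data are illustrative assumptions):

```python
# The log-likelihood of a BN decomposes into one term per family
# (a variable together with its parents): log L = sum_i log L_i.
import math

theta_x = {1: 0.6, 0: 0.4}                        # P(X)
theta_y = {(1, 1): 0.7, (0, 1): 0.3,              # P(Y=y | X=x), keyed (y, x)
           (1, 0): 0.2, (0, 0): 0.8}

data = [(1, 1), (1, 0), (0, 0), (1, 1), (0, 1)]   # i.i.d. samples (x, y)

# Joint log-likelihood, computed directly from the factored joint.
ll_joint = sum(math.log(theta_x[x] * theta_y[(y, x)]) for x, y in data)

# The same quantity as a sum of per-family log-likelihoods.
ll_x = sum(math.log(theta_x[x]) for x, _ in data)
ll_y = sum(math.log(theta_y[(y, x)]) for x, y in data)
```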

Page 19:

Likelihood Function: Multinomials

L(Θ : D) = P(D | Θ) = Π_m P(x[m] | Θ)

The likelihood for the sequence H, T, T, H, H is

L(Θ : D) = Θ · (1-Θ) · (1-Θ) · Θ · Θ = Θ^3 (1-Θ)^2

General case:

L(Θ : D) = Π_{k=1..K} Θ_k^{N_k}

where Θ_k is the probability of the kth outcome and N_k is the count of the kth outcome in D.
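A quick numeric check of the coin example; a grid search stands in for the closed-form MLE N_H / (N_H + N_T):

```python
# D = H,T,T,H,H gives L(theta : D) = theta^3 (1 - theta)^2.
def likelihood(theta, n_h=3, n_t=2):
    return theta ** n_h * (1 - theta) ** n_t

# The maximizer is the empirical frequency N_H / N = 3/5 = 0.6;
# a fine grid recovers it.
grid = [i / 1000 for i in range(1001)]
mle = max(grid, key=likelihood)
```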

Page 20:

Bayesian Inference

Represent uncertainty about parameters using a probability distribution over parameters and data.

Learning using Bayes rule:

P(Θ | x[1], …, x[M]) = P(x[1], …, x[M] | Θ) P(Θ) / P(x[1], …, x[M])

i.e., posterior = likelihood × prior / probability of the data.

Page 21:

Bayesian Inference

Represent the Bayesian distribution as a Bayes net: Θ is a parent of the observed data X[1], X[2], …, X[M].

The values of X are independent given Θ
P(x[m] | Θ) = Θ_{x[m]}
Bayesian prediction is inference in this network

Page 22: Learning Bayesian Networks from Data Nir Friedman Daphne Koller Hebrew U. Stanford

22

Priors for each parameter group are independent Data instances are independent given the

unknown parameters

X

X[1] X[2] X[M] X[M+1]

Observed dataPlate notation

Y[1] Y[2] Y[M] Y[M+1]

Y|X X

Y|Xm

X[m]

Y[m]

Query

Bayesian Nets & Bayesian Prediction

Page 23:

Bayesian Nets & Bayesian Prediction

We can also "read" from the network: given complete data, the posteriors on the parameters are independent.

So we can compute the posterior over each parameter separately!

[Figure: observed data X[1..M], Y[1..M] with independent parameter posteriors on Θ_X and Θ_Y|X]

Page 24:

Learning Parameters: Summary

Estimation relies on sufficient statistics; for multinomials these are the counts N(x_i, pa_i).

Parameter estimation:

MLE:                  θ_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
Bayesian (Dirichlet): θ_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))

Both are asymptotically equivalent and consistent
Both can be implemented in an on-line manner by accumulating sufficient statistics
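Both estimators are one-liners given the counts; a sketch with illustrative counts and a uniform Dirichlet prior (all numbers are assumptions for illustration):

```python
# MLE vs. Bayesian (Dirichlet) estimation from sufficient statistics
# N(x_i, pa_i), for a single parent configuration pa_i.
counts = {"a": 6, "b": 3, "c": 1}        # N(x_i, pa_i)
alpha = {"a": 1.0, "b": 1.0, "c": 1.0}   # Dirichlet hyperparameters

n_total = sum(counts.values())           # N(pa_i)
a_total = sum(alpha.values())            # alpha(pa_i)

mle = {x: counts[x] / n_total for x in counts}
bayes = {x: (alpha[x] + counts[x]) / (a_total + n_total) for x in counts}
```

With more data the two estimates converge, which is the asymptotic equivalence the slide mentions.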

Page 25:

Overview

Introduction
Parameter Estimation
Model Selection
  Scoring function
  Structure search
Structure Discovery
Incomplete Data
Learning from Structured Data

Page 26:

Why Struggle for Accurate Structure?

Adding an arc: increases the number of parameters to be estimated; wrong assumptions about domain structure

Missing an arc: cannot be compensated for by fitting parameters; wrong assumptions about domain structure

[Figure: the true network over Earthquake, Alarm Set, Burglary, and Sound, flanked by a version with an added arc and a version with a missing arc]

Page 27:

Score-based Learning

Define a scoring function that evaluates how well a structure matches the data, then search for a structure that maximizes the score.

[Figure: a dataset over E, B, A (<Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>) and several candidate structures over E, B, A being scored]

Page 28:

Likelihood Score for Structure

ℓ(G : D) = log L(G : D) = M Σ_i [ I(X_i ; Pa_i^G) − H(X_i) ]

where I(X_i ; Pa_i^G) is the mutual information between X_i and its parents.

Larger dependence of X_i on Pa_i means a higher score
Adding arcs always helps: I(X ; Y) ≤ I(X ; {Y, Z})
The max score is attained by a fully connected network
Overfitting: a bad idea…
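The "adding arcs always helps" property can be observed on empirical mutual information; a sketch on random data (the data and the estimator are illustrative, not from the slides):

```python
# Empirical mutual information I(X; P) from counts, where P may be a
# single parent or a tuple of parents. Enlarging the parent set can
# never decrease empirical MI, which is why the likelihood score overfits.
import math
import random
from collections import Counter

def mi(samples):
    """I(X; P) for samples of pairs (x, p)."""
    n = len(samples)
    cx, cp, cxp = Counter(), Counter(), Counter()
    for x, p in samples:
        cx[x] += 1; cp[p] += 1; cxp[(x, p)] += 1
    return sum(c / n * math.log(c * n / (cx[x] * cp[p]))
               for (x, p), c in cxp.items())

random.seed(0)
data = [(random.randint(0, 1), random.randint(0, 1), random.randint(0, 1))
        for _ in range(200)]

i_xy = mi([(x, y) for x, y, z in data])           # one parent: Y
i_xyz = mi([(x, (y, z)) for x, y, z in data])     # two parents: {Y, Z}
```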

Page 29:

Bayesian Score

Likelihood score: L(G : D) = P(D | G, θ̂_G), using the max-likelihood parameters θ̂_G.

Bayesian approach: deal with uncertainty by assigning probability to all possibilities.

Marginal likelihood (likelihood × prior over parameters):

P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ

P(G | D) = P(D | G) P(G) / P(D)

Page 30:

Heuristic Search

Define a search space:
  search states are possible structures
  operators make small changes to a structure

Traverse the space looking for high-scoring structures. Search techniques:
  Greedy hill-climbing
  Best-first search
  Simulated annealing
  ...

Page 31:

Local Search

Start with a given network:
  empty network
  best tree
  a random network

At each iteration:
  Evaluate all possible changes
  Apply a change based on the score

Stop when no modification improves the score

Page 32:

Heuristic Search

Typical operations on a structure over S, C, E, D: add an arc (e.g. C -> D), delete an arc (e.g. C -> E), reverse an arc (e.g. C -> E becomes E -> C).

Δscore = S({C, E} -> D) − S({E} -> D)

To update the score after a local change, only re-score the families that changed.

Page 33:

Learning in Practice: Alarm domain

[Plot: KL divergence to the true distribution (0 to 2) vs. number of samples (0 to 5000), comparing "structure known, fit parameters" against "learn both structure & parameters"]

Page 34:

Local Search: Possible Pitfalls

Local search can get stuck in:
  Local maxima: all one-edge changes reduce the score
  Plateaux: some one-edge changes leave the score unchanged

Standard heuristics can escape both:
  Random restarts
  TABU search
  Simulated annealing

Page 35:

Improved Search: Weight Annealing

Standard annealing process:
  Take bad steps with probability exp(Δscore / t)
  Probability increases with temperature

Weight annealing:
  Take uphill steps relative to a perturbed score
  Perturbation increases with temperature

[Plot: Score(G | D) as a function of G, illustrating the perturbed score landscape]

Page 36:

Perturbing the Score

Perturb the score by reweighting instances. Each weight is sampled from a distribution with:
  Mean = 1
  Variance proportional to temperature

Instances are still sampled from the "original" distribution, but the perturbation changes the emphasis.

Benefit: allows global moves in the search space.

Page 37:

Weight Annealing: ICU Alarm network

[Plot: cumulative performance of 100 runs of annealed structure search, comparing greedy hill-climbing, annealed search, and the true structure with learned parameters]

Page 38:

Structure Search: Summary

Discrete optimization problem

In some cases the optimization problem is easy (example: learning trees); in general it is NP-hard

Need to resort to heuristic search. In practice, search is relatively fast (~100 vars in ~2-5 min), thanks to:
  Decomposability
  Sufficient statistics

Adding randomness to the search is critical

Page 39:

Overview

Introduction
Parameter Estimation
Model Selection
Structure Discovery
Incomplete Data
Learning from Structured Data

Page 40:

Structure Discovery

Task: discover structural properties
  Is there a direct connection between X & Y?
  Does X separate two "subsystems"?
  Does X causally affect Y?

Example: scientific data mining
  Disease properties and symptoms
  Interactions between the expression of genes

Page 41:

Discovering Structure

Current practice: model selection
  Pick a single high-scoring model
  Use that model to infer domain structure

[Figure: the posterior P(G | D) concentrated on a single network over E, R, B, A, C]

Page 42:

Discovering Structure

Problem:
  A small sample size means many high-scoring models
  An answer based on one model is often useless
  We want features common to many models

[Figure: the posterior P(G | D) spread over several distinct networks over E, R, B, A, C]

Page 43:

Bayesian Approach

Posterior distribution over structures. Estimate the probability of features:
  Edge X -> Y
  Path X -> … -> Y
  …

P(f | D) = Σ_G f(G) P(G | D)

where f(G) is the indicator function for feature f (e.g., X -> Y) and P(G | D) is the Bayesian score for G.

Page 44:

MCMC over Networks

Cannot enumerate structures, so sample structures.

MCMC sampling:
  Define a Markov chain over BNs
  Run the chain to get samples G_1, …, G_n from the posterior P(G | D)

P(f(G) | D) ≈ (1/n) Σ_{i=1..n} f(G_i)

Possible pitfalls:
  Huge (superexponential) number of networks
  Time for the chain to converge to the posterior is unknown
  Islands of high posterior, connected by low bridges
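The averaging step is straightforward once samples are available; a sketch where the "sampled" structures are hand-written edge sets standing in for a real MCMC chain (purely illustrative):

```python
# Estimate P(f(G) | D) as the fraction of sampled structures exhibiting
# the feature f, per the Monte Carlo averaging formula.
samples = [                      # stand-ins for MCMC samples G_1..G_n
    {("E", "A"), ("B", "A")},
    {("E", "A")},
    {("E", "A"), ("B", "A"), ("A", "C")},
    {("B", "A")},
]

def feature(g, edge=("E", "A")):
    """Indicator f(G) for the feature 'edge E -> A is present'."""
    return 1.0 if edge in g else 0.0

p_edge = sum(feature(g) for g in samples) / len(samples)
```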

Page 45:

ICU Alarm BN: No Mixing

500 instances: the runs clearly do not mix.

[Plot: score of the current sample (-9400 to -8400) vs. MCMC iteration (0 to 600000), for chains started from the empty network and from a greedy solution]

Page 46:

Effects of Non-Mixing

Two MCMC runs over the same 500 instances; probability estimates for edges from the two runs.

The probability estimates are highly variable and nonrobust.

[Scatter plots: edge-probability estimates from a random start and from the true BN, each compared against the true BN]

Page 47:

Fixed Ordering

Suppose that we know the ordering of the variables, say X1 > X2 > X3 > X4 > … > Xn, so the parents of Xi must come from {X1, …, X(i-1)}; and we limit the number of parents per node to k.

This leaves 2^(k·n·log n) networks.

Intuition: the order decouples the choice of parents; the choice of Pa(X7) does not restrict the choice of Pa(X12).

Upshot: we can compute efficiently, in closed form, both the likelihood P(D | ≺) and the feature probability P(f | D, ≺).

Page 48:

Our Approach: Sample Orderings

We can write

P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D)

Sample orderings ≺_1, …, ≺_n and approximate

P(f | D) ≈ (1/n) Σ_{i=1..n} P(f | ≺_i, D)

MCMC sampling: define a Markov chain over orderings; run the chain to get samples from the posterior P(≺ | D).

Page 49:

Mixing with MCMC-Orderings

4 runs on ICU-Alarm with 500 instances:
  fewer iterations than MCMC-Nets
  approximately the same amount of computation

The process appears to be mixing!

[Plot: score of the current sample (-8450 to -8400) vs. MCMC iteration (0 to 60000), for random and greedy starts]

Page 50:

Mixing of MCMC runs

Two MCMC runs over the same instances; probability estimates for edges, with 500 and with 1000 instances.

The probability estimates are very robust.

Page 51:

Application to Gene Array Analysis

Page 52:

Chips and Features

Page 53:

Application to Gene Array Analysis

See www.cs.tau.ac.il/~nin/BioInfo04
Bayesian inference: http://www.cs.huji.ac.il/~nir/Papers/FLNP1Full.pdf

Page 54:

Application: Gene Expression

Input: measurements of gene expression under different conditions
  Thousands of genes
  Hundreds of experiments

Output: models of gene interaction
  Uncover pathways

Page 55:

Map of Feature Confidence

Yeast data [Hughes et al 2000]: 600 genes, 300 experiments

Page 56:

"Mating response" Substructure

An automatically constructed sub-network of high-confidence edges: an almost exact reconstruction of the yeast mating pathway.

[Figure: sub-network over genes including KAR4, AGA1, PRM1, TEC1, SST2, STE6, KSS1, NDJ1, FUS3, AGA2, YEL059W, TOM6, FIG1, YLR343W, YLR334C, MFA1, FUS1]

Page 57:

Summary of the course

Bayesian learning: the idea of treating model parameters as variables with a prior distribution

PAC learning: assigning confidence and accepted error to the learning problem and analyzing polynomial learnability

Boosting and bagging: using the data to estimate multiple models, and fusing them

Page 58:

Summary of the course

Hidden Markov models: the Markov property is widely assumed, and hidden Markov models are very powerful & easily estimated

Model selection and validation: crucial for any type of modeling!

(Artificial) neural networks: the brain performs computations radically different from modern computers, and often much better; we need to learn how. ANNs are a powerful modeling tool (BP, RBF).

Page 59:

Summary of the course

Evolutionary learning: a very different learning rule, with its own merits

VC dimensionality: a powerful theoretical tool for defining solvable problems, though difficult to use in practice

Support vector machines: clean theory, different from classical statistics in that it looks for simple estimators in high dimension rather than reducing the dimension

Bayesian networks: a compact way to represent conditional dependencies between variables

Page 60:

Final project

Page 61:

Probabilistic Relational Models

Key ideas:
  Universals: probabilistic patterns hold for all objects in a class
  Locality: represent direct probabilistic dependencies; links give us potential interactions!

[Figure: classes Student (Intelligence), Reg (Grade, Satisfaction), Course (Difficulty), and Professor (Teaching-Ability), with a bar chart of grade distributions (A/B/C) for easy/hard courses and weak/smart students]

Page 62:

PRM Semantics

An instantiated PRM defines a BN:
  variables: attributes of all objects
  dependencies: determined by the links & the PRM

[Figure: an unrolled network over Prof. Smith and Prof. Jones (Teaching-ability), students Forrest Gump and Jane Doe (Intelligence), courses CS101 and Geo101 (Difficulty), and per-registration Grade and Satisfaction nodes, all sharing the parameters θ_{Grade|Intell,Diffic}]

Page 63: Learning Bayesian Networks from Data Nir Friedman Daphne Koller Hebrew U. Stanford

63

Welcome to

CS101

weak / smart

The Web of Influence

0% 50% 100%0% 50% 100%

Welcome to

Geo101 A

C

weak smart

0% 50% 100%

easy / hard

Objects are all correlated Need to perform inference over entire model For large databases, use approximate inference:

Loopy belief propagation

Page 64:

PRM Learning: Complete Data

The entire database is a single "instance"; parameters are used many times within the instance.

Introduce a prior over parameters; update the prior with sufficient statistics, e.g.:

Count(Reg.Grade = A, Reg.Course.Diff = lo, Reg.Student.Intel = hi)

[Figure: the instantiated network as on the previous slide, with all attribute values observed (e.g. Teaching-ability High/Low, Difficulty Easy, Intelligence Smart/Weak, Grades A/B/C, Satisfaction Like/Hate), sharing θ_{Grade|Intell,Diffic}]

Page 65:

PRM Learning: Incomplete Data

Use expected sufficient statistics.
But everything is correlated: the E-step uses (approximate) inference over the entire model.

[Figure: the instantiated network over CS101 and Geo101 with some attribute values missing ("???") alongside observed ones (grades B, A, C; intelligence Hi, Low, Hi)]

Page 66:

Example: Binomial Data

Prior: uniform for Θ in [0,1], so the posterior is proportional to the likelihood L(Θ : D):

P(Θ | x[1], …, x[M]) ∝ P(x[1], …, x[M] | Θ) P(Θ)

With (N_H, N_T) = (4, 1):

MLE for P(X = H) is 4/5 = 0.8

Bayesian prediction is P(x[M+1] = H | D) = ∫ Θ · P(Θ | D) dΘ = 5/7 ≈ 0.7142
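The slide's numbers can be reproduced directly; under the uniform (Beta(1,1)) prior the predictive takes the (N_H + 1) / (N + 2) form:

```python
# Coin with a uniform prior over theta and data (N_H, N_T) = (4, 1).
n_h, n_t = 4, 1

# Maximum-likelihood estimate of P(X = H).
mle = n_h / (n_h + n_t)                      # 4/5 = 0.8

# Bayesian prediction: posterior mean of theta under Beta(1,1) prior.
bayes_pred = (n_h + 1) / (n_h + n_t + 2)     # 5/7, about 0.7142
```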

Page 67:

Dirichlet Priors

Recall that the likelihood function is

L(Θ : D) = Π_{k=1..K} Θ_k^{N_k}

A Dirichlet prior with hyperparameters α_1, …, α_K:

P(Θ) ∝ Π_{k=1..K} Θ_k^{α_k − 1}

Then the posterior has the same form, with hyperparameters α_1 + N_1, …, α_K + N_K:

P(Θ | D) ∝ P(Θ) P(D | Θ) = Π_k Θ_k^{α_k − 1} · Π_k Θ_k^{N_k} = Π_k Θ_k^{α_k + N_k − 1}

Page 68:

Dirichlet Priors - Example

[Plot: Dirichlet(α_heads, α_tails) densities P(Θ_heads) over Θ_heads in [0,1], for Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), and Dirichlet(5,5)]

Page 69:

Dirichlet Priors (cont.)

If P(Θ) is Dirichlet with hyperparameters α_1, …, α_K, then

P(X[1] = k) = ∫ Θ_k P(Θ) dΘ = α_k / Σ_ℓ α_ℓ

Since the posterior is also Dirichlet, we get

P(X[M+1] = k | D) = ∫ Θ_k P(Θ | D) dΘ = (α_k + N_k) / Σ_ℓ (α_ℓ + N_ℓ)
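A sketch of both predictive rules, with illustrative hyperparameters and counts (the numbers are assumptions, not from the slides):

```python
# Dirichlet prior and posterior predictive distributions.
alpha = [2.0, 1.0, 1.0]        # hyperparameters alpha_1..alpha_K
counts = [3, 0, 1]             # sufficient statistics N_1..N_K

# Prior predictive: P(X[1] = k) = alpha_k / sum(alpha).
prior_pred = [a / sum(alpha) for a in alpha]

# Posterior predictive: P(X[M+1] = k | D) = (alpha_k + N_k) / sum(alpha + N).
post_pred = [(a + n) / (sum(alpha) + sum(counts))
             for a, n in zip(alpha, counts)]
```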

Page 70:

Learning Parameters: Case Study

Instances sampled from the ICU Alarm network; M' is the strength of the prior.

[Plot: KL divergence to the true distribution (0 to 1.4) vs. number of instances (0 to 5000), comparing MLE against Bayesian estimation with M' = 5, 20, 50]

Page 71:

Marginal Likelihood: Multinomials

Fortunately, in many cases the integral has a closed form:
  P(Θ) is Dirichlet with hyperparameters α_1, …, α_K
  D is a dataset with sufficient statistics N_1, …, N_K

Then

P(D) = [Γ(Σ_k α_k) / Γ(Σ_k α_k + N)] · Π_k [Γ(α_k + N_k) / Γ(α_k)]

Page 72:

Marginal Likelihood: Bayesian Networks

The network structure determines the form of the marginal likelihood.

Network 1: X and Y are independent. Two Dirichlet marginal likelihoods: an integral over Θ_X for the X sequence (H,T,T,H,T,H,H) and an integral over Θ_Y for the Y sequence (H,T,H,H,T,T,H), over instances 1..7.

Page 73:

Marginal Likelihood: Bayesian Networks

Network 2: X -> Y. Three Dirichlet marginal likelihoods: an integral over Θ_X, an integral over Θ_{Y|X=H}, and an integral over Θ_{Y|X=T}, with the Y sequence split according to the value of X in each instance.

Page 74:

Marginal Likelihood for Networks

The marginal likelihood has the form:

P(D | G) = Π_i Π_{pa_i^G} [Γ(α(pa_i^G)) / Γ(α(pa_i^G) + N(pa_i^G))] · Π_{x_i} [Γ(α(x_i, pa_i^G) + N(x_i, pa_i^G)) / Γ(α(x_i, pa_i^G))]

i.e., a Dirichlet marginal likelihood for each multinomial P(X_i | pa_i). The N(..) are counts from the data; the α(..) are hyperparameters for each family, given G.

Page 75:

Bayesian Score: Asymptotic Behavior

log P(D | G) = ℓ(G : D) − (log M / 2) dim(G) + O(1)
             = M Σ_i [ I(X_i ; Pa_i^G) − H(X_i) ] − (log M / 2) dim(G) + O(1)

The first term fits the dependencies in the empirical distribution; the second is a complexity penalty.

As M (the amount of data) grows:
  Increasing pressure to fit dependencies in the distribution
  The complexity term avoids fitting noise

Asymptotic equivalence to the MDL score. The Bayesian score is consistent: the observed data eventually overrides the prior.

Page 76:

Structure Search as Optimization

Input:
  Training data
  Scoring function
  Set of possible structures

Output: a network that maximizes the score

Key computational property, decomposability:

score(G) = Σ_i score(family of X_i in G)

Page 77:

Tree-Structured Networks

Trees: at most one parent per variable

Why trees?
  Elegant math: we can solve the optimization problem
  Sparse parameterization: avoids overfitting

[Figure: a tree-structured version of the ICU Alarm network]

Page 78:

Learning Trees

Let p(i) denote the parent of X_i. We can write the Bayesian score as

Score(G : D) = Σ_i Score(X_i : Pa_i)
             = Σ_i [ Score(X_i : X_{p(i)}) − Score(X_i) ] + Σ_i Score(X_i)

Score = sum of edge scores + constant: the constant is the score of the "empty" network, and each edge score is the improvement over the "empty" network.

Page 79:

Learning Trees

Set w(j -> i) = Score(X_j -> X_i) − Score(X_i)
Find the tree (or forest) with maximal weight: a standard maximum spanning tree algorithm, O(n^2 log n)

Theorem: this procedure finds the tree with the maximum score.
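The procedure can be sketched with Kruskal's algorithm over illustrative edge weights, which stand in for the Score differences w(j -> i) (this is an assumption-laden sketch, not the slides' code):

```python
# Maximum-weight spanning tree (forest) over edge weights
# w(j -> i) = Score(X_j -> X_i) - Score(X_i); symmetric scores assumed.
def max_spanning_tree(n, weights):
    """Kruskal's algorithm. weights: dict {(i, j): w} over pairs i < j.
    Non-positive edges are skipped, which may yield a forest."""
    parent = list(range(n))

    def find(x):                       # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj and w > 0:         # keep only score-improving edges
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Illustrative weights over 4 variables.
w = {(0, 1): 2.0, (0, 2): 0.5, (1, 2): 1.5, (2, 3): -0.3}
tree = max_spanning_tree(4, w)
```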

Page 80:

Beyond Trees

When we consider more complex networks, the problem is not as easy. Suppose we allow at most two parents per node: a greedy algorithm is no longer guaranteed to find the optimal network. In fact, no efficient algorithm exists.

Theorem: finding the maximal-scoring structure with at most k parents per node is NP-hard for k > 1.