
Being Bayesian about Network Structure

Nir Friedman
School of Computer Science & Engineering
Hebrew University
Jerusalem, 91904, Israel
[email protected]

Daphne Koller
Computer Science Department
Stanford University
Stanford, CA 94305-9010
[email protected]

Abstract

In many domains, we are interested in analyzing the structure of the underlying distribution, e.g., whether one variable is a direct parent of the other. Bayesian model selection attempts to find the MAP model and use its structure to answer these questions. However, when the amount of available data is modest, there might be many models that have non-negligible posterior. Thus, we want to compute the Bayesian posterior of a feature, i.e., the total posterior probability of all models that contain it. In this paper, we propose a new approach for this task. We first show how to efficiently compute a sum over the exponential number of networks that are consistent with a fixed ordering over network variables. This allows us to compute, for a given ordering, both the marginal probability of the data and the posterior of a feature. We then use this result as the basis for an algorithm that approximates the Bayesian posterior of a feature. Our approach uses a Markov chain Monte Carlo (MCMC) method, but over orderings rather than over network structures. The space of orderings is much smaller and more regular than the space of structures, and has a smoother posterior "landscape". We present empirical results on synthetic and real-life datasets that compare our approach to full model averaging (when possible), to MCMC over network structures, and to a non-Bayesian bootstrap approach.

1 Introduction

In the last decade there has been a great deal of research focused on the problem of learning Bayesian networks (BNs) from data [3, 8]. An obvious motivation for this problem is to learn a model that we can then use for inference or decision making, as a substitute for a model constructed by a human expert. In certain cases, however, our goal is to learn a model of the system not for prediction, but for discovering the domain structure. For example, we might want to use Bayesian network learning to understand the mechanisms by which genes in a cell produce proteins, which in turn cause other genes to express themselves, or prevent them from doing so [6]. In this case, our main goal is to discover the causal and dependence relations between the expression levels of different genes [12].

The common approach to discovering structure is to use learning with model selection to provide us with a single high-scoring model. We then use that model (or its equivalence class) as our model for the structure of the domain. Indeed, in small domains with a substantial amount of data, it has been shown that the highest-scoring model is orders of magnitude more likely than any other [11]. In such cases, model selection is a good approximation. Unfortunately, there are many domains of interest where this situation does not hold. In our gene expression example, it is now possible to measure the expression levels of thousands of genes in one experiment [12] (where each gene is a random variable in our model [6]), but we typically have only a few hundred experiments (each of which is a single data case). In cases like this, where the amount of data is small relative to the size of the model, there are likely to be many models that explain the data reasonably well. Model selection makes an arbitrary choice between these models, and therefore we cannot be confident that the model is a true representation of the underlying process.

Given that there are many qualitatively different structures that are approximately equally good, we cannot learn a unique structure from the data. However, there might be certain features of the domain, e.g., the presence of certain edges, that we can extract reliably. As an extreme example, if two variables are very strongly correlated (e.g., deterministically related to each other), it is likely that an edge between them will appear in any high-scoring model. Our goal, therefore, is to compute how likely a feature such as an edge is to be present over all models, rather than in a single model selected by the learning algorithm. In other words, we are interested in computing
$$P(f \mid D) = \sum_G f(G)\, P(G \mid D), \qquad (1)$$
where $f(G)$ is 1 if the feature holds in $G$ and 0 otherwise. The number of BN structures is super-exponential in the number of random variables in the domain; therefore, this summation can be computed in closed form only for very small domains, or those in which we have additional constraints that restrict the space (as in [11]). Alternatively, this summation can be approximated by considering only a subset of possible structures. Several approximations have been proposed [13, 14]. One theoretically well-founded approach is to use Markov chain Monte Carlo (MCMC) methods: we define a Markov chain over structures whose stationary distribution is the posterior $P(G \mid D)$, we then generate samples from this chain, and use them to estimate Eq. (1).

In this paper, we propose a new approach for evaluating the Bayesian posterior probability of certain structural network properties. Our approach is based on two main ideas. The first is an efficient closed-form equation for summing over all networks with at most $k$ parents per node (for some constant $k$) that are consistent with a fixed ordering over the nodes. This equation allows us both to compute the overall probability of the data for this set of networks, and to compute the posterior probability of certain structural features (edges and Markov blankets) over this set. The second idea is the use of an MCMC approach, but over orderings of the network variables rather than directly over BN structures.

The space of orderings is much smaller than the space of network structures; it also appears to be much less peaked, allowing much faster mixing (i.e., convergence to the stationary distribution of the Markov chain). We present empirical results illustrating this observation, showing that our approach has substantial advantages over direct MCMC over BN structures. The Markov chain over orderings mixes much faster and more reliably than the chain over network structures. Indeed, different runs of MCMC over networks typically lead to very different estimates of the posterior probabilities of structural features, illustrating poor convergence to the stationary distribution; by contrast, different runs of MCMC over orderings converge reliably to the same estimates. We also present results showing that our approach accurately captures dominant features even with sparse data, and that it outperforms both MCMC over structures and the non-Bayesian bootstrap of [5].

2 Bayesian learning of Bayesian networks

Consider the problem of analyzing the distribution over some set of random variables $X_1, \ldots, X_n$, based on a fully observed data set $D = \{x[1], \ldots, x[M]\}$, where each $x[m]$ is a complete assignment to the variables $X_1, \ldots, X_n$.

2.1 The Bayesian learning framework

The Bayesian learning paradigm tells us that we must define a prior probability distribution $P(B)$ over the space of possible Bayesian networks $B$. This prior is then updated using Bayesian conditioning to give a posterior distribution $P(B \mid D)$ over this space. For Bayesian networks, the description of a model $B$ has two components: the structure $G$ of the network, and the values of the numerical parameters $\theta_G$ associated with it. For example, in a discrete Bayesian network of structure $G$, the parameters $\theta_G$ define a multinomial distribution $\theta_{X_i \mid u}$ for each variable $X_i$ and each assignment of values $u$ to $\mathrm{Pa}_G(X_i)$. If we consider Gaussian Bayesian networks over continuous domains, then $\theta_{X_i \mid u}$ contains the coefficients for a linear combination of $u$ and a variance parameter.

To define the prior $P(B)$, we need to define a discrete probability distribution over graph structures $G$, and for each possible graph $G$, to define a continuous distribution over the set of parameters $\theta_G$.

The prior over structures is usually considered the less important of the two components. Unlike other parts of the posterior, it does not grow as the number of data cases grows. Hence, relatively little attention has been paid to the choice of structure prior, and a simple prior is often chosen largely for pragmatic reasons. The simplest and therefore most common choice is a uniform prior over structures [8]. An alternative prior, and the one we use in our experiments, considers the number of options in determining the families of $G$. If we decide that a node $X_i$ has $k$ parents, then there are $\binom{n-1}{k}$ possible parent sets. If we assume that we choose uniformly from these, we get a prior:
$$P(G) \propto \prod_{i=1}^{n} \binom{n-1}{|\mathrm{Pa}_G(X_i)|}^{-1}.$$
Note that the negative logarithm of this prior corresponds to the description length of specifying the parent sets, assuming that the cardinalities of these sets are known. Thus, we implicitly assume that cardinalities of parent sets are uniformly distributed.

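As a small illustration, the following sketch evaluates this family-counting prior in log space. It is our own minimal rendering, not code from the paper; the names log_rho and log_structure_prior, and the representation of parent sets as tuples of node indices, are assumptions made here.

```python
# Sketch of the structure prior discussed above: P(G) is proportional to a
# product over nodes of 1 / C(n-1, |Pa(X_i)|).  Names are ours, not the paper's.
from math import comb, log

def log_rho(n, parent_set):
    """Per-family prior term rho(X_i, Pa(X_i)) in log space."""
    return -log(comb(n - 1, len(parent_set)))

def log_structure_prior(n, parents):
    """parents: dict mapping node index -> tuple of parent indices."""
    return sum(log_rho(n, pa) for pa in parents.values())

# Example: a 4-node chain X0 -> X1 -> X2 -> X3.
print(log_structure_prior(4, {0: (), 1: (0,), 2: (1,), 3: (2,)}))
```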
A key property of all these priors is that they satisfy:

• Structure modularity: The prior $P(G)$ can be written in the form
$$P(G) \propto \prod_i \rho(X_i, \mathrm{Pa}_G(X_i)).$$
That is, the prior decomposes into a product, with a term for each family in $G$. In other words, the choices of the families for the different nodes are independent a priori.

Next we consider the prior over parameters, $P(\theta_G \mid G)$. Here, the form of the prior varies depending on the type of parametric families we consider. In discrete networks, the standard assumption is a Dirichlet prior over $\theta_{X_i \mid u}$ for each node $X_i$ and each instantiation $u$ to its parents [8]. In Gaussian networks, we might use a Wishart prior [9]. For our purpose, we need only require that the prior satisfies two basic assumptions, as presented in [10]:

• Global parameter independence: Let $\theta_{X_i \mid \mathrm{Pa}_G(X_i)}$ be the parameters specifying the behavior of the variable $X_i$ given the various instantiations to its parents. Then we require that
$$P(\theta_G \mid G) = \prod_i P(\theta_{X_i \mid \mathrm{Pa}_G(X_i)} \mid G). \qquad (2)$$

• Parameter modularity: Let $G$ and $G'$ be two graphs in which $\mathrm{Pa}_G(X_i) = \mathrm{Pa}_{G'}(X_i) = U$. Then
$$P(\theta_{X_i \mid U} \mid G) = P(\theta_{X_i \mid U} \mid G'). \qquad (3)$$

Once we define the prior, we can examine the form of the posterior probability. Using Bayes rule, we have that
$$P(G \mid D) \propto P(D \mid G)\, P(G).$$


The term $P(D \mid G)$ is the marginal likelihood of the data given $G$, and is defined as the integral over all possible parameter values for $G$:
$$P(D \mid G) = \int P(D \mid G, \theta_G)\, P(\theta_G \mid G)\, d\theta_G.$$
The term $P(D \mid G, \theta_G)$ is simply the probability of the data given a specific Bayesian network. When the data is complete, this is simply a product of conditional probabilities.

Using the above assumptions, one can show (see [10]):

Theorem 2.1: If $D$ is complete and $P(\theta_G \mid G)$ satisfies parameter independence and parameter modularity, then
$$P(D \mid G) = \prod_i \mathrm{score}(X_i, \mathrm{Pa}_G(X_i) \mid D),$$
where
$$\mathrm{score}(X_i, U \mid D) = \int \prod_m P(x_i[m] \mid u[m], \theta_{X_i \mid U})\, P(\theta_{X_i \mid U})\, d\theta_{X_i \mid U}.$$
If the prior also satisfies structure modularity, we can also conclude that the posterior probability decomposes:
$$P(G \mid D) \propto \prod_i \rho(X_i, \mathrm{Pa}_G(X_i))\, \mathrm{score}(X_i, \mathrm{Pa}_G(X_i) \mid D).$$

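Theorem 2.1 leaves the concrete form of $\mathrm{score}(X_i, U \mid D)$ to the choice of parameter prior. As a hedged illustration only, the sketch below computes one standard instance for discrete data: a Dirichlet ("BDeu"-style) family marginal likelihood with an equivalent sample size. The function name, data representation, and hyperparameter convention are our assumptions, not necessarily the exact choices used in the paper's experiments.

```python
# A hedged sketch of one standard decomposable family score for discrete data:
# the Dirichlet ("BDeu") marginal likelihood of X_i given a candidate parent set,
#   score(X_i, U | D) = prod_j Gamma(a_j)/Gamma(a_j + N_j)
#                              * prod_k Gamma(a_jk + N_jk)/Gamma(a_jk),
# where j ranges over parent configurations and k over values of X_i.
from math import lgamma
from collections import Counter
from itertools import product

def log_family_score(data, i, parents, arities, ess=1.0):
    """data: list of complete assignments (tuples of ints); ess: equivalent sample size."""
    r_i = arities[i]
    q = 1
    for p in parents:
        q *= arities[p]
    a_jk = ess / (q * r_i)          # Dirichlet hyperparameter per (config, value)
    a_j = ess / q                   # its sum over the values of X_i
    counts = Counter((tuple(x[p] for p in parents), x[i]) for x in data)
    total = 0.0
    for cfg in product(*[range(arities[p]) for p in parents]):
        n_j = sum(counts[(cfg, k)] for k in range(r_i))
        total += lgamma(a_j) - lgamma(a_j + n_j)
        for k in range(r_i):
            total += lgamma(a_jk + counts[(cfg, k)]) - lgamma(a_jk)
    return total

# Example: three binary variables, score of X2 with candidate parents {X0, X1}.
D = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
print(log_family_score(D, 2, (0, 1), arities=(2, 2, 2)))
```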
2.2 Bayesian model averaging

Recall that our goal is to compute the posterior probability of some feature $f(G)$ over all possible graphs $G$. This is equal to
$$P(f \mid D) = \sum_G f(G)\, P(G \mid D).$$
The problem, of course, is that the number of possible BN structures is super-exponential: $2^{\Omega(n^2)}$, where $n$ is the number of variables.

We can reduce this number by restricting attention to structures $G$ where there is a bound $k$ on the number of parents per node. This assumption, which we will make throughout this paper, is a fairly innocuous one. There are few applications in which very large families are called for, and there is rarely enough data to support robust parameter estimation for such families. From a more formal perspective, networks with very large families tend to have low score. Let $\mathcal{G}_k$ be the set of all graphs with indegree bounded by $k$. Note that the number of structures in $\mathcal{G}_k$ is still super-exponential in $n$.

Thus, exhaustive enumeration over the set of possible BN structures is feasible only for tiny domains (4–5 nodes). One solution, proposed by several researchers [11, 13, 14], is to approximate this exhaustive enumeration by finding a set $\mathcal{C}$ of high-scoring structures, and then estimating the relative mass of the structures in $\mathcal{C}$ that contain $f$:
$$P(f \mid D) \approx \frac{\sum_{G \in \mathcal{C}} f(G)\, P(G \mid D)}{\sum_{G \in \mathcal{C}} P(G \mid D)}. \qquad (4)$$

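For concreteness, here is a minimal sketch of the estimator in Eq. (4). It assumes we are given a candidate set of structures together with unnormalized log posteriors $\log P(G, D)$ (which is sufficient, since $P(G \mid D) \propto P(G, D)$, so the normalization cancels in the ratio); the names feature_posterior and has_feature are ours.

```python
# Sketch of Eq. (4): estimate P(f | D) by the relative posterior mass, within a
# candidate set C of structures, of those structures that contain the feature f.
from math import exp, log

def logsumexp(xs):
    m = max(xs)
    return m + log(sum(exp(x - m) for x in xs))

def feature_posterior(candidates, has_feature):
    """candidates: dict G -> log P(G, D) (unnormalized log posterior).
    has_feature: predicate f(G) -> bool.  Returns the Eq. (4) estimate."""
    all_scores = list(candidates.values())
    hit_scores = [s for g, s in candidates.items() if has_feature(g)]
    if not hit_scores:
        return 0.0
    return exp(logsumexp(hit_scores) - logsumexp(all_scores))

# Toy example: structures encoded as frozensets of directed edges.
C = {frozenset({(0, 1)}): -10.0, frozenset({(1, 0)}): -10.5, frozenset(): -13.0}
print(feature_posterior(C, lambda g: (0, 1) in g))   # posterior that edge 0 -> 1 is present
```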
This approach leaves open the question of how we construct $\mathcal{C}$. The simplest approach is to use model selection to pick a single high-scoring structure, and then use that as our approximation. If the amount of data is large relative to the size of the model, then the posterior will be sharply peaked around a single model, and this approximation is a reasonable one. However, as we discussed in the introduction, there are many interesting domains (e.g., our biological application) where the amount of data is small relative to the size of the model. In this case, there is usually a large number of high-scoring models, so using a single model as our set $\mathcal{C}$ is a very poor approximation.

A simple approach to finding a larger set is to record all the structures examined during the search, and return the high-scoring ones. However, the set of structures found in this manner is quite sensitive to the search procedure we use. For example, if we use greedy hill-climbing, then the structures we collect will all be quite similar. Such a restricted set of candidates also shows up when we consider multiple restarts of greedy hill-climbing and beam search. This is a serious problem, since we run the risk of getting estimates of confidence that are based on a biased sample of structures.

Madigan and Raftery [13] propose an alternative approach called Occam's window, which rejects models whose posterior probability is very low, as well as complex models whose posterior probability is not substantially better than that of a simpler model (one that contains a subset of the edges). These two principles prune the space of models considered, often to a number small enough to be exhaustively enumerated. Madigan and Raftery also provide a search procedure for finding these models.

An alternative approach, proposed by Madigan and York [14], is based on the use of Markov chain Monte Carlo (MCMC) simulation. In this case, we define a Markov chain over the space of possible structures, whose stationary distribution is the posterior distribution $P(G \mid D)$. We then generate a set of possible structures by doing a random walk in this Markov chain. Assuming that we continue this process until the chain mixes, we can hope to get a set of structures that is representative of the posterior. However, it is not clear how rapidly this type of chain mixes for large domains. The space of structures is very large, and the probability distribution is often quite peaked, with neighboring structures having very different scores. Hence, the mixing rate of the Markov chain can be quite slow.

3 Closed form for known ordering

In this section, we temporarily turn our attention to a somewhat easier problem. Rather than perform model averaging over the space of all structures, we restrict attention to structures that are consistent with some known total ordering $\prec$. In other words, we restrict attention to structures $G$ where if $X_i \in \mathrm{Pa}_G(X_j)$ then $X_i \prec X_j$. This assumption was a standard one in the early work on learning Bayesian networks from data [4].


3.1 Computing marginal likelihood

We first consider the problem of computing the probability of the data given the ordering:
$$P(D \mid \prec) = \sum_{G \in \mathcal{G}_{k,\prec}} P(G \mid \prec)\, P(D \mid G), \qquad (5)$$
where $\mathcal{G}_{k,\prec}$ denotes the set of structures in $\mathcal{G}_k$ that are consistent with $\prec$. Note that this summation, although restricted to networks with bounded indegree and consistent with $\prec$, is still exponentially large in the number of variables.

The key insight is that, when we restrict attention to structures consistent with a given ordering $\prec$, the choice of family for one node places no additional constraints on the choice of family for another. Note that this property does not hold without the restriction on the ordering; for example, if we pick $X_i$ to be a parent of $X_j$, then $X_j$ cannot in turn be a parent of $X_i$.

Therefore, we can choose a structure $G$ consistent with $\prec$ by choosing, independently, a family $U$ for each node $X_i$. The parameter modularity assumption of Eq. (3) states that the choice of parameters for the family of $X_i$ is independent of the choice of family for the other nodes in the network. Hence, summing over possible graphs consistent with $\prec$ is equivalent to summing over possible choices of family for each node, each with its parameter prior. Given our constraint on the size of the family, the set of possible parent sets for the node $X_i$ is
$$\mathcal{U}_{i,\prec} = \{ U : U \prec X_i,\ |U| \le k \},$$
where $U \prec X_i$ is defined to hold when all nodes in $U$ precede $X_i$ in $\prec$. Given that, we have
$$P(D \mid \prec) \propto \sum_{G \in \mathcal{G}_{k,\prec}} \prod_i \rho(X_i, \mathrm{Pa}_G(X_i))\, \mathrm{score}(X_i, \mathrm{Pa}_G(X_i) \mid D) = \prod_i \sum_{U \in \mathcal{U}_{i,\prec}} \rho(X_i, U)\, \mathrm{score}(X_i, U \mid D). \qquad (6)$$

Intuitively, the equality states that we can sum over all networks consistent with $\prec$ by summing over the set of possible families for each node, and then multiplying the results for the different nodes. This transformation allows us to compute $P(D \mid \prec)$ very efficiently. The expression on the right-hand side consists of a product with a term for each node $X_i$, each of which is a summation over all possible families for $X_i$. Given the bound $k$ over the number of parents, the number of possible families for a node $X_i$ is at most $\binom{n}{k} \le n^k$. Hence, the total cost of computing Eq. (6) is at most $n \cdot n^k = n^{k+1}$.

We note that the decomposition of Eq. (6) was first mentioned by Buntine [2], but the ramifications for Bayesian model averaging were not pursued. The concept of Bayesian model averaging using a closed-form summation over an exponentially large set of structures was proposed (in a different setting) in [17].

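A minimal sketch of the computation in Eq. (6) is given below. It assumes that a decomposable log family score and a log prior term $\rho$ are supplied as functions (for example, the sketches in Section 2), and that the indegree bound is $k$; all function names here are ours.

```python
# Sketch of Eq. (6): log P(D | <) as a sum over nodes of a log-sum-exp over all
# parent sets that precede the node in the ordering and have size <= k.
from itertools import combinations
from math import exp, log

def logsumexp(xs):
    m = max(xs)
    return m + log(sum(exp(x - m) for x in xs))

def candidate_families(order, pos, k):
    """All parent sets for the node at position pos: subsets of its predecessors of size <= k."""
    preds = order[:pos]
    for size in range(min(k, len(preds)) + 1):
        yield from combinations(preds, size)

def log_marginal_given_order(order, k, log_rho, log_score):
    """order: tuple of node indices; log_rho(i, U) is the per-family prior term;
    log_score(i, U) is the decomposable family score score(X_i, U | D)."""
    total = 0.0
    for pos, i in enumerate(order):
        terms = [log_rho(i, U) + log_score(i, U) for U in candidate_families(order, pos, k)]
        total += logsumexp(terms)
    return total

# Toy usage with a made-up score that mildly prefers small families.
dummy_rho = lambda i, U: 0.0
dummy_score = lambda i, U: -float(len(U))
print(log_marginal_given_order((0, 1, 2, 3), 2, dummy_rho, dummy_score))
```

Counting score evaluations, this loop matches the $n \cdot n^k$ cost discussed above.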
The computation of $P(D \mid \prec)$ is useful in and of itself; as we show in the next section, computing the probability $P(D \mid \prec)$ is a key step in our MCMC algorithm.

3.2 Probabilities of features

For certain types of features $f$, we can use the same technique to compute, in closed form, the probability $P(f \mid \prec, D)$ that $f$ holds in a structure given the ordering and the data. In general, if $f(\cdot)$ is a feature, we want to compute
$$P(f \mid \prec, D) = \frac{P(f, D \mid \prec)}{P(D \mid \prec)}.$$
We have just shown how to compute the denominator. The numerator is a sum over all structures that contain the feature and are consistent with the ordering:
$$P(f, D \mid \prec) = \sum_{G \in \mathcal{G}_{k,\prec}} f(G)\, P(G \mid \prec)\, P(D \mid G). \qquad (7)$$
The computation of this term depends on the specific type of feature $f$.

The simplest situation is when we want to compute the posterior probability of the feature $f_{X_i \to X_j}$ that denotes an edge $X_i \to X_j$. In this case, we can apply the same closed-form analysis to (7). The only difference is that we restrict $\mathcal{U}_{j,\prec}$ to consist only of subsets that contain $X_i$. Since the terms that sum over the parents of $X_l$ for $l \neq j$ are not disturbed by this constraint, they cancel out from the equation.

Proposition 3.1:
$$P(f_{X_i \to X_j} \mid \prec, D) = \frac{\sum_{U \in \mathcal{U}_{j,\prec} : X_i \in U} \rho(X_j, U)\, \mathrm{score}(X_j, U \mid D)}{\sum_{U \in \mathcal{U}_{j,\prec}} \rho(X_j, U)\, \mathrm{score}(X_j, U \mid D)}.$$

The same argument can be extended to ask more complex queries about the parents of $X_i$. For example, we can compute the posterior probability of a particular choice of parents,
$$P(\mathrm{Pa}_G(X_i) = U \mid \prec, D) = \frac{\rho(X_i, U)\, \mathrm{score}(X_i, U \mid D)}{\sum_{U' \in \mathcal{U}_{i,\prec}} \rho(X_i, U')\, \mathrm{score}(X_i, U' \mid D)}. \qquad (8)$$

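As a hedged illustration of Eq. (8) and Proposition 3.1, the sketch below enumerates the candidate families of a node under the ordering, normalizes their scores to obtain the family posterior, and sums the mass of families containing a given parent. The helper names are ours, and the inputs are assumed to be the same kind of log_rho/log_score functions as in the earlier sketches.

```python
# Sketch of Eq. (8) and Proposition 3.1: the posterior over parent sets of X_j
# given an ordering, and the posterior probability of the edge X_i -> X_j.
from itertools import combinations
from math import exp, log

def logsumexp(xs):
    m = max(xs)
    return m + log(sum(exp(x - m) for x in xs))

def candidate_families(order, node, k):
    preds = order[:order.index(node)]
    for size in range(min(k, len(preds)) + 1):
        yield from combinations(preds, size)

def log_family_posterior(order, j, k, log_rho, log_score):
    """Eq. (8): dict U -> log P(Pa(X_j) = U | <, D)."""
    terms = {U: log_rho(j, U) + log_score(j, U) for U in candidate_families(order, j, k)}
    z = logsumexp(list(terms.values()))
    return {U: t - z for U, t in terms.items()}

def edge_posterior(order, i, j, k, log_rho, log_score):
    """Proposition 3.1: mass of parent sets of X_j that contain X_i."""
    post = log_family_posterior(order, j, k, log_rho, log_score)
    return sum(exp(lp) for U, lp in post.items() if i in U)

# Toy usage with a made-up score favoring the parent set {X0} for node X2.
rho = lambda j, U: 0.0
score = lambda j, U: 0.0 if U == (0,) else -2.0
print(edge_posterior((0, 1, 2), 0, 2, k=1, log_rho=rho, log_score=score))
```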
A somewhat more subtle computation is required to compute the posterior of $f_{X_i \in \mathrm{MB}(X_j)}$, the feature that denotes that $X_i$ is in the Markov blanket of $X_j$. Recall that this is the case if $G$ contains the edge $X_i \to X_j$, or the edge $X_j \to X_i$, or there is a variable $X_l$ such that both edges $X_i \to X_l$ and $X_j \to X_l$ are in $G$.

Assume, without loss of generality, that $X_i$ precedes $X_j$ in the ordering. In this case, $X_i$ can be in $X_j$'s Markov blanket either if there is an edge from $X_i$ to $X_j$, or if $X_i$ and $X_j$ are both parents of some third node $X_l$. We have just shown how the first of these probabilities, $p_{ij} = P(f_{X_i \to X_j} \mid D, \prec)$, can be computed in closed form. We can also easily compute the probability $q_l = P(X_i \in \mathrm{Pa}_G(X_l), X_j \in \mathrm{Pa}_G(X_l) \mid D, \prec)$ that both $X_i$ and $X_j$ are parents of $X_l$: we simply restrict $\mathcal{U}_{l,\prec}$ to families that contain both $X_i$ and $X_j$. The key is to note that, since the choices of families for different nodes are independent, these are all independent events. Hence, $X_i$ and $X_j$ are not in the same Markov blanket only if all of these events fail to occur. Thus,


Proposition 3.2:
$$P(f_{X_i \in \mathrm{MB}(X_j)} \mid \prec, D) = 1 - (1 - p_{ij}) \prod_{X_l \succ X_j} (1 - q_l).$$

Unfortunately, this approach cannot be used to compute

the probability of arbitrary structural features. For example, we cannot compute the probability that there exists some directed path from $X_i$ to $X_j$, as we would have to consider all possible ways in which this path could manifest through our exponentially many structures.

We can overcome this difficulty using a simple sampling approach. Eq. (8) provides us with a closed-form expression for the exact posterior probability of the different possible families of the node $X_i$. We can therefore easily sample entire networks from the posterior distribution given the ordering: we simply sample a family for each node, according to the distribution in Eq. (8). We can then use the sampled networks to evaluate any feature, such as the existence of a causal path from $X_i$ to $X_j$.

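The sampling procedure just described can be sketched as follows: draw a family for each node independently from the Eq. (8) posterior, and evaluate an arbitrary structural feature (here, the existence of a directed path) on the sampled graphs. This is a minimal illustration under the same assumed inputs as the earlier sketches, not the paper's implementation.

```python
# Sketch of sampling entire networks from P(G | <, D): draw each node's family
# independently from the Eq. (8) posterior, then evaluate any structural feature
# (e.g. "is there a directed path from X_i to X_j?") on the sampled graphs.
import random
from itertools import combinations
from math import exp, log

def logsumexp(xs):
    m = max(xs)
    return m + log(sum(exp(x - m) for x in xs))

def candidate_families(order, node, k):
    preds = order[:order.index(node)]
    for size in range(min(k, len(preds)) + 1):
        yield from combinations(preds, size)

def sample_network(order, k, log_rho, log_score):
    """Returns a dict node -> sampled parent tuple."""
    parents = {}
    for j in order:
        fams = list(candidate_families(order, j, k))
        logits = [log_rho(j, U) + log_score(j, U) for U in fams]
        z = logsumexp(logits)
        weights = [exp(l - z) for l in logits]
        parents[j] = random.choices(fams, weights=weights, k=1)[0]
    return parents

def has_directed_path(parents, i, j):
    """True if X_i is an ancestor of X_j in the sampled graph."""
    frontier, seen = [j], set()
    while frontier:
        node = frontier.pop()
        for p in parents[node]:
            if p == i:
                return True
            if p not in seen:
                seen.add(p)
                frontier.append(p)
    return False

# Estimate P(directed path X_0 ~> X_2 | <, D) by averaging over sampled networks.
rho = lambda j, U: 0.0
score = lambda j, U: 0.0 if len(U) == 1 else -1.0
samples = [sample_network((0, 1, 2), 1, rho, score) for _ in range(1000)]
print(sum(has_directed_path(p, 0, 2) for p in samples) / len(samples))
```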
4 MCMC methods

In the previous section, we made the simplifying assumption that we were given a predetermined ordering. Although this assumption might be reasonable in certain cases, it is clearly too restrictive in domains where we have very little prior knowledge (e.g., our biology domain). We therefore want to consider structures consistent with all $n!$ possible orderings over BN nodes. Here, unfortunately, we have no elegant tricks that allow a closed-form solution. Therefore, we provide a solution which uses our closed-form solution of Eq. (6) as a subroutine in a Markov chain Monte Carlo algorithm [15].

4.1 The basic algorithm

We introduce a uniform prior over orderings $\prec$, and define $P(G \mid \prec)$ to be of the same nature as the priors we used in the previous section. It is important to note that the resulting prior over structures has a different form than our original prior over structures. For example, if we define $P(G \mid \prec)$ to be uniform, we have that $P(G)$ is not uniform: graphs that are consistent with more orderings are more likely; for example, a naive Bayes graph is consistent with $(n-1)!$ orderings, whereas any chain-structured graph is consistent with only one. While this discrepancy in priors is unfortunate, it is important to see it in proportion. The standard priors over network structures are often used not because they are particularly appropriate for a task, but rather because they are simple and easy to work with. In fact, the ubiquitous uniform prior over structures is far from uniform over PDAGs (Markov equivalence classes over network structures): PDAGs consistent with more structures have a higher induced prior probability. One can argue that, for causal discovery, a uniform prior over PDAGs is more appropriate; nevertheless, a uniform prior over networks is most often used for practical reasons. Finally, the prior induced over our networks does have some justification: one can argue that a structure which is consistent with more orderings makes fewer assumptions about causal ordering, and is therefore more likely a priori.

We now construct a Markov chain $\mathcal{M}$ with state space $\mathcal{O}$ consisting of all $n!$ orderings $\prec$; our construction will guarantee that $\mathcal{M}$ has the stationary distribution $P(\prec \mid D)$. We can then simulate this Markov chain, obtaining a sequence of samples $\prec_1, \ldots, \prec_T$. We can now approximate the expected value of any function $g(\prec)$ as
$$\mathrm{E}[g \mid D] \approx \frac{1}{T} \sum_{t=1}^{T} g(\prec_t).$$
Specifically, we can let $g(\prec)$ be $P(f \mid \prec, D)$ for some feature (edge) $f$. We can then compute $g(\prec_t) = P(f \mid \prec_t, D)$, as described in the previous section.

It remains only to discuss the construction of the Markov chain. We use a standard Metropolis algorithm [15]. We need to guarantee two things:

• that the chain is reversible, i.e., that it satisfies detailed balance, $P(\prec \mid D)\, P(\prec \to \prec') = P(\prec' \mid D)\, P(\prec' \to \prec)$;

• that the stationary probability of the chain is the desired posterior $P(\prec \mid D)$.

We accomplish this goal using standard Metropolis sampling. For each ordering $\prec$, we define a proposal probability $q(\prec' \mid \prec)$, which defines the probability that the algorithm will "propose" a move from $\prec$ to $\prec'$. The algorithm then accepts this move with probability
$$\min\left[1,\ \frac{P(\prec' \mid D)\, q(\prec \mid \prec')}{P(\prec \mid D)\, q(\prec' \mid \prec)}\right].$$
It is well known that the resulting chain is reversible and has the desired stationary distribution [7].

We consider several specific constructions for the proposal distribution, based on different neighborhoods in the space of orderings. In one very simple construction, we consider only operators that flip two nodes in the ordering (leaving all others unchanged):
$$(i_1 \ldots i_j \ldots i_k \ldots i_n) \mapsto (i_1 \ldots i_k \ldots i_j \ldots i_n).$$
Another operator is "cutting the deck" in the ordering:
$$(i_1 \ldots i_j, i_{j+1} \ldots i_n) \mapsto (i_{j+1} \ldots i_n, i_1 \ldots i_j).$$

We note that these two types of operators are qualitatively very different. The "flip" operator takes much smaller steps in the space, and is therefore likely to mix much more slowly. However, any single step is substantially more efficient to compute (see below). In our implementation, we choose a flip operator with some probability $p$, and a cut operator with probability $1 - p$. We then pick each of the possible instantiations uniformly (i.e., given that we have decided to cut, all $n$ positions are equally likely).

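A minimal sketch of the resulting order-MCMC loop is shown below. It assumes a routine that returns $\log P(D \mid \prec)$ (for example, the Eq. (6) sketch in Section 3.1) and a uniform prior over orderings; since both the flip and the cut proposals are symmetric, the Hastings correction cancels and the acceptance ratio reduces to a ratio of marginal likelihoods. The names are ours.

```python
# Sketch of the Metropolis sampler over orderings: propose either a "flip" of two
# positions or a "cut of the deck", and accept with probability
# min(1, P(<' | D) / P(< | D)).  Both proposals are symmetric, so with a uniform
# prior over orderings the ratio involves only the marginal likelihoods P(D | <).
import random
from math import exp

def propose(order, p_flip=0.9):
    order = list(order)
    n = len(order)
    if random.random() < p_flip:                      # flip two positions
        a, b = random.sample(range(n), 2)
        order[a], order[b] = order[b], order[a]
    else:                                             # cut the deck
        c = random.randrange(1, n)
        order = order[c:] + order[:c]
    return tuple(order)

def order_mcmc(log_marginal, init_order, n_steps, burn_in, thin):
    """log_marginal(order) should return log P(D | order), e.g. via Eq. (6)."""
    current = tuple(init_order)
    cur_lp = log_marginal(current)
    samples = []
    for t in range(n_steps):
        cand = propose(current)
        cand_lp = log_marginal(cand)
        if cand_lp >= cur_lp or random.random() < exp(cand_lp - cur_lp):
            current, cur_lp = cand, cand_lp
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append(current)
    return samples

# Toy usage: a fake marginal that prefers node 0 early in the ordering.
fake_lp = lambda order: -float(order.index(0))
print(len(order_mcmc(fake_lp, (0, 1, 2, 3, 4), n_steps=2000, burn_in=500, thin=100)))
```

Feature posteriors are then estimated by averaging $P(f \mid \prec_t, D)$ over the collected orderings, as described above.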

[Figure 1 here: two scatter plots (Flare, Vote); x-axis: Order, y-axis: PDAGs, both in [0, 1]; series for 50 and 200 instances.]

Figure 1: Comparison of posterior probabilities for different Markov features between full Bayesian averaging using orderings ($x$-axis) versus PDAGs ($y$-axis) for two UCI datasets (5 variables each).

4.2 Computational tricks

Although the computation of the marginal likelihood is polynomial in $n$, it can still be quite expensive, especially for large networks and reasonable sizes of $k$. We utilize several computational tricks for reducing the complexity of this computation.

First, for each node $X_i$, we restrict attention to at most $C_p$ other nodes as possible parents (for some fixed $C_p$). We select these $C_p$ nodes in advance, before any MCMC step, as follows: for each potential parent $X_j$, we compute the score of the single edge $X_j \to X_i$; we then select the $C_p$ nodes $X_j$ for which this score was highest.

Second, for each node $X_i$, we precompute the score for some number $C_f$ of the highest-scoring families. Again, this procedure is executed once, at the very beginning of the process. The list of highest-scoring families is sorted in decreasing order; let $s_i$ be the score of the worst family in $X_i$'s list. As we consider a particular ordering, we extract from the list all families consistent with that ordering. We know that all families not in the list score no better than $s_i$. Thus, if the best family extracted from the list is some factor $\delta$ better than $s_i$, we choose to restrict attention to the families extracted from the list, under the assumption that other families will have negligible effect relative to these high-scoring families. If the best family extracted is not that good, we do a full enumeration.

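A hedged sketch of these first two tricks: rank candidate parents by their single-edge score and keep only the top $C_p$ of them, then cache the $C_f$ highest-scoring families over those candidates before MCMC begins. The constants and function names are ours.

```python
# Sketch of the first two computational tricks: (1) keep only the best candidate
# parents for each node, judged by the single-edge score; (2) precompute and cache
# each node's top-scoring families over those candidates, before MCMC starts.
from itertools import combinations

def select_candidate_parents(n, i, log_score, max_parent_pool):
    """Rank potential parents of X_i by score(X_i, {Y}) and keep the best ones."""
    others = [j for j in range(n) if j != i]
    ranked = sorted(others, key=lambda j: log_score(i, (j,)), reverse=True)
    return ranked[:max_parent_pool]

def cache_top_families(i, candidates, k, log_score, cache_size):
    """Enumerate families of size <= k over the candidate parents and keep the
    cache_size highest-scoring ones, sorted in decreasing order of score."""
    fams = []
    for size in range(k + 1):
        for U in combinations(candidates, size):
            fams.append((log_score(i, U), U))
    fams.sort(reverse=True)
    return fams[:cache_size]

# During MCMC, for a given ordering we would first scan this cache for families
# consistent with the ordering, and fall back to full enumeration only when the
# best consistent cached family is not clearly better than the cache's worst score.
```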
Third, we prune the exhaustive enumeration of families by ignoring families that augment low-scoring families with low-scoring edges. Specifically, assume that for some family $U$, we have that $\mathrm{score}(X_i, U \mid D)$ is substantially lower than that of other families enumerated so far. In this case, families that extend $U$ are likely to be even worse. More precisely, we define the incremental value of a parent $Y$ for $X_i$ to be its added value as a single parent: $\Delta(Y; X_i) = \mathrm{score}(X_i, \{Y\}) - \mathrm{score}(X_i, \emptyset)$. If we now have a family $U$ such that, for all other possible parents $Y$, $\mathrm{score}(X_i, U) + \Delta(Y; X_i)$ is lower than the best family found so far for $X_i$, we prune all extensions of $U$.

Finally, we note that when we take a single MCMC step in the space, we can often preserve much of our computation.

[Figure 2 here: two scatter plots (Markov, Edges); x-axis: Exact, y-axis: MCMC, both in [0, 1]; series for 5, 20, and 50 samples.]

Figure 2: Comparison of posterior probabilities using the true posterior over orderings ($x$-axis) versus ordering-MCMC ($y$-axis). The figures show Markov features and edge features in the Flare dataset with 100 samples.

In particular, let $\prec$ be an ordering and let $\prec'$ be the ordering obtained by flipping $X_{i_j}$ and $X_{i_k}$. Now, consider the terms in Eq. (6); those terms corresponding to nodes $X_l$ in the ordering $\prec$ that precede $X_{i_j}$ or succeed $X_{i_k}$ do not change, as the set of potential parent sets $\mathcal{U}_{l,\prec}$ is the same. Furthermore, the terms for nodes $X_l$ that are between $X_{i_j}$ and $X_{i_k}$ also have a lot in common: all parent sets $U$ that contain neither $X_{i_j}$ nor $X_{i_k}$ remain the same. Thus, for each such node we only need to subtract
$$\sum_{U \in \mathcal{U}_{l,\prec} : X_{i_j} \in U} \rho(X_l, U)\, \mathrm{score}(X_l, U \mid D)$$
and add
$$\sum_{U \in \mathcal{U}_{l,\prec'} : X_{i_k} \in U} \rho(X_l, U)\, \mathrm{score}(X_l, U \mid D).$$

By contrast, the "cut" operator requires that we recompute the entire summation over families for each variable $X_i$.

5 Experimental Results

We first compared the exact posterior computed by summing over all orderings to the posterior computed by summing over all equivalence classes of Bayesian networks (PDAGs). (That is, we counted only a single representative network for each equivalence class.) The purpose of this evaluation is to assess the effect of the somewhat different prior over structures. Of course, in order to do the exact Bayesian computation we need to do an exhaustive enumeration of hypotheses. For orderings, this enumeration is possible for as many as 10 variables, but for structures, we are limited to domains with 5–6 variables. We took two datasets, Vote and Flare, from the UCI repository [16] and selected five variables from each. We generated datasets of sizes 50 and 200, and computed the full Bayesian averaging posterior for these datasets using both methods. Figure 1 compares the results for both datasets. We see that for small amounts of data, the two approaches are slightly different but in general quite well correlated. This illustrates that, at least for small datasets, the effect of our different prior does not dominate the posterior value.

Next, we compared the estimates made by our MCMC sampling over orderings to estimates given by the full Bayesian averaging over networks.


[Figure 3 here: six panels of score versus iteration; top row, structure-MCMC (runs seeded with the empty network or the greedy-search network); bottom row, order-MCMC (runs seeded with a random ordering or the greedy-search network); columns for 100, 500, and 1000 instances.]

Figure 3: Plots of the progression of the MCMC runs. Each graph shows plots of 6 independent runs over Alarm with either 100, 500, or 1000 samples. The graphs plot the score ($\log[P(D \mid G)P(G)]$ or $\log[P(D \mid \prec)P(\prec)]$) of the "current" candidate ($y$-axis) for different iterations ($x$-axis) of the MCMC sampler. In each plot, three of the runs are seeded with the network found by greedy hill-climbing search over network structures. The other three runs are seeded either by the empty network in the case of structure-MCMC or by a random ordering in the case of ordering-MCMC.

We experimented on the nine-variable Flare dataset. We ran the MCMC sampler with a burn-in period of 1,000 steps and then sampled every 100 steps; we experimented with collecting 5, 20, and 50 samples. (We note that these parameters are probably excessive, but they ensure that we are sampling very close to the stationary distribution of the process.) The results are shown in Figure 2. As we can see, the estimates are very robust. In fact, for Markov features even a sample of 5 orderings gives a surprisingly decent estimate. This is due to the fact that a single sample of an ordering contains information about exponentially many possible structures. For edges we obviously need more samples, as edges that are not in the direction of the ordering necessarily have probability 0. With 20 and 50 samples we see a very close correlation between the MCMC estimate and the exact computation for both types of features.

We then considered larger datasets, where exhaustive enumeration is not an option. For this purpose we used synthetic data generated from the Alarm BN [1], a network with 37 nodes. Here, our computational tricks are necessary. We used the following settings: $k$ (max. number of parents in a family) $= 3$; $C_p$ (max. number of potential parents) $= 20$; $C_f$ (number of families cached) $= 4000$; and $\delta$ (difference in score required in pruning) $= 10$. Note that $\delta = 10$ corresponds to a difference of $2^{10}$ in the posterior probability of the families. We note that different families have huge differences in score, so a difference of $2^{10}$ in the posterior probability is not uncommon.

Here, our primary goal was the comparison of structure-MCMC and ordering-MCMC. For the structure-MCMC, we used a burn-in of 100,000 iterations and then sampled every 25,000 iterations. For the order-MCMC, we used a burn-in of 10,000 iterations and then sampled every 2,500 iterations. In both methods we collected a total of 50 samples per run. One phenomenon that was quite clear was that ordering-MCMC runs mixed much faster. That is, after a small number of iterations, these runs reached a "plateau" where successive samples had comparable scores. Runs started in different places (including random orderings and orderings seeded from the results of a greedy-search model selection) rapidly reached the same plateau. On the other hand, MCMC runs over network structures reached very different levels of scores, even though they were run for a much larger number of iterations. Figure 3 illustrates this phenomenon for examples of Alarm with 100, 500, and 1000 instances. Note the substantial difference in scale between the two sets of graphs.

In the case of 100 instances, both MCMC samplers seemed to mix. The structure-based sampler mixes after about 20,000–30,000 iterations, while the ordering-based sampler mixes after about 1,000–2,000 iterations. On the other hand, when we examine 500 instances, the ordering-MCMC converges to a high-scoring plateau, which we believe is the stationary distribution, within 10,000 iterations. By contrast, different runs of the structure-MCMC stayed in very different regions of the space in the first 500,000 iterations. The situation is even worse in the case of 1,000 instances.


[Figure 4 here: four scatter plots over [0, 1] x [0, 1]; panels for 50 instances (empty vs. empty) and 500 instances (empty vs. empty, greedy vs. greedy, greedy vs. empty).]

Figure 4: Scatter plots that compare the posterior probability of Markov features on the Alarm dataset, as determined by different runs of structure-MCMC. Each point corresponds to a single Markov feature; its $x$ and $y$ coordinates denote the posterior estimated by the two compared runs. The position of points is slightly randomly perturbed to visualize clusters of points in the same position.

In this case the structure-based MCMC sampler that starts from an empty network does not reach the level of score achieved by the runs starting from the structure found by greedy hill-climbing search. Moreover, these latter runs seem to fluctuate around the score of the initial seed. Note that runs show differences of 100–500 bits. Thus, the sub-optimal runs sample from networks that are at least $2^{100}$ times less probable!

This phenomenon has two explanations. Either the seed structure is the global optimum and the sampler is sampling from the posterior distribution, which is "centered" around the optimum; or the sampler is stuck in a local "hill" in the space of structures from which it cannot escape. This latter hypothesis is supported by the fact that runs starting at other structures (e.g., the empty network) take a very long time to reach similar levels of score, indicating that there is a very different part of the space in which stationary behavior is reached.

We can provide further support for this second hypothesis by examining the posterior computed for different features in different runs. Figure 4 compares the posterior probability of Markov features assigned by different runs of structure-MCMC. Although different runs give a similar probability estimate to most structural features, there are several features on which they differ radically. In particular, there are features that are assigned probability close to 1 by samples from one run and probability close to 0 by samples from the other. While this behavior is less common in the runs seeded with the greedy structure, it occurs even there. This suggests that each of these runs (even runs that start at the same place) gets trapped in a different local neighborhood in the structure space.

By contrast, the predictions of different runs of the order-based MCMC sampler are tightly correlated. Figure 5 compares two runs, one starting from an ordering consistent with the greedy structure and the other from a random order. We can see that the predictions are very similar, both for the small dataset and the larger one.

[Figure 5 here: two scatter plots over [0, 1] x [0, 1], for 50 and 500 instances.]

Figure 5: Scatter plots that compare the posterior probability of Markov features on the Alarm domain as determined by different runs of ordering-MCMC. Each point corresponds to a single Markov feature; its $x$ and $y$ coordinates denote the posterior estimated by the greedy-seeded run and a random-seeded run, respectively.

This observation reaffirms our claim that these different runs are indeed sampling from similar distributions; that is, they are sampling from the true posterior.

We believe that the difference in mixing rate is due to the smoother posterior landscape of the space of orderings. In the space of networks, even a small perturbation to a network can lead to a huge difference in score. By contrast, the score of an ordering is a lot less sensitive to slight perturbations. For one, the score of each ordering is an aggregate of the scores of a very large space of structures; hence, differences in scores of individual networks can often cancel out. Furthermore, for most orderings, we are likely to find a consistent structure which is not too bad a fit to the data; hence, an ordering is unlikely to be uniformly horrible.

The disparity in mixing rates is more pronounced for larger datasets. The reason is quite clear: as the amount of data grows, the posterior landscape becomes "sharper", since the effect of a single change on the score is amplified across many samples. As we discussed above, if our dataset is large enough, model selection is often a good approximation to model averaging. (Although this is not quite the case for 1000-instance Alarm.)


Conversely, if we consider Alarm with only 100 samples, or the (fairly small) genetics dataset, graphs such as Figure 3 indicate that structure-MCMC does eventually converge (although still more slowly than ordering-MCMC).

We note that, computationally, structure-MCMC is faster than ordering-MCMC. In our current implementation, generating a successor network is about an order of magnitude faster than generating a successor ordering. We therefore designed the runs in Figure 3 to take roughly the same amount of computation time. Thus, even for the same amount of computation, ordering-MCMC mixes faster.

When both ordering-MCMC and structure-MCMC mix, it is possible to compare their estimates. In Figure 6 we see such comparisons for Alarm. We see that, in general, the estimates of the two methods are not too far apart, although the posterior estimate of the structure-MCMC is usually larger. This difference between the two approaches raises the obvious question: which estimate is better? Clearly, we cannot compute the exact posterior for a domain of this size, so we cannot answer this question exactly. However, we can test whether the posteriors computed by the different methods can reconstruct features of the generating model. To do so, we label Markov features in the Alarm domain as positive if they appear in the generating network and negative if they do not. We then use our posterior to try to distinguish "true" features from "false" ones: we pick a threshold $t$, and predict that the feature $f$ is "true" if $P(f \mid D) > t$. Clearly, as we vary the value of $t$, we will get different sets of features. At each threshold value we can have two types of errors: false negatives, positive features that are misclassified as negative, and false positives, negative features that are classified as positive. Different values of $t$ achieve different tradeoffs between these two types of errors. Thus, for each method we can plot the tradeoff curve between the two types of errors. Note that, in most applications of structure discovery, we care more about false positives than about false negatives. For example, in our biological application, false negatives are only to be expected: it is unrealistic to expect that we would detect all causal connections based on our limited data. However, false positives correspond to spuriously hypothesizing important biological connections. Thus, our main concern is with the left-hand side of the tradeoff curve, the part where we have a small number of false positives. Within that region, we want to achieve the smallest possible number of false negatives.

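A small sketch of how such a tradeoff curve can be computed from the estimated posteriors: for each threshold $t$, count the false positives and false negatives of the rule "predict the feature if its posterior exceeds $t$". The function names are ours.

```python
# Sketch of the classification tradeoff curve: sweep a threshold t over [0, 1]
# and, for each t, count false positives (features absent from the generating
# network but predicted present) and false negatives (features present in the
# generating network but predicted absent).
def tradeoff_curve(posteriors, true_features, thresholds=None):
    """posteriors: dict feature -> estimated P(f | D);
    true_features: set of features that hold in the generating network."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(101)]
    curve = []
    for t in thresholds:
        predicted = {f for f, p in posteriors.items() if p > t}
        fp = len(predicted - true_features)
        fn = len(true_features - predicted)
        curve.append((t, fp, fn))
    return curve

# Toy usage: three candidate edges, two of which are in the generating model.
post = {("A", "B"): 0.95, ("B", "C"): 0.40, ("A", "C"): 0.10}
truth = {("A", "B"), ("B", "C")}
for t, fp, fn in tradeoff_curve(post, truth, thresholds=[0.25, 0.5, 0.75]):
    print(t, fp, fn)
```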
We computed such tradeoff curves for the Alarm dataset with 100, 500, and 1000 instances for two types of features: Markov features and path features. The latter represent relations of the form "there is a directed path from $X$ to $Y$" in the PDAG of the network structure. Directed paths in the PDAG are very meaningful: if we assume no hidden variables, they correspond to a situation where $X$ causes $Y$. As discussed in Section 3, we cannot provide a closed-form expression for the posterior of such a feature given an ordering. However, we can sample networks from the ordering, and estimate the feature relative to those. In our case, we sampled 10 networks from each order.

[Figure 6 here: two scatter plots over [0, 1] x [0, 1], for 50 and 500 instances.]

Figure 6: Scatter plots that compare the posterior probability of Markov features on the Alarm domain as determined by the two different MCMC samplers. Each point corresponds to a single Markov feature; its $x$ and $y$ coordinates denote the posterior estimated by the greedy-seeded runs of ordering-MCMC and structure-MCMC, respectively.

We also compared to the tradeoff curve of the non-parametric bootstrap approach of [5], a non-Bayesian simulation approach to estimate "confidence" in features.

Figure 7 displays these tradeoff curves. As we can see, ordering-MCMC dominates in most of these cases except for one (path features with 100 instances). In particular, beyond a certain threshold value of $t$, ordering-MCMC makes no false positive errors for Markov features on the 1000-instance dataset. We believe that the features it misses are due to weak interactions in the network that cannot be reliably learned from such a small dataset.

6 Discussion and future work

In this paper, we presented a new approach for estimating the true Bayesian posterior probability of certain structural network features. Our approach is based on the use of MCMC sampling, but over orderings of the network variables rather than directly over network structures. Given an ordering sampled from the Markov chain, we can compute the probability of edge and Markov-blanket structural features using an elegant closed-form solution. For other features, we can easily sample networks from the ordering, and estimate the probability of the feature from those samples. We have shown that the resulting Markov chain mixes substantially faster than MCMC over structures, and therefore gives robust, high-quality estimates of the probability of these features. By contrast, the results of standard MCMC over structures are often unreliable, as they are highly dependent on the region of the space to which the Markov chain process happens to gravitate.

We believe that this approach can be extended to deal with datasets where some of the data is missing, by extending the MCMC over orderings with MCMC over missing values, allowing us to average over both. If successful, we can use this combined MCMC algorithm for doing full Bayesian model averaging for prediction tasks as well. Finally, we plan to apply this algorithm in our biology domain, in order to try to understand the underlying structure of gene expression.


[Figure 7 here: six tradeoff-curve plots; top row, Markov features; bottom row, path features; columns for 100, 500, and 1000 instances; each panel shows curves for Bootstrap, Order, and Structure.]

Figure 7: Classification tradeoff curves for different methods. The $x$-axis and the $y$-axis denote false positive and false negative errors, respectively. Each curve is achieved by different threshold values in the range $[0, 1]$. Each curve corresponds to the prediction based on MCMC simulation with 50 samples collected every 200 and 1000 iterations in order and structure MCMC, respectively.

Acknowledgments. The authors thank Yoram Singer for useful discussions. This work was supported by ARO grant DAAH04-96-1-0341 under the MURI program "Integrated Approach to Intelligent Systems", and by DARPA's Information Assurance program under subcontract to SRI International. Nir Friedman was supported through the generosity of the Michael Sacher Trust and Sherman Senior Lectureship. The experiments reported here were performed on computers funded by an ISF basic equipment grant.

References

[1] I. Beinlich, G. Suermondt, R. Chavez, and G. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proc. 2nd European Conf. on AI and Medicine. Springer-Verlag, Berlin, 1989.
[2] W. Buntine. Theory refinement on Bayesian networks. In UAI, pp. 52–60, 1991.
[3] W. L. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. Knowledge and Data Engineering, 8:195–210, 1996.
[4] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
[5] N. Friedman, M. Goldszmidt, and A. Wyner. Data analysis with Bayesian networks: A bootstrap approach. In UAI, pp. 206–215, 1999.
[6] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. In RECOMB, pp. 127–135, 2000.
[7] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo Methods in Practice. CRC Press, 1996.
[8] D. Heckerman. A tutorial on learning with Bayesian networks. In M. I. Jordan, editor, Learning in Graphical Models, 1998.
[9] D. Heckerman and D. Geiger. Learning Bayesian networks: a unification for discrete and Gaussian domains. In UAI, pp. 274–284, 1995.
[10] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197–243, 1995.
[11] D. Heckerman, C. Meek, and G. Cooper. A Bayesian approach to causal discovery. Technical Report MSR-TR-97-05, Microsoft Research, 1997.
[12] E. Lander. Array of hope. Nature Gen., 21, 1999.
[13] D. Madigan and A. E. Raftery. Model selection and accounting for model uncertainty in graphical models using Occam's window. J. Am. Stat. Assoc., 89:1535–1546, 1994.
[14] D. Madigan and J. York. Bayesian graphical models for discrete data. Inter. Stat. Rev., 63:215–232, 1995.
[15] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculation by fast computing machines. J. Chemical Physics, 21:1087–1092, 1953.
[16] P. M. Murphy and D. W. Aha. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1995.
[17] F. C. Pereira and Y. Singer. An efficient extension to mixture techniques for prediction and decision trees. Machine Learning, 36(3):183–199, 1999.