
J. R. Statist. Soc. B (2004) 66, Part 1, pp. 47-61

Compatible prior distributions for directed acyclic graph models

Alberto Roverato

University of Modena and Reggio Emilia, Italy

and Guido Consonni

University of Pavia, Italy

[Received June 2001. Final revision April 2003]

Summary. The application of certain Bayesian techniques, such as the Bayes factor and model averaging, requires the specification of prior distributions on the parameters of alternative models. We propose a new method for constructing compatible priors on the parameters of models nested in a given directed acyclic graph model, using a conditioning approach. We define a class of parameterizations that is consistent with the modular structure of the directed acyclic graph and derive a procedure, invariant within this class, which we name reference conditioning.

Keywords: Bayes factor; Directed acyclic graph; Fisher information matrix; Graphical model; Group reference prior; Invariance; Jeffreys conditioning; Reference conditioning; Reparameterization

1. Introduction

Model comparison is an important topic in statistical theory and practice. In particular, Bayesian model comparison, which essentially relies on the Bayes factor, has been an active area of research lately; see Kass and Raftery (1995) for a comprehensive review. The Bayes factor for comparing two models requires the assignment of two prior distributions (one for each model parameter).

Specifically, assume that there are two models $M$ and $M_0$, say, for the same observable $X$, with parameter spaces $\Theta$ and $\Theta_0$ respectively. Write $\Pi$ and $\Pi_0$ for the corresponding prior distributions, and assume, for simplicity, that $M_0$ is a submodel of $M$. If nothing was elicited to indicate that the two priors should be different, then it is sensible to specify $\Pi_0$ to be, in a sense to be made precise, as close as possible to $\Pi$. In this way the resulting Bayes factor should be least influenced by dissimilarities between the two priors due to differences in the construction processes and could thus more faithfully represent the strength of the support that the data lend to each model. Despite being clearly important, this issue has not been adequately dealt with in the literature so far, at least from a foundational perspective. A notable exception is Dawid and Lauritzen (2001), who suggested two strategies, which they named 'projection' and 'conditioning', to choose a prior $\Pi_0$ that is compatible with the given $\Pi$. The objective of this paper is to elaborate on the 'conditioning approach'. Specifically we consider statistical models that are based on directed acyclic graphs (DAGs).

Address for correspondence: Guido Consonni, Dipartimento di Economia Politica e Metodi Quantitativi, Via S. Felice 5, 27100 Pavia, Italy. E-mail: [email protected]

© 2004 Royal Statistical Society 1369-7412/04/66047


These models are very useful in many applied domains, and they represent the architecture of probabilistic expert systems and belief networks; see for example Cowell et al. (1999). DAG models may be regarded as simply depicting conditional independence relationships among the random variables involved, say $X = (X_1, \ldots, X_v)$, or alternatively as 'causal' models. For instance, the latter interpretation arises when, in the absence of unmeasured confounders, there is qualitative prior information that specifies constraints on the ordering of the random variables, as in the context of univariate recursive data-generating processes described, among others, in Cox and Wermuth (1996) and Lauritzen and Richardson (2002). In such models, the joint distribution of the observables is not the primitive notion, but rather the end result of the specification of a collection of local conditional distributions. If $X_1$ denotes the most recent response, and $X_v$ the last, purely explanatory, variable, then a generating process starts with the marginal distribution of $X_v$ and progressively generates observations on a response variable $X_i$, $i = 1, \ldots, v - 1$, conditionally on a subset of the potential ancestors $X_{i+1}, \ldots, X_v$. A related interpretation of causal DAG models is explicated in Spirtes et al. (1993), Pearl (2000) and Lauritzen (2001). In this context we distinguish between conditioning by observation and conditioning by intervention, with the causal DAG providing the relevant formulae for the latter.
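To make the recursive generating process concrete, here is a minimal sketch in Python; the three-variable DAG, its well-ordering (1, 2, 3) and the regression coefficients are hypothetical choices of ours for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_recursive(n):
    """Univariate recursive generating process for a hypothetical DAG with
    pa(3) = {}, pa(2) = {3} and pa(1) = {2, 3}; coefficients are illustrative."""
    x3 = rng.normal(0.0, 1.0, size=n)                        # marginal of the purely explanatory X_3
    x2 = 0.5 * x3 + rng.normal(0.0, 1.0, size=n)             # X_2 given its parent X_3
    x1 = 0.8 * x2 - 0.3 * x3 + rng.normal(0.0, 1.0, size=n)  # the response X_1 given X_2, X_3
    return np.column_stack([x1, x2, x3])

X = sample_recursive(10_000)
print(np.cov(X, rowvar=False).round(2))   # the implied joint covariance matrix
```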

The conditioning approach of Dawid and Lauritzen (2001) rests on the idea that a rule to construct compatible priors should be invariant to reparameterizations. This is achieved by means of a procedure which they named Jeffreys conditioning, since it relies on the Jeffreys measure. We stress the unconventional use of the Jeffreys measure in this context, namely not as a possibly uninformative prior, but rather as a half-way house towards obtaining a compatible prior. We argue that, under the assumption that the ordering of the variables is fixed, invariance to any reparameterization is not appropriate for (causal) DAG models, precisely because such models incorporate a modular structure which should be preserved. We therefore favour a more restrictive notion of invariance. A base-line measure that is consistent with our requirement of invariance is shown to be the group reference prior of Berger and Bernardo (1992), and accordingly we name our method reference conditioning. Our method is general and automatic and requires a single specification of the prior corresponding to the largest model entertained. In particular we show that reference conditioning is especially straightforward for the Gaussian case.

The structure of the paper is as follows. Section 2 describes the conditioning procedure to find compatible priors. Section 3 is devoted to the crucial issue of reparameterization for DAG models, motivating and characterizing the class of modular parameterizations. Section 4 introduces the concept of reference conditioning, which is then elaborated for Gaussian DAG models in Section 5. Finally, Section 6 offers a brief discussion of some of the issues that are involved.

2. Compatible priors by means of conditioning

Consider a model $\mathcal{M}$, i.e. a family of distributions on the given sample space, and let $\mathcal{M}_0 \subset \mathcal{M}$ denote a submodel. Let $\Pi$ be a probability on $\mathcal{M}$, representing prior uncertainty. Given a measurable function $\tau : \mathcal{M} \to T$, under mild regularity conditions a regular version of the conditional distribution of $\Pi$ given $\tau$, say $\Pi(\cdot \mid \tau = \cdot)$, is available, satisfying $\Pi(\tau = c \mid \tau = c) = 1$ for almost all $c$.

Now assume that the submodel $\mathcal{M}_0$ is defined as $\mathcal{M}_0 = \{P : \tau(P) = \tau_0\}$ for some $\tau_0$: we call $\tau$ an '$\mathcal{M}_0$-adapted' constraint function. If, for a given $\Pi$ on $\mathcal{M}$, we wish to derive a probability measure $\Pi_0$, say, on $\mathcal{M}_0$, it is tempting to equate conditioning on $\mathcal{M}_0$ with conditioning on $\tau = \tau_0$, which amounts to setting $\Pi_0(\cdot) = \Pi(\cdot \mid \tau = \tau_0)$. This conditioning procedure may appear a natural way to construct compatible distributions; however, it is not invariant with respect to the specific $\mathcal{M}_0$-adapted constraint function that is chosen to identify $\mathcal{M}_0$. As a consequence, if


the same submodel is identified through an alternative $\mathcal{M}_0$-adapted constraint function $\xi$, say, such that $\mathcal{M}_0 = \{P : \xi(P) = \xi_0\}$ for some $\xi_0$, then $\Pi(\cdot \mid \xi = \xi_0)$ may provide prior information that is different from that conveyed by $\Pi(\cdot \mid \tau = \tau_0)$. This phenomenon is called the Borel-Kolmogorov paradox.

We can recast the previous abstract formulation in the more conventional setting of a model parameterized by $\theta \in \Theta$, where the parameter space $\Theta$ is an open subset of $\mathbb{R}^q$. Let $\mathcal{M}_0$ be a submodel of $\mathcal{M}$ identified by $\Theta_0 = \{\theta \in \Theta : t(\theta) = t_0\}$, where $t := t(\cdot)$ is a measurable function on $\Theta$ and $t_0$ is a suitable (vector-valued) constant. Given a probability measure $\Pi$ on $\Theta$, representing prior uncertainty about $\theta \in \Theta$, denote with $\Pi(\cdot \mid t = \cdot)$ (a regular version of) the corresponding conditional distribution. As described in the general setting, an obvious way to derive a prior distribution $\Pi_0$ on $\Theta_0$ is by means of conditioning, i.e. by setting $\Pi_0(\cdot) = \Pi(\cdot \mid t = t_0)$. Specifically, we restrict attention to the case in which $\theta = (\theta_0, \theta_{\setminus 0})^{\mathrm{T}}$, with $\theta_0$ and $\theta_{\setminus 0}$ disjoint subvectors of $\theta$, having dimension $r$ and $q - r$ respectively. Furthermore, we take $\Pi$ absolutely continuous with respect to Lebesgue measure on $\Theta$, having density function $\pi$, and set $t(\theta) = \theta_{\setminus 0}$, so that $\Theta_0 = \{\theta \in \Theta : \theta_{\setminus 0} = t_0\}$. It follows that $\Pi_0$ is absolutely continuous with respect to Lebesgue measure (on $\mathbb{R}^r$) and has density function $\pi_0(\theta) \propto \pi(\theta)$, $\theta \in \Theta_0$.

As mentioned earlier, this conditioning procedure is subject to the Borel-Kolmogorov paradox, so, if the same submodel were identified through a different measurable function on $\Theta$, $s := s(\cdot)$ say, then $\Pi(\cdot \mid s = s_0)$ may provide prior information that is different from that conveyed by $\Pi_0$.

We remark that in statistical practice the choice of the conditioning function is typically driven by the parameterization that is used to describe the model. More precisely, consider a one-to-one (smooth) reparameterization $\psi := \psi(\theta)$ and let $\Pi_\psi = \Pi \circ \psi^{-1}$ be the distribution that is induced by $\Pi$ through $\psi(\cdot)$ on $\Psi = \psi(\Theta)$, whereas $\pi_\psi(\psi)$ represents the corresponding density function with respect to Lebesgue measure. Furthermore, assume that there is a partition of $\psi$ into subvectors $\psi_0$ and $\psi_{\setminus 0}$ such that $\Psi_0 = \psi(\Theta_0) = \{\psi : \psi_{\setminus 0} = c\}$ for some constant $c$. In this parameterization it might appear natural to induce a distribution on $\Psi_0$ by conditioning $\Pi_\psi$ on $\psi_{\setminus 0} = c$, which is equivalent to conditioning $\Pi$ on $s = c$ with $s(\theta) = \psi_{\setminus 0}(\theta)$. This explains why the paradox may be informally referred to as a lack of invariance of the conditioning procedure to model reparameterization although, strictly speaking, what really matters is the conditioning function, i.e. the functional form $s(\theta)$, rather than the actual parameterization that is employed. Before presenting an example of the paradox we introduce some notation.

Let $V = \{1, \ldots, v\}$; for $A \subseteq V$, let $X_A$ denote the random vector having components $X_i$ with $i \in A$. Suppose now that $X_V \equiv X$ has a multivariate normal distribution with expectation equal to 0 and covariance matrix $\Sigma = \{\sigma_{ij}\}$. Instances of alternative parameterizations of this model are the concentration matrix $\Sigma^{-1} = \{\sigma^{ij}\}$, the matrix $P$ with variances $\sigma_{ii}$ on the main diagonal and correlations $\rho_{ij} = \sigma_{ij}/\sqrt{(\sigma_{ii}\sigma_{jj})}$ in the off-diagonal entries, and the matrix $R$ with elements $\sigma^{ii}$ on the main diagonal and partial correlations

$$\rho_{ij \cdot V \setminus \{i,j\}} = -\frac{\sigma^{ij}}{\sqrt{(\sigma^{ii}\sigma^{jj})}}$$

in the off-diagonal entries. Recall that $\sigma^{ii} = 1/\sigma_{ii \cdot V \setminus \{i\}}$, where $\sigma_{ii \cdot V \setminus \{i\}}$ is the variance of the conditional distribution of $X_i$ given the remaining variables $X_{V \setminus \{i\}}$ (see Whittaker (1990), page 143).
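As a concrete illustration, the following sketch computes the alternative parameterizations $\Sigma^{-1}$, $P$ and $R$ from a given covariance matrix; the matrix `S` below is an arbitrary positive definite example of ours.

```python
import numpy as np

def alternative_parameterizations(S):
    """Concentration matrix, correlation matrix P and partial-correlation
    matrix R, as defined in the text, from a covariance matrix S."""
    K = np.linalg.inv(S)                 # concentration matrix {sigma^{ij}}
    sd = np.sqrt(np.diag(S))
    P = S / np.outer(sd, sd)             # correlations rho_ij off the diagonal
    np.fill_diagonal(P, np.diag(S))      # variances sigma_ii on the diagonal
    ksd = np.sqrt(np.diag(K))
    R = -K / np.outer(ksd, ksd)          # partial correlations off the diagonal
    np.fill_diagonal(R, np.diag(K))      # sigma^{ii} = 1/sigma_{ii.V\{i}} on the diagonal
    return K, P, R

S = np.array([[2.0, 0.6, 0.2],
              [0.6, 1.0, 0.4],
              [0.2, 0.4, 1.5]])          # arbitrary illustrative covariance matrix
K, P, R = alternative_parameterizations(S)
```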

2.1. Example 1 (Dawid and Lauritzen, 2001)
Let $M$ be a bivariate normal model with zero mean and covariance matrix $\Sigma$, and let $M_0$ be the submodel with $X_1$ independent of $X_2$, denoted by $X_1 \perp\!\!\!\perp X_2$. This constraint can be


expressed in several alternative ways, depending on the function of $\Sigma$ that is adopted to identify the submodel. Dawid and Lauritzen (2001) considered the following cases:

(a) $\sigma_{12} = 0$, (b) $\sigma^{12} = 0$, (c) $\rho_{12} = 0$, (d) $\beta_{12} = 0$,

where $\beta_{12} = \sigma_{12}/\sigma_{22}$. Let $\Sigma$ have an inverse Wishart distribution, $\Sigma \sim \mathrm{IW}(\delta, A)$, where $\delta$ is a positive constant and $A = \{a_{ij}\}$ is a positive definite matrix. Here we use the notation of Dawid (1981), so $\Sigma^{-1} \sim \mathrm{W}(\delta + 1, A^{-1})$. In case (a) we obtain

$$\sigma_{11} \sim a_{11}/\chi^2_{\delta+2}, \quad \sigma_{22} \sim a_{22}/\chi^2_{\delta+2}, \quad \text{independently},$$

where $\chi^2_g$ denotes a $\chi^2$ random variable with $g$ degrees of freedom. However, different answers result from the application of the same procedure with respect to the alternative functions (b)-(d):

$$\sigma_{11} \sim a_{11}/\chi^2_{\delta}, \quad \sigma_{22} \sim a_{22}/\chi^2_{\delta}, \quad \text{independently},$$
$$\sigma_{11} \sim a_{11}/\chi^2_{\delta+1}, \quad \sigma_{22} \sim a_{22}/\chi^2_{\delta+1}, \quad \text{independently},$$
$$\sigma_{11} \sim a_{11}/\chi^2_{\delta+2}, \quad \sigma_{22} \sim a_{22}/\chi^2_{\delta}, \quad \text{independently},$$

respectively.
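The disagreement among cases (a)-(d) can be seen by simulation. The sketch below, with hypothetical values for $\delta$ and $a_{11}$, draws $\sigma_{11}$ under each compatible prior; only the degrees of freedom of the $\chi^2$ divisor change across cases, yet the induced priors differ, which is the Borel-Kolmogorov phenomenon at work.

```python
import numpy as np

rng = np.random.default_rng(1)
delta, a11 = 3.0, 2.0      # hypothetical hyperparameters
n = 200_000

# Compatible prior for sigma_11 under each constraint function: only the
# degrees of freedom of the chi-squared divisor differ across (a)-(d).
draws = {
    "(a) sigma_12 = 0": a11 / rng.chisquare(delta + 2, n),
    "(b) sigma^12 = 0": a11 / rng.chisquare(delta, n),
    "(c) rho_12  = 0":  a11 / rng.chisquare(delta + 1, n),
    "(d) beta_12 = 0":  a11 / rng.chisquare(delta + 2, n),
}
for case, sample in draws.items():
    print(case, "prior median of sigma_11:", round(float(np.median(sample)), 3))
```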

Example 1 may be usefully reinterpreted in terms of graphical models. For the theory of graphs and graphical models that is used in this paper we refer to Lauritzen (1996), Cowell et al. (1999) and Lauritzen (2001), where the definitions of undirected graph, decomposable graph, DAG, perfect DAG, the undirected version of a DAG, parent, family, conditioning by intervention and conditioning by observation are provided.

Here, we denote a graph by $\mathcal{G} = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges, and to emphasize that the graph is a DAG we write $\mathcal{D} = (V, E)$. For a DAG $\mathcal{D} = (V, E)$ and a vertex $a \in V$, $\mathrm{pa}(a)$ and $\mathrm{fa}(a)$ denote the parents and the family of $a$ respectively. It is always possible to identify a well-ordering $(a_1, a_2, \ldots, a_v)$ of the vertices $V$ of a DAG such that, if two nodes are joined by an arrow, the edge points from the vertex with higher position to the vertex with lower position in the ordering. Note that the traditional definition of well-ordering (see Cowell et al. (1999), page 47) is reversed here. A DAG may not have a unique well-ordering.

A graphical model is a family of distributions for the vector $X$ satisfying a set of conditional independence relations encoded by a graph. Two graphs are said to be Markov equivalent if they encode the same set of conditional independence assertions (Frydenberg, 1990; Verma and Pearl, 1990; Andersson et al., 1997). Markov equivalence is an equivalence relation and we denote by $[\mathcal{D}]$ the set of all DAGs that are Markov equivalent to $\mathcal{D}$. We remark that all $\mathcal{D}^* \in [\mathcal{D}]$ share the same undirected version $\mathcal{D}^\sim$. Furthermore, $\mathcal{D}^\sim = (V, E^\sim)$, together with any well-ordering of the vertices of $\mathcal{D}$, fully identifies $\mathcal{D}$. In example 1 model $M$ may be described by any of the following three Markov equivalent complete graphs: the undirected graph $1 - 2$, the DAG with well-ordering $(1, 2)$, $1 \leftarrow 2$, and the DAG with well-ordering $(2, 1)$, $1 \rightarrow 2$. The submodel $M_0$ corresponds to the graph with no edge between vertices 1 and 2.

2.2. Jeffreys conditioning
Dawid and Lauritzen (2001) introduced the following generalization of the usual conditioning procedure to construct compatible priors.


Consider two base-line measures, $\nu$ and $\nu_0$, on $\Theta$ and $\Theta_0$ respectively. For a distribution $\Pi$ on $\Theta$, absolutely continuous with respect to $\nu$, we define the compatible distribution $\Pi_0$ on $\Theta_0$ to be a distribution for which the density function of $\Pi$ with respect to $\nu$, $\mathrm{d}\Pi/\mathrm{d}\nu$, and that of $\Pi_0$ with respect to $\nu_0$, $\mathrm{d}\Pi_0/\mathrm{d}\nu_0$, are proportional. If $\nu$ is a probability measure, and $\nu_0$ is obtained from $\nu$ by conditioning on $t = t_0$, then the above method gives the same answer as conditioning $\Pi$ on $t = t_0$. In other words, as underlined by Dawid and Lauritzen (2001), the arbitrariness in the form of the function that is used to identify the submodel is replaced by the arbitrariness in the form of the measures $\nu$ and $\nu_0$. Arguably, however, choosing the latter measures to be intrinsic to the models, as well as independent of specific ways of parameterizing them, would lend greater reasonableness to the whole procedure. A natural choice is represented by the Jeffreys measure, whose density with respect to Lebesgue measure is given by $j(\theta) = |H(\theta)|^{1/2}$, where $|H(\theta)|$ is the determinant of the Fisher information matrix for $\theta$, and the resulting procedure is named Jeffreys conditioning.

If $\pi$ is the density function of $\Pi$ with respect to Lebesgue measure, then the distribution on $\Theta_0$ resulting from Jeffreys conditioning has density function with respect to Lebesgue measure

$$\pi_0(\theta) \propto \frac{\pi(\theta)}{j(\theta)}, \qquad \theta \in \Theta_0. \tag{1}$$

We recall that in our setting $\theta = (\theta_0, \theta_{\setminus 0})^{\mathrm{T}}$ with $\Theta_0 = \{\theta \in \Theta : \theta_{\setminus 0} = t_0\}$, so it may be helpful, in actual computations, to express the resulting density $\pi_0$ only in terms of $\theta_0$. Even though Jeffreys conditioning is stated in terms of a specific parameterization as in expression (1), we emphasize that the procedure is invariant to the choice of such a parameterization, as shown in Dawid and Lauritzen (2001). Specifically, let $\psi$ be an alternative parameterization as described in Section 2. If $\pi_0(\psi)$, $\psi \in \Psi_0$, is obtained by applying Jeffreys conditioning with respect to $\pi_\psi(\psi)$, $\psi \in \Psi$, then we can switch from $\pi_0(\psi)$ to $\pi_0(\theta)$ by the usual change-of-variables technique.
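Computationally, expression (1) amounts to reweighting the prior density by $1/j(\theta)$ on the slice that identifies the submodel and renormalizing. The sketch below illustrates the mechanics on a grid, with hypothetical stand-in densities `pi` and `j` (they are not the prior and Jeffreys measure of example 1).

```python
import numpy as np

def jeffreys_condition(pi, j, theta0_grid, t0):
    """pi_0(theta_0) proportional to pi(theta_0, t0)/j(theta_0, t0),
    normalized numerically on a grid for the free parameter theta_0."""
    w = pi(theta0_grid, t0) / j(theta0_grid, t0)
    return w / (w.sum() * (theta0_grid[1] - theta0_grid[0]))

pi = lambda a, b: a * np.exp(-a * (1.0 + b))   # stand-in joint prior density
j = lambda a, b: 1.0 / (a * b)                 # stand-in Jeffreys density
grid = np.linspace(0.01, 10.0, 1000)
pi0 = jeffreys_condition(pi, j, grid, t0=1.0)  # density on the slice t = t0
```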

Assume now that the ratio of the densities of the Jeffreys measures for $\psi$ under $M$ and $M_0$ is constant for all $\psi \in \Psi_0$. In this case, the application of expression (1) leads to $\pi_0(\psi) \propto \pi_\psi(\psi)$, i.e. Jeffreys conditioning gives the same answer as conditioning on $\psi_{\setminus 0} = c$. In this sense we may say that $\psi$ is a parameterization implicitly leading to Jeffreys conditioning for $M$ and $M_0$, although, as remarked earlier, we could formulate the argument in terms of the conditioning function $s(\theta) = \psi_{\setminus 0}(\theta)$, rather than in terms of the parameterization $\psi(\theta)$. For the problem in example 1, Dawid and Lauritzen (2001) showed that $R$ is a parameterization implicitly leading to Jeffreys conditioning.

3. Directed acyclic graph models and reparameterizations

Consider a DAG $\mathcal{D} = (V, E)$ with $V = \{1, \ldots, v\}$ and assume, without loss of generality, that $(1, 2, \ldots, v)$ is a well-ordering of $V$. For each $i \in V$ we specify a family $\mathcal{M}_i^{\mathcal{D}}$ of local conditional densities

$$p(x_i \mid x_{\mathrm{pa}(i)}, \eta_i), \qquad \eta_i \in H_i, \tag{2}$$

with the local parameters $\eta_i$ variation independent. Multiplying the conditional densities (2) we obtain the family $\mathcal{M}^{\mathcal{D}} := \mathcal{M}_1^{\mathcal{D}} \times \cdots \times \mathcal{M}_v^{\mathcal{D}}$ of distributions for $X = (X_1, \ldots, X_v)$ with joint density

$$p(x \mid \eta) := \prod_{i=1}^{v} p(x_i \mid x_{\mathrm{pa}(i)}, \eta_i), \tag{3}$$

where $\eta := (\eta_1, \ldots, \eta_v)$, with $\eta \in H := H_1 \times \cdots \times H_v$.


Although expression (3) specifies a distribution which is directed Markov with respect to any $\mathcal{D}^* \in [\mathcal{D}]$, we take the view, as discussed in Section 1, that the vertices of $\mathcal{D}$ are arranged in a causal order (see also Cowell et al. (1999), page 259) and assume the absence of unmeasured confounders that alter the causal interpretation of arrows; see Lauritzen and Richardson (2002) for a detailed discussion of this point. An important special case of this situation is when expression (3) is the result of a structural assignment system, as described in Lauritzen and Richardson (2002), equation (4). It is worth pointing out that in this case the distribution of $X$ is causally Markov with respect to $\mathcal{D}$ (Lauritzen (2001), theorem 2.20).

In the causal interpretation of DAG models, the local conditional families $\mathcal{M}_i^{\mathcal{D}}$ represent the primitive modules of the overall model $\mathcal{M}^{\mathcal{D}}$, in the sense that they are assumed to remain invariant under (conditioning by) intervention. Consequently, the partition of $\eta$ into $v$ groups, each being associated with a local family, is an integral part of the parametric structure which must be preserved across reparameterizations. The previous remark implies that a DAG model cannot be arbitrarily reparameterized; to clarify this point and eventually to reach a definition of modular parameterization we proceed in steps. Consider first a (smooth) one-to-one transformation of $\eta$, $\theta$ say, such that $\theta = (\theta_1, \ldots, \theta_v)$ and $\theta_i$ is a bijection of $\eta_i$, for each $i$. Since each $\theta_i$ is uniquely associated with $\mathcal{M}_i^{\mathcal{D}}$, we say that this transformation is modular with respect to $\mathcal{D}$. However, a more general situation could be envisaged, namely one in which $\theta_i$ is no longer a bijection of $\eta_i$, and yet $\theta_i$ can still be unambiguously linked to $\mathcal{M}_i^{\mathcal{D}}$. This is formalized in the following definition.

Definition 1. Consider the family $\mathcal{M}^{\mathcal{D}}$ with density (3). Let $\theta = \theta(\eta)$ be a bijection of $\eta = (\eta_1, \ldots, \eta_v)$. We say that $\theta$ is $\mathcal{D}$ modular if it admits a grouping $\theta_1, \ldots, \theta_v$, with $\dim(\theta_i) = \dim(\eta_i)$, and an ordering $(j_1, \ldots, j_v)$ of the indices $(1, \ldots, v)$ such that

$$\theta_{j_k} = \theta_{j_k}(\eta_{j_k}, \eta_{j_{k-1}}, \ldots, \eta_{j_1}), \qquad k = 1, \ldots, v, \tag{4}$$

where, for fixed $(\eta_{j_1}, \ldots, \eta_{j_{k-1}})$, $\theta_{j_k}$ is a one-to-one function of $\eta_{j_k}$, $k = 1, \ldots, v$.

Remark 1. Clearly the last equation in expression (4) is always satisfied. Furthermore, as suggested by a referee, the conditions in the above definition are equivalent to requiring that $(\theta_{j_1}, \ldots, \theta_{j_k})$ is a one-to-one function of $(\eta_{j_1}, \ldots, \eta_{j_k})$ for all $k = 1, \ldots, v$.

The property of $\mathcal{D}$-modularity ensures that each component $\theta_{j_k}$ can be written as a function of $(\eta_{j_k}, \theta_{j_1}, \ldots, \theta_{j_{k-1}})$ where, for fixed $(\theta_{j_1}, \ldots, \theta_{j_{k-1}})$, $\theta_{j_k}$ is a one-to-one function of $\eta_{j_k}$, whereas, for $i > k$, $\theta_{j_i}$ can depend on $\eta_{j_k}$ only indirectly through $\theta_{j_k}$. For instance, for $v = 3$, $j_1 = 1$, $j_2 = 2$ and $j_3 = 3$, we have

$$\theta_1 = \theta_1(\eta_1), \qquad \theta_2 = \theta_2(\eta_2, \theta_1), \qquad \theta_3 = \theta_3(\eta_3, \theta_1, \theta_2).$$

In this sense we can say that $\theta_{j_k}$ is unambiguously associated with the local conditional family $\mathcal{M}_{j_k}^{\mathcal{D}}$.


3.1. Example 2
Consider two local conditional Gaussian families corresponding to the DAG $\mathcal{D}: 1 \leftarrow 2$, such that $(X_1 \mid X_2 = x_2, \eta_1) \sim N(\beta_{12}x_2, \sigma_{11\cdot2})$ and $(X_2 \mid \eta_2) \sim N(0, \sigma_{22})$, where $\eta_1 = (\beta_{12}, \sigma_{11\cdot2})$ and $\eta_2 = \sigma_{22}$. As in example 1, the joint distribution of $(X_1, X_2)$ is bivariate normal with zero mean.

(a) Consider the transformation from $\eta$ to the local conditional parameters in the reverse DAG $\mathcal{D}^*: 1 \rightarrow 2$, namely $\eta^* = (\beta_{21}, \sigma_{22\cdot1}, \sigma_{11})$. It can be verified that

$$\sigma_{22\cdot1} = \frac{\sigma_{11\cdot2}\,\sigma_{22}}{\beta_{12}^2\sigma_{22} + \sigma_{11\cdot2}}, \qquad \beta_{21} = \frac{\beta_{12}\,\sigma_{22}}{\beta_{12}^2\sigma_{22} + \sigma_{11\cdot2}}$$

and

$$\sigma_{11} = \beta_{12}^2\sigma_{22} + \sigma_{11\cdot2}.$$

Hence, this transformation is not $\mathcal{D}$ modular, although $\mathcal{D}^*$ is Markov equivalent to $\mathcal{D}$. For example, letting $\theta_1(\eta) = (\beta_{21}, \sigma_{22\cdot1})$ and $\theta_2(\eta) = \sigma_{11}$, we immediately realize that $\theta_1$ and $\theta_2$ are each simultaneously a function of both $\eta_1$ and $\eta_2$. Clearly no other grouping would overcome this difficulty.

(b) Consider the transformation $\eta \to \Sigma$. Because of symmetry, only the upper diagonal elements, say, including the diagonal itself, need be considered. If we set $\theta_1(\eta) = (\sigma_{11}, \sigma_{12})$ and $\theta_2(\eta) = \sigma_{22}$, then $\theta_2$ is a function of $\eta_2$ only, and no further check is required to conclude that the transformation is $\mathcal{D}$ modular.

(c) It is straightforward to verify that we still have a $\mathcal{D}$ modular parameterization if in the previous point $\Sigma$ is replaced by $P$, so that $\theta(\eta) = (\sigma_{11}, \rho_{12}, \sigma_{22})$.

(d) Consider now the transformation $\eta \to \Sigma^{-1}$ and let $\theta_1(\eta) = (\sigma^{11}, \sigma^{12})$ and $\theta_2(\eta) = \sigma^{22}$. By recalling that $\beta_{12} = -\sigma^{12}/\sigma^{11} = \sigma_{12}/\sigma_{22}$ (see Cox and Wermuth (1996), page 69) we can write

$$\theta_1(\eta) = (1/\sigma_{11\cdot2},\ -\beta_{12}/\sigma_{11\cdot2}), \qquad \theta_2(\eta) = (\beta_{12}^2\sigma_{22} + \sigma_{11\cdot2})/(\sigma_{11\cdot2}\,\sigma_{22}). \tag{5}$$

Since $\theta_1$ is a function of $\eta_1$ only, $\Sigma^{-1}$ is $\mathcal{D}$ modular.

(e) In parallel with case (c), consider the transformation $\eta \to R = \{\sigma^{11}, \rho_{12}, \sigma^{22}\}$. Using equations (5) we can re-express $R$ in terms of $(\eta_1, \eta_2)$, to show that no grouping of the elements of $R$ can satisfy expression (4); hence $R$ is not $\mathcal{D}$ modular.

Working through example 2 one can derive that $\Sigma$, $P$ and $\Sigma^{-1}$ are also $\mathcal{D}^*$ modular, whereas $R$ is neither $\mathcal{D}$ nor $\mathcal{D}^*$ modular. Interestingly, $R$ is the parameterization implicitly leading to Jeffreys conditioning for this problem.
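The algebra of example 2 is easy to check numerically. In the sketch below (our own illustrative values), perturbing $\eta_2 = \sigma_{22}$ alone moves all three components of $\eta^*$, which is exactly the failure of $\mathcal{D}$-modularity noted in point (a).

```python
import numpy as np

def eta_to_sigma(beta12, s11_2, s22):
    """Joint covariance implied by the local families of D: 1 <- 2."""
    s11 = beta12**2 * s22 + s11_2
    s12 = beta12 * s22
    return np.array([[s11, s12], [s12, s22]])

def eta_star(beta12, s11_2, s22):
    """Parameters (beta_21, sigma_22.1, sigma_11) of the reverse DAG D*: 1 -> 2."""
    S = eta_to_sigma(beta12, s11_2, s22)
    beta21 = S[0, 1] / S[0, 0]
    s22_1 = S[1, 1] - S[0, 1]**2 / S[0, 0]
    return beta21, s22_1, S[0, 0]

print(eta_star(0.5, 1.0, 2.0))   # eta_2 = 2.0
print(eta_star(0.5, 1.0, 4.0))   # eta_2 = 4.0: all three components change
```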

4. Reference conditioning

According to the theory that was developed in the previous section, a procedure for the construction of compatible priors for DAG models can be obtained by applying Dawid and Lauritzen's (2001) conditioning method with respect to two measures, $\nu$ and $\nu_0$, with the only requirement that each be intrinsic to the model and invariant within the class of modular parameterizations. The Jeffreys measure, being invariant with respect to any reparameterization, can clearly be used; however, it may lead to questionable results as illustrated in the following example.


4.1. Example 3
Let $\mathcal{M}^{\mathcal{D}} := \mathcal{M}_1^{\mathcal{D}} \times \mathcal{M}_2^{\mathcal{D}}$ be as in example 2, so that $\eta = (\eta_1, \eta_2)$ with $\eta_1 = (\sigma_{11\cdot2}, \beta_{12})$ and $\eta_2 = \sigma_{22}$. The submodel $\mathcal{M}^{\mathcal{D}_0} := \mathcal{M}_1^{\mathcal{D}_0} \times \mathcal{M}_2^{\mathcal{D}_0}$, wherein $X_1$ is independent of $X_2$, differs from $\mathcal{M}^{\mathcal{D}}$ only with respect to the first conditional density, i.e. $\mathcal{M}_2^{\mathcal{D}} = \mathcal{M}_2^{\mathcal{D}_0}$. Assume that, under $\mathcal{M}^{\mathcal{D}}$, $\eta_1 \perp\!\!\!\perp \eta_2$, so that $(X_1, \eta_1) \perp\!\!\!\perp \eta_2 \mid X_2$. Often prior distributions on $\eta_1$ and $\eta_2$ may be regarded as posteriors from $\mathcal{M}^{\mathcal{D}}$ with (possibly fictitious) complete data on $X_1$ and $X_2$. As a consequence, the prior on $\eta_2$ should depend on 'prior' data on $X_2$ only. This is trivially true under $\mathcal{M}^{\mathcal{D}_0}$ also, and thus the prior distributions on $\eta_2$ should be the same under the two models, so that the Bayes factor to compare $\mathcal{M}_2^{\mathcal{D}}$ and $\mathcal{M}_2^{\mathcal{D}_0}$ would be identically 1. This is an instance of prior modularity; see Heckerman et al. (1995). If, as in example 1, we take $\Sigma \sim \mathrm{IW}(\delta, A)$, then the induced prior distribution for $\eta$ under $\mathcal{M}^{\mathcal{D}}$ is such that $\eta_1 \perp\!\!\!\perp \eta_2$ with $\eta_2 \sim a_{22}/\chi^2_{\delta}$ (see example 6(iii) of Dawid and Lauritzen (2001)); however, the compatible prior distribution for $\eta_2$ under $\mathcal{M}^{\mathcal{D}_0}$, obtained through Jeffreys conditioning, is $a_{22}/\chi^2_{\delta+1}$; see point (c) of example 1.

The unsatisfactory results deriving from the application of Jeffreys conditioning in DAG models can be regarded as a consequence of the fact that the prior information on the modular structure, which in turn gives rise to the grouping of $\eta$ into $(\eta_1, \ldots, \eta_v)$, does not enter the definition of the procedure. This suggests that we should revert to a conditioning approach whose base-line measure takes into explicit consideration the above parameter grouping. We next show that this can be obtained by means of the ordered group reference prior.

Reference priors were introduced by Bernardo (1979) and generalized to multiparameter problems by Berger and Bernardo (1992); see also Bernardo and Smith (1994) for the definition of reference priors and a description of the procedure to derive them. Here in particular we shall restrict our attention to the so-called regular case, which ensures asymptotic normality of the posterior distribution of the parameter. We wish to emphasize that a reference prior for a parameter $\theta$ is derived with respect to an ordered grouping of the elements of $\theta \in \Theta$. However, if the group components are likelihood independent (i.e. variation independent with block diagonal Fisher information matrix), as in the case of the modular components of the $\eta$-parameterization for DAG models, then the ordering of the grouping is irrelevant; see Gutiérrez-Peña and Rueda (2003) and remark 2 below. Accordingly we shall employ the term group reference measure for $\eta$ and denote its density with respect to Lebesgue measure by $r(\eta)$. Datta and Ghosh (1996) showed that such a measure is invariant with respect to the class of reparameterizations satisfying expression (4), i.e. the class of $\mathcal{D}$-modular parameterizations.

Along with Jeffreys conditioning, we can now define a new approach, named reference conditioning, by imposing that $\nu$ and $\nu_0$ be group reference measures. For a given well-ordering $(1, 2, \ldots, v)$ of the vertex set $V$, let $\mathcal{D} = (V, E)$ and $\mathcal{D}_0 = (V, E_0)$, $E_0 \subset E$, be two DAGs, and denote by $\mathcal{M}^{\mathcal{D}}$ and $\mathcal{M}^{\mathcal{D}_0}$ the related DAG models; see equation (3). Now partition $\eta \in H$ as $\eta = (\eta_0, \eta_{\setminus 0})^{\mathrm{T}}$, so that the parameter space $H_0$, corresponding to $\mathcal{M}^{\mathcal{D}_0}$, is identified through the constraint $\eta_{\setminus 0} = c$. If $\pi$ is the density with respect to Lebesgue measure of the prior distribution $\Pi$ on $H$, then the distribution on $H_0$ resulting from the application of reference conditioning has density function with respect to Lebesgue measure

$$\pi_0(\eta) \propto \frac{\pi(\eta)}{r(\eta)}, \qquad \eta \in H_0. \tag{6}$$

Remark 2. Reference conditioning may be defined with respect to any modular parameterization $\theta$ of $\mathcal{M}^{\mathcal{D}}$. However, if the group components of $\theta$ are not likelihood independent, the reference measure must be derived with respect to the ordered grouping $(\theta_{j_1}, \ldots, \theta_{j_v})$ of definition 1. For instance, if in example 2 the parameterization $\Sigma$ is used, then the reference measure


must be computed with respect to the ordered groups $\theta_1 = \sigma_{22}$ and $\theta_2 = (\sigma_{11}, \sigma_{12})$ for $\mathcal{M}^{\mathcal{D}}$, but with respect to the ordered groups $\theta_1 = \sigma_{11}$ and $\theta_2 = (\sigma_{22}, \sigma_{12})$ for $\mathcal{M}^{\mathcal{D}^*}$.

In parallel with the discussion at the end of Section 2.2, suppose that there is a $\mathcal{D}$-modular parameterization $\psi = \psi(\eta)$ such that the ratio of the densities of the reference measures for $\psi$ under $\mathcal{M}^{\mathcal{D}}$ and $\mathcal{M}^{\mathcal{D}_0}$ is constant for all $\psi \in \Psi_0 = \psi(H_0)$. Then the application of expression (6) leads to $\pi_0(\psi) \propto \pi_\psi(\psi)$ and we say that $\psi$ is a parameterization implicitly leading to reference conditioning for $\mathcal{M}^{\mathcal{D}}$ and $\mathcal{M}^{\mathcal{D}_0}$.

5. Gaussian directed acyclic graph models

A Gaussian DAG model is specified through equation (3), wherein each local conditional, or regression, family is Gaussian. The $i$th family, $i = 1, \ldots, v$, is typically parameterized by

$$\eta_i = (\beta_i, \sigma_{ii\cdot\mathrm{pa}(i)}), \tag{7}$$

with $\beta_i = (\beta_{ij},\ j \in \mathrm{pa}(i))$ representing the regression coefficients. Clearly the local parameters $\eta_i$ are variation independent. The rest of this paper is entirely devoted to this case, so in the following $\mathcal{M}_i^{\mathcal{D}}$ will denote the $i$th local regression family, parameterized by equation (7), and $\mathcal{M}^{\mathcal{D}}$ the overall Gaussian DAG model. We remark that any $\mathcal{D}$-modular parameterization $\theta_i = (\theta_{ij};\ j \in \mathrm{fa}(i))$, $i = 1, \ldots, v$, with the $\theta_i$s variation independent, would be equivalent for the development of the theory in this paper. One such parameterization, which plays a key role here, is given by $\phi = (\phi_1, \ldots, \phi_v)$, where

$$\phi_{ii} = 1/\sqrt{\sigma_{ii\cdot\mathrm{pa}(i)}}, \qquad \phi_{ij} = -\beta_{ij}/\sqrt{\sigma_{ii\cdot\mathrm{pa}(i)}}, \quad j \in \mathrm{pa}(i), \tag{8}$$

with the understanding that $(1, 2, \ldots, v)$ is a well-ordering of the vertices, so that $\mathrm{pa}(i) \subseteq \{i + 1, \ldots, v\}$. If $\Phi$ denotes the upper triangular matrix with entries $\phi_{ij}$ and 0 elsewhere, then $\Phi^{\mathrm{T}}\Phi = \Sigma^{-1}$ represents the Cholesky decomposition of $\Sigma^{-1}$ (see Wermuth (1980) and Roverato (2000, 2002)).
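In computational terms, $\phi$ is read off the Cholesky factorization of the concentration matrix. A minimal sketch, assuming `scipy` is available and using an arbitrary bivariate $\Sigma$, which also verifies expression (8) for the first row:

```python
import numpy as np
from scipy.linalg import cholesky

def phi_matrix(S):
    """Upper triangular Phi with Phi^T Phi = S^{-1}, i.e. the Cholesky
    factor of the concentration matrix for the well-ordering (1, ..., v)."""
    return cholesky(np.linalg.inv(S), lower=False)

S = np.array([[2.0, 0.6], [0.6, 1.0]])   # arbitrary bivariate covariance
Phi = phi_matrix(S)

s11_2 = S[0, 0] - S[0, 1]**2 / S[1, 1]   # conditional variance of X_1 given X_2
beta12 = S[0, 1] / S[1, 1]               # regression coefficient of X_1 on X_2
assert np.isclose(Phi[0, 0], 1.0 / np.sqrt(s11_2))       # phi_11, expression (8)
assert np.isclose(Phi[0, 1], -beta12 / np.sqrt(s11_2))   # phi_12, expression (8)
```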

With reference to the bivariate case, example 2 has already provided instances of modular, and non-modular, parameterizations. These results are extended and generalized below. Consider the class $[\mathcal{D}]$ of DAGs that are Markov equivalent to $\mathcal{D}$. For all $\mathcal{D}^* \in [\mathcal{D}]$, $\Sigma$, $\Sigma^{-1}$, $P$ and $R$ are all one-to-one transformations of the parameterization $\eta^*$ of $\mathcal{M}^{\mathcal{D}^*}$. We remark that, when $\mathcal{D}$ is not complete, $\Sigma$ has free entries $\sigma_{ij}$ with $j \in \mathrm{fa}(i)$, whereas all the remaining entries are functions of these. The same is true for $\Sigma^{-1}$, $P$ and $R$.

Proposition 1. Let $\mathcal{M}^{\mathcal{D}}$ be a Gaussian DAG model. Then $\Sigma$, $\Sigma^{-1}$ and $P$ belong to the class of $\mathcal{D}^*$-modular parameterizations for all $\mathcal{D}^* \in [\mathcal{D}]$.

For a proof, see Appendix A. However, example 2 makes clear that in general the matrix $R$ does not belong to the class of $\mathcal{D}^*$-modular parameterizations for all $\mathcal{D}^* \in [\mathcal{D}]$, and that for all $\mathcal{D}^* \in [\mathcal{D}] \setminus \{\mathcal{D}\}$ the standard parameterization $\eta^*$ of $\mathcal{M}^{\mathcal{D}^*}$ is not $\mathcal{D}$ modular.

Before turning to more technical matters, we remark that, starting from two models with Markov equivalent DAGs $\mathcal{D}$ and $\mathcal{D}^*$ and consistent prior distributions for $\eta$ and $\eta^*$ (in the sense that they are induced from the same prior on $\Sigma$, say), reference conditioning may lead to different compatible priors for the parameter $\eta^0$ of a common submodel. This is not surprising since the prior on $\eta^0$ is chosen to be compatible with either the prior on $\eta$ or that on $\eta^*$, depending on whether the assumed DAG is $\mathcal{D}$ or $\mathcal{D}^*$. We must remark, however, that our method is not meant


to be used when several Markov equivalent models are under consideration, because a fixed ordering of the variables is assumed throughout.

5.1. Reference conditioning
For Gaussian DAG models the method of reference conditioning is easily implementable using the modular parameterization $\phi$ described in expression (8).

For a complete DAG model the Fisher information as well as the group reference measure for $\phi$ can be easily deduced from that of Consonni and Veronese (2003). These results can be extended to an arbitrary DAG model as follows.

Theorem 2. Let $\mathcal{M}^{\mathcal{D}}$ be a Gaussian DAG model with DAG $\mathcal{D} = (V, E)$. With respect to the parameterization $\phi$ described in expression (8),

(a) the Fisher information is block diagonal with the $i$th block given by

$$H_{ii}(\phi) = \Psi^i (2 \oplus I) (\Psi^i)^{\mathrm{T}},$$

where $\Psi^i$ is the unique upper triangular matrix with positive diagonal entries such that $\Sigma_{\mathrm{fa}(i),\mathrm{fa}(i)} = \Psi^i(\Psi^i)^{\mathrm{T}}$ and $I$ is the identity matrix with dimension equal to the cardinality of the set $\mathrm{pa}(i)$, and

(b) the group reference measure for the parameterization $\phi$ has density with respect to Lebesgue measure $r(\phi) \propto \prod_{i=1}^{v} 1/\phi_{ii}$.
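Under our reading of theorem 2, with $2 \oplus I$ taken to be the direct sum $\mathrm{diag}(2, I)$, the $i$th Fisher information block can be computed as in the sketch below; the final assertion checks the bivariate case against a direct second-moment calculation of ours, $H_{11}(\phi) = \mathrm{diag}(\sigma_{11\cdot2}, 0) + \Sigma_{\mathrm{fa}(1),\mathrm{fa}(1)}$.

```python
import numpy as np
from scipy.linalg import cholesky, block_diag, inv

def fisher_block(S, i, pa):
    """i-th diagonal block of the Fisher information for phi, theorem 2(a);
    fa(i) is taken as [i] followed by pa, in well-ordering order."""
    fa = [i] + list(pa)
    U = cholesky(np.linalg.inv(S[np.ix_(fa, fa)]), lower=False)  # Sigma_fa^{-1} = U^T U
    Psi = inv(U)                             # upper triangular, Sigma_fa = Psi Psi^T
    D = block_diag([[2.0]], np.eye(len(pa)))                     # 2 (+) I
    return Psi @ D @ Psi.T

S = np.array([[2.0, 0.6], [0.6, 1.0]])
H11 = fisher_block(S, 0, [1])
s11_2 = S[0, 0] - S[0, 1]**2 / S[1, 1]
assert np.allclose(H11, np.diag([s11_2, 0.0]) + S)   # direct bivariate check
```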

For a proof, see Appendix A. As an immediate consequence of theorem 2 we obtain the following corollary.

Corollary 1. Let $\mathcal{M}^{\mathcal{D}}$ be a Gaussian DAG model with $\mathcal{D} = (V, E)$ and $\mathcal{M}^{\mathcal{D}_0}$ a submodel with $\mathcal{D}_0 = (V, E_0)$ and $E_0 \subset E$. Then $\phi$ is a parameterization implicitly leading to reference conditioning for $\mathcal{M}^{\mathcal{D}}$ and $\mathcal{M}^{\mathcal{D}_0}$, so that reference conditioning is equivalent to conditioning the distribution under $\mathcal{M}^{\mathcal{D}}$ on $(\phi_{ij} = 0;\ (i, j) \in E \setminus E_0)$.

Proof. We must prove that $r_0(\phi)/r(\phi)$ is constant for all $\phi : (\phi_{ij} = 0;\ (i, j) \in E \setminus E_0)$, and this is obviously true because both $r_0(\phi)$ and $r(\phi)$ are proportional to $\prod_{i=1}^{v} 1/\phi_{ii}$.

In DAG modelling it is common practice to specify prior distributions which satisfy the property of global parameter independence; see Spiegelhalter and Lauritzen (1990). In our context this amounts to assuming that the groups $\phi_1, \ldots, \phi_v$ of $\phi$ are mutually stochastically independent, so that reference conditioning can be performed separately for each component $\phi_i$. Trivially, the resulting compatible distribution for the parameter of a submodel will also satisfy the global parameter independence property. Moreover, if a local regression $\mathcal{M}_i^{\mathcal{D}}$ of $\mathcal{M}^{\mathcal{D}}$ is unchanged in the submodel, then reference conditioning leaves the corresponding parameter prior also unchanged, so that prior modularity is automatically satisfied.

The inverse Wishart distribution represents the most-used prior distribution for the parameter $\Sigma$ of a complete Gaussian DAG model. In this case the distribution of $\phi$ is easily derived and can be shown to satisfy global parameter independence. Furthermore, reference conditioning can be routinely applied by using corollary 1, thereby providing a practical method for assigning parameter priors to candidate DAG models via a small number of direct assessments.

5.2. Example 4
For the problem in example 2 let $\Sigma \sim \mathrm{IW}(\delta, A)$ as in example 1. It can be checked, by using a result cited in Roverato (2002), result R.7, that $(\phi_{11}, \phi_{12}) \perp\!\!\!\perp \phi_{22}$, with $\phi_{11}^2 \sim \chi^2_{\delta+1}/a_{11\cdot2}$ and


$\phi_{12} \mid \phi_{11} \sim N(-\phi_{11}a_{12}/a_{22},\ a_{22}^{-1})$ and $\phi_{22}^2 \sim \chi^2_{\delta}/a_{22}$, where $a_{11\cdot2} = a_{11} - a_{12}^2/a_{22}$. Applying reference conditioning, i.e. conditioning on $\{\phi_{12} = 0\}$, we obtain $\sigma_{11} \sim a_{11}/\chi^2_{\delta+1}$ and $\sigma_{22} \sim a_{22}/\chi^2_{\delta}$ with $\sigma_{11} \perp\!\!\!\perp \sigma_{22}$, which is different from any of the results obtained in example 1. In particular, unlike result (c) arising from Jeffreys conditioning, the distribution of $\phi_{22} = 1/\sqrt{\sigma_{22}}$ is the same under the two models.
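A quick Monte Carlo check of these distributional statements is sketched below; we assume the correspondence df $= \delta + p - 1$ between Dawid's $\mathrm{IW}(\delta, A)$ notation and scipy's inverse Wishart degrees of freedom (the mapping that reproduces the marginals $a_{ii}/\chi^2_{\delta}$ for the diagonal entries), and the hyperparameter values are hypothetical.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(2)
delta, p = 3.0, 2
A = np.array([[2.0, 0.7], [0.7, 5.0]])   # hypothetical hyperparameters

# Dawid's IW(delta, A) taken (our assumption) as scipy's invwishart
# with df = delta + p - 1.
Sig = invwishart.rvs(df=delta + p - 1, scale=A, size=100_000, random_state=rng)

phi11_sq = 1.0 / (Sig[:, 0, 0] - Sig[:, 0, 1]**2 / Sig[:, 1, 1])  # 1/sigma_{11.2}
phi22_sq = 1.0 / Sig[:, 1, 1]                                     # 1/sigma_22
a11_2 = A[0, 0] - A[0, 1]**2 / A[1, 1]

print(np.mean(phi11_sq * a11_2))    # should be close to delta + 1 = 4
print(np.mean(phi22_sq * A[1, 1]))  # should be close to delta = 3
```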

6. Discussion

In this paper we consider DAG models wherein the order of the variables is fixed; this specifies a collection of local conditional families which represents the modular structure of the DAG model. It is the latter aspect, as opposed to vertex ordering, which is emphasized throughout the paper. The $\eta$-parameterization of equation (3), or alternative one-to-one componentwise transformations such as $\phi$ described in expression (8), naturally encapsulates this modular structure, since its components are variation independent, and each uniquely parameterizes a local family. Reference conditioning does not depend on the order of the $\eta$-components, whereas, when applied to an alternative $\mathcal{D}$-modular parameterization, the ordered groupings follow automatically from expression (4) of definition 1.

Reference conditioning is invariant within the class of $\mathcal{D}$-modular parameterizations. Arguably, this property may be shared by other base-line measures, distinct from the reference measure. We contend, however, that the reference measure is model intrinsic and appears as a very natural choice when trying to generalize Jeffreys conditioning: indeed the latter becomes a special case when no ordering of the variables is specified, or equivalently when no modular structure is entertained, as for example with undirected graphical models.

Reference conditioning only requires the specification of a prior distribution for the parameter of the most complex (possibly complete) model; prior distributions for the parameters of each submodel are then induced through conditioning. In particular, for Gaussian DAG models, we show that, if the starting prior satisfies the property of global parameter independence, this property propagates through reference conditioning, thus facilitating prior-to-posterior inference under complete sampling. Furthermore prior modularity is automatically satisfied. Similar results hold for discrete DAG models, wherein the random variables are jointly distributed according to a multinomial law; see Leucari and Consonni (2003).

The scope of prior modularity is indeed more restrictive in our approach than in the original formulation of Heckerman et al. (1995), wherein the whole class of Markov equivalent DAGs is considered. This is, however, consistent with our starting assumption, namely that the ordering of the variables is fixed, so that at most one specific DAG for each equivalence class need be considered. This restriction offers the bonus of extra flexibility when it comes to prior modelling: for example, in the Gaussian case, we are not confined to using the inverse Wishart family, as instead we ought to if prior modularity, together with global parameter independence, were required for the entire class of Markov equivalent DAGs; see Geiger and Heckerman (2002).

Acknowledgements

This work has greatly benefited from the comments of an Associate Editor and two referees. In particular the detailed comments of one referee on the general issue of conditioning were especially illuminating, as were the ensuing clarifications that were provided by Patrizia Berti and Pietro Rigo. We extend our thanks to Christian Robert for helpful discussions during our stay at the Centre de Recherche en Economie et Statistique, Paris, through a European Union training and mobility of researchers award (ERB-FMRX-CT96-0096) visit, as well as to David Madigan,


Department of Statistics and Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University, for hosting us during the revision of this paper. Financial support was provided by the Ministero dell'Istruzione, dell'Università e della Ricerca, Rome ('Progetti di ricerca di interesse nazionale', 2001), the University of Pavia ('Research project on the statistical analysis of complex systems') and the Center for Discrete Mathematics and Theoretical Computer Science.

Appendix A: Proofs

A.1. Proof of proposition 1
Let $\mathcal{D}^*$ be any DAG in the set of Markov equivalent DAGs $[\mathcal{D}]$ and let $\eta^*$ denote the standard parameterization of $\mathcal{M}^{\mathcal{D}^*}$. We denote by $\phi^*$ the alternative parameterization of $\mathcal{M}^{\mathcal{D}^*}$ defined by expression (8) and by $\Phi^* = \{\phi^*_{ij}\}$ the matrix that is obtained from $\phi^*$ as described in Section 5, so that $\Sigma^{*-1} = \Phi^{*\mathrm{T}}\Phi^*$. It is worth pointing out that, although $\Sigma^{-1}$ is the same for all the DAGs in $[\mathcal{D}]$, by deriving it from $\Phi^*$ we need to specify some well-ordering $(a_1, \ldots, a_v)$ of the vertices of $\mathcal{D}^*$, so that also the rows and columns of $\Sigma$ will follow the same ordering. Two Markov equivalent DAGs differ in the well-ordering of their vertex sets, and we add an asterisk to $\Sigma^{-1}$ when row and column well-ordering are of relevance for the operation being performed, and similarly for $\Sigma$ and $P$.

We first show that $\Sigma^{-1}$ is $\mathcal{D}^*$ modular. Consider the group ordering of the unconstrained entries of $\Sigma^{-1}$ given by $(\sigma^{*1}, \ldots, \sigma^{*v})$, with $\sigma^{*i} = (\sigma^{ij},\ j \in \mathrm{fa}^*(i))$, where $\mathrm{fa}^*(i)$ denotes the family of $i$ with respect to $\mathcal{D}^*$. Note that $\sigma^{*i}$ is made up of the unconstrained entries of the $i$th row of $\Sigma^{*-1}$. It can be easily checked that the first row of $\Sigma^{*-1}$ is a function of the first row of $\Phi^*$, namely $\sigma^{*1} = \sigma^{*1}(\phi^*_1)$, the second row of $\Sigma^{*-1}$ is a function of the first two rows of $\Phi^*$, $\sigma^{*2} = \sigma^{*2}(\phi^*_1, \phi^*_2)$, and so on. Thus this grouping satisfies definition 1 and we can conclude that $\Sigma^{-1}$ is $\mathcal{D}^*$ modular.

We proceed in a similar way for $\Sigma$. Consider the grouping $(\sigma_{*1}, \ldots, \sigma_{*v})$ with $\sigma_{*i} = (\sigma_{ij},\ j \in \mathrm{fa}^*(i))$, so that $\sigma_{*i}$ is made up of the unconstrained entries of the $i$th row of $\Sigma^*$. Recalling that, for all $i = 1, \ldots, v$, with $\mathrm{pr}(i) = \{i, \ldots, v\}$, $\Sigma_{\mathrm{pr}(i),\mathrm{pr}(i)} = (\Phi^{\mathrm{T}}_{\mathrm{pr}(i),\mathrm{pr}(i)}\,\Phi_{\mathrm{pr}(i),\mathrm{pr}(i)})^{-1}$ (see Roverato (2000)), we obtain that $\sigma_{*i}$ is a function of the rows $i, \ldots, v$ of $\Phi^*$, i.e. $\sigma_{*i} = \sigma_{*i}(\phi^*_i, \phi^*_{i+1}, \ldots, \phi^*_v)$. As a consequence $\Sigma$ is $\mathcal{D}^*$ modular.

To show that also $P$ is $\mathcal{D}^*$ modular we put $(\rho_{*1}, \ldots, \rho_{*v})$ with $\rho_{*i} = (\sigma_{ii}, \rho_{ij};\ j \in \mathrm{pa}^*(i))$. It is easy to check that $\rho_{*i} = \rho_{*i}(\sigma_{*i}, \ldots, \sigma_{*v})$ for $i = 1, \ldots, v$, so that $\rho_{*i}$ is itself a function of $(\phi^*_i, \phi^*_{i+1}, \ldots, \phi^*_v)$, and this completes the proof.

A.2. Proof of theorem 2
Firstly, the parameter $\phi$ is made up of variation independent groups $(\phi_1, \ldots, \phi_v)$ and it is clear from equation (3) that the Fisher information matrix for $\phi$ is block diagonal with $i$th block given by

$$H_{ii}(\phi) = -\mathrm{E}\left[\frac{\partial^2}{\partial\phi_i\,\partial\phi_i^{\mathrm{T}}}\,\log\{p(X_i \mid X_{\mathrm{pa}(i)}, \phi_i)\}\right], \tag{9}$$

where $X_{\mathrm{fa}(i)} \sim N(0, \Sigma_{\mathrm{fa}(i),\mathrm{fa}(i)})$ and $p(\cdot \mid \cdot)$ denotes the density function of the conditional distribution of $X_i$ given $X_{\mathrm{pa}(i)}$. Therefore, the computation of the $i$th block of $H(\phi)$ involves only the distribution of $X_{\mathrm{fa}(i)}$. More precisely, $p(x_i \mid x_{\mathrm{pa}(i)}, \phi_i)$ in equation (9) is the density of the first local regression of any complete DAG model for $X_{\mathrm{fa}(i)}$ with $X_i$ as pure response variable, and $H_{ii}(\phi)$ can be computed locally with respect to such a complete DAG model.

Let $\mathcal{D}^i = (\mathrm{fa}(i), E^i)$ be any complete DAG such that $i$ is the first vertex, i.e. the set of parents of $i$ in $\mathcal{D}^i$ is $\mathrm{pa}(i)$, and assume that the rows and columns of $\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)}$ are ordered according to the vertex ordering. In this way, the upper triangular matrix $\Phi^i$ that is obtained from the Cholesky decomposition $\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)} = (\Phi^i)^{\mathrm{T}}\Phi^i$ provides the $\phi$-parameterization of $\mathcal{M}^{\mathcal{D}^i}$, and the first row of $\Phi^i$ is $\phi_i$ as in equation (9).

We first give the Fisher information matrix for the inverse variance, $H(\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)})$, and derive the Fisher information for $\Phi^i$ by using the relationship $H(\Phi^i) = J^{\mathrm{T}}\,H(\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)})\,J$, where $J$ is the Jacobian of the transformation from $\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)}$ to $\Phi^i$. In $H(\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)})$ we take the distinct elements of $\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)}$ ordered according to its rows, and similarly for $H(\Phi^i)$. Thereby, $H(\Phi^i)$ is block diagonal and its first block is $H_{ii}(\phi)$.

Hereafter we make use of the theory related to commutation, duplication and elimination matrices. For a $p \times p$ matrix $A$ let $\mathrm{vec}(A)$ be the $p^2$ column vector made up of the columns of $A$ and $\mathrm{vech}(A)$ the $p(p+1)/2$ column vector that is obtained by stacking the columns of $A$ from the principal diagonal downwards.


The commutation matrix $K_{pp}$ is defined as the $p^2 \times p^2$ matrix such that $\mathrm{vec}(A^{\mathrm{T}}) = K_{pp}\,\mathrm{vec}(A)$ for any $p \times p$ matrix $A$; the duplication matrix $D_p$ is the $p^2 \times p(p+1)/2$ matrix such that $\mathrm{vec}(A) = D_p\,\mathrm{vech}(A)$ for any symmetric $p \times p$ matrix $A$; and the elimination matrix $L_p$ is the $p(p+1)/2 \times p^2$ matrix such that $\mathrm{vech}(A) = L_p\,\mathrm{vec}(A)$ for any $p \times p$ matrix $A$ (see Lütkepohl (1996), pages 8-9); we refer to Lütkepohl (1996) for a description of the properties of such matrices.

The Fisher information matrix for $\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)}$ has the form

$$H(\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)}) = \tfrac{1}{2}\,D_p^{\mathrm{T}}(\Sigma_{\mathrm{fa}(i),\mathrm{fa}(i)} \otimes \Sigma_{\mathrm{fa}(i),\mathrm{fa}(i)})D_p,$$

where $\otimes$ denotes the Kronecker product (see Gutiérrez-Peña and Smith (1995), equation (18)) and $p$ is the cardinality of the set $\mathrm{fa}(i)$. One can check that

$$H(\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)}) = \tfrac{1}{2}\,D_p^{\mathrm{T}}\{\Psi^i(\Psi^i)^{\mathrm{T}} \otimes \Psi^i(\Psi^i)^{\mathrm{T}}\}D_p = \tfrac{1}{2}\,D_p^{\mathrm{T}}\{(\Psi^i \otimes \Psi^i)((\Psi^i)^{\mathrm{T}} \otimes (\Psi^i)^{\mathrm{T}})\}D_p = \tfrac{1}{2}\,[D_p^{\mathrm{T}}(\Psi^i \otimes \Psi^i)][D_p^{\mathrm{T}}(\Psi^i \otimes \Psi^i)]^{\mathrm{T}},$$

where $\Psi^i = (\Phi^i)^{-1}$.

The Jacobian $J$ of the transformation $\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)} \to \Phi^i$ is given in Lütkepohl (1996), section 10.5.4, expression (3), and has the form

$$J = 2D_p^{+}\{(\Phi^i)^{\mathrm{T}} \otimes I_p\}L_p^{\mathrm{T}},$$

where $I_p$ is the $p \times p$ identity matrix and $D_p^{+} = (D_p^{\mathrm{T}}D_p)^{-1}D_p^{\mathrm{T}}$. Making use of some known matrix algebra results (see Lütkepohl (1996), section 2.4, expression (5), section 9.2.1, expression (16), section 9.2.2, expression (2), and section 9.5.2, expression (1)) we have that

$$J^{\mathrm{T}}[D_p^{\mathrm{T}}(\Psi^i \otimes \Psi^i)] = [2D_p^{+}\{(\Phi^i)^{\mathrm{T}} \otimes I_p\}L_p^{\mathrm{T}}]^{\mathrm{T}}[D_p^{\mathrm{T}}(\Psi^i \otimes \Psi^i)] = L_p(\Phi^i \otimes I_p)(I_{p^2} + K_{pp})(\Psi^i \otimes \Psi^i)$$
$$= L_p(\Phi^i \otimes I_p)(\Psi^i \otimes \Psi^i)(I_{p^2} + K_{pp}) = L_p(\Phi^i\Psi^i \otimes \Psi^i)(I_{p^2} + K_{pp}) = L_p(I_p \otimes \Psi^i)(I_{p^2} + K_{pp}) = 2L_p(I_p \otimes \Psi^i)D_pD_p^{+}.$$

Hence, we can write

$$H(\Phi^i) = J^{\mathrm{T}}\,H(\Sigma^{-1}_{\mathrm{fa}(i),\mathrm{fa}(i)})\,J = M(2D_p^{+}D_p^{+\mathrm{T}})M^{\mathrm{T}},$$

where $M = L_p(I_p \otimes \Psi^i)D_p$. It can be checked that $M$ is block diagonal and that its first block, corresponding to $\phi_i$, is $\Psi^i$. The matrix $D_p^{+}D_p^{+\mathrm{T}}$ is described in Lütkepohl (1996), section 9.5.1, equation (11), and it is such that the submatrix of $H(\Phi^i)$ corresponding to $\phi_i$ can be written as

$$H_{ii}(\phi) = \Psi^i(2 \oplus I)(\Psi^i)^{\mathrm{T}}. \tag{10}$$

Secondly, to construct the reference prior on $\phi = (\phi_1, \ldots, \phi_v)$ we make use of a result in Datta and Ghosh (1995). Assume that the Fisher information matrix for $\phi$ is block diagonal, $H(\phi) = \mathrm{diag}\{H_{11}(\phi), H_{22}(\phi), \ldots, H_{vv}(\phi)\}$, with $H_{ii}(\phi)$ a square submatrix of dimension $\dim(\phi_i)$. If the determinant of $H_{ii}(\phi)$ can be factorized as

$$|H_{ii}(\phi)| = a_i(\phi_i)\,b_i(\phi_1, \ldots, \phi_{i-1}, \phi_{i+1}, \ldots, \phi_v), \qquad i = 1, \ldots, v, \tag{11}$$

for some positive functions $a_i(\cdot)$ and $b_i(\cdot)$, then the ordered group reference prior relative to $\phi$ does not depend on the order of the $v$ groups and has density with respect to Lebesgue measure $r(\phi) \propto \prod_{i=1}^{v} a_i(\phi_i)^{1/2}$. For conditions under which this prior is unique see Gutiérrez-Peña and Rueda (2003).

We now prove that the determinant of $H_{ii}(\phi)$ in equation (10) can be factorized as in equation (11) with $a_i(\phi_i) = 1/\phi_{ii}^2$, so that

$$r(\phi) \propto \prod_{i=1}^{v} a_i(\phi_i)^{1/2} = \prod_{i=1}^{v} \frac{1}{\phi_{ii}}. \tag{12}$$


Now partition $\Phi^i$ and $\Psi^i$ as

$$\Phi^i = \begin{pmatrix} \phi^i_{11} & \phi^i_{1\,\mathrm{pa}(i)} \\ 0 & \Phi^i_{\mathrm{pa}(i),\mathrm{pa}(i)} \end{pmatrix} \qquad \text{and} \qquad \Psi^i = \begin{pmatrix} \psi^i_{11} & \psi^i_{1\,\mathrm{pa}(i)} \\ 0 & \Psi^i_{\mathrm{pa}(i),\mathrm{pa}(i)} \end{pmatrix}.$$

The determinant of $H_{ii}(\phi)$ can now be written as

$$|H_{ii}(\phi)| = 2(\psi^i_{11})^2\,|\Psi^i_{\mathrm{pa}(i),\mathrm{pa}(i)}|^2,$$

and we show that $\Psi^i_{\mathrm{pa}(i),\mathrm{pa}(i)}$ is a function of $(\phi_{i+1}, \ldots, \phi_v)$ whereas $\psi^i_{11}$ is a function of $\phi_i$. If we put $\mathrm{pr}(i) = \{i + 1, \ldots, v\}$, then $\mathrm{pa}(i) \subseteq \mathrm{pr}(i)$ and $\Sigma_{\mathrm{pa}(i),\mathrm{pa}(i)} = \Psi^i_{\mathrm{pa}(i),\mathrm{pa}(i)}(\Psi^i_{\mathrm{pa}(i),\mathrm{pa}(i)})^{\mathrm{T}}$ is a submatrix of $\Sigma_{\mathrm{pr}(i),\mathrm{pr}(i)}$. Hence $\Psi^i_{\mathrm{pa}(i),\mathrm{pa}(i)}$ is a function of $\Sigma_{\mathrm{pr}(i),\mathrm{pr}(i)}$. Let $\Phi = \{\phi_{ij}\}$ denote the matrix, defined in Section 5, made up of the elements of $\phi$, so that $\Sigma^{-1} = \Phi^{\mathrm{T}}\Phi$. It is not difficult to check that, because of the upper triangular form of $\Phi$,

$$\Sigma^{-1}_{\mathrm{pr}(i),\mathrm{pr}(i)} = \Phi^{\mathrm{T}}_{\mathrm{pr}(i),\mathrm{pr}(i)}\,\Phi_{\mathrm{pr}(i),\mathrm{pr}(i)}$$

for all $i = 1, \ldots, v$ (see also Roverato (2000)). Thus $\Sigma_{\mathrm{pr}(i),\mathrm{pr}(i)}$ is a function of $(\phi_{i+1}, \ldots, \phi_v)$ and the same is true for $\Psi^i_{\mathrm{pa}(i),\mathrm{pa}(i)}$.

Because of the upper triangular form of $\Psi^i = (\Phi^i)^{-1}$ it holds that $\psi^i_{11} = 1/\phi^i_{11}$. By standard results on the multivariate normal distribution (see Whittaker (1990), page 143), $(\phi^i_{11})^2 = 1/\sigma_{ii\cdot\mathrm{pa}(i)}$. Since, by expression (8), $\phi_{ii} = 1/\sqrt{\sigma_{ii\cdot\mathrm{pa}(i)}}$, it turns out that $\phi^i_{11} = \phi_{ii}$ and $\psi^i_{11} = 1/\phi_{ii}$. We can conclude that, with respect to equation (11), we can put $a_i(\phi_i) = 1/\phi_{ii}^2$ and $b_i(\phi_{i+1}, \ldots, \phi_v) = 2\,|\Psi^i_{\mathrm{pa}(i),\mathrm{pa}(i)}|^2$, from which equation (12) follows.

References

Andersson, S. A., Madigan, D. and Perlman, M. D. (1997) On the Markov equivalence of chain graphs, undirected graphs and acyclic digraphs. Scand. J. Statist., 24, 81-102.

Berger, J. O. and Bernardo, J. M. (1992) On the development of the reference prior method. In Bayesian Statistics 4 (eds J. M. Bernardo, J. O. Berger, D. V. Lindley and A. F. M. Smith), pp. 35-60. Oxford: Oxford University Press.

Bernardo, J. M. (1979) Reference posterior distributions for Bayesian inference (with discussion). J. R. Statist. Soc. B, 41, 113-147.

Bernardo, J. M. and Smith, A. F. M. (1994) Bayesian Theory. Chichester: Wiley.

Consonni, G. and Veronese, P. (2003) Enriched conjugate and reference priors for the Wishart family on symmetric cones. Ann. Statist., 31, in the press.

Cowell, R. G., Dawid, A. P., Lauritzen, S. L. and Spiegelhalter, D. J. (1999) Probabilistic Networks and Expert Systems. New York: Springer.

Cox, D. R. and Wermuth, N. (1996) Multivariate Dependencies: Models, Analysis and Interpretation. London: Chapman and Hall.

Datta, G. S. and Ghosh, M. (1995) Some remarks on non-informative priors. J. Am. Statist. Ass., 90, 1357-1363.

Datta, G. S. and Ghosh, M. (1996) On the invariance of noninformative priors. Ann. Statist., 24, 141-159.

Dawid, A. P. (1981) Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68, 265-274.

Dawid, A. P. and Lauritzen, S. L. (2001) Compatible prior distributions. In Bayesian Methods with Applications to Science, Policy and Official Statistics: Proc. 6th Wrld Meet. International Society for Bayesian Analysis (ed. E. I. George), pp. 109-118. Luxembourg: Office for Official Publications of the European Communities. (Available from http://www.stat.cmu.edu/ISBA/index.html.)

Frydenberg, M. (1990) The chain graph Markov property. Scand. J. Statist., 17, 333-353.

Geiger, D. and Heckerman, D. (2002) Parameter priors for directed acyclic graphical models and the characterization of several probability distributions. Ann. Statist., 30, 1412-1440.

Gutiérrez-Peña, E. and Rueda, R. (2003) Reference priors for exponential families. J. Statist. Planng Inf., 110, 35-54.

Gutiérrez-Peña, E. and Smith, A. F. M. (1995) Conjugate parameterizations for natural exponential families. J. Am. Statist. Ass., 90, 1347-1356.

Heckerman, D., Geiger, D. and Chickering, D. (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn., 20, 197-243.

Kass, R. E. and Raftery, A. E. (1995) Bayes factors. J. Am. Statist. Ass., 90, 773-795.

Lauritzen, S. L. (1996) Graphical Models. Oxford: Oxford University Press.

Lauritzen, S. L. (2001) Causal inference from graphical models. In Complex Stochastic Systems (eds O. E. Barndorff-Nielsen, D. R. Cox and C. Klüppelberg), pp. 63-107. Boca Raton: Chapman and Hall-CRC.

Lauritzen, S. L. and Richardson, T. S. (2002) Chain graph models and their causal interpretations (with discussion). J. R. Statist. Soc. B, 64, 321-361.

Leucari, V. and Consonni, G. (2003) Compatible priors for causal Bayesian networks. In Bayesian Statistics 7 (eds J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West), pp. 597-606. Oxford: Oxford University Press.

Lütkepohl, H. (1996) Handbook of Matrices. Chichester: Wiley.

Pearl, J. (2000) Causality: Models, Reasoning and Inference. Cambridge: Cambridge University Press.

Roverato, A. (2000) Cholesky decomposition of a hyper inverse Wishart matrix. Biometrika, 87, 99-112.

Roverato, A. (2002) Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scand. J. Statist., 29, 391-411.

Spiegelhalter, D. and Lauritzen, S. L. (1990) Sequential updating of conditional probabilities on directed graphical structures. Networks, 20, 579-605.

Spirtes, P., Glymour, C. and Scheines, R. (1993) Causation, Prediction and Search. New York: Springer.

Verma, T. and Pearl, J. (1990) Equivalence and synthesis of causal models. In Proc. 6th Conf. Uncertainty in Artificial Intelligence (eds P. Bonissone, M. Henrion, L. N. Kanal and J. F. Lemmer), pp. 255-270. Amsterdam: North-Holland.

Wermuth, N. (1980) Linear recursive equations, covariance selection and path analysis. J. Am. Statist. Ass., 75, 963-972.

Whittaker, J. (1990) Graphical Models in Applied Multivariate Statistics. Chichester: Wiley.
