department of philosophypages.shanti.virginia.edu/paul_humphreys_home_page/... · a critical...

A CRITICAL APPRAISAL OF CAUSAL DISCOVERY ALGORITHMS

Paul Humphreys

Department of Philosophy

University of Virginia

The research program described in detail in Spirtes,

Glymour and Scheines 1993 and outlined by Clark Glymour in the

paper published in this volume (Glymour 1995) has more than one

goal. One of these goals is to provide a connection between

causal structure and probabilities by using an explicit

axiomatization that links directed graphs with probability

distributions over the vertices of those graphs. In using an

axiomatic approach, these authors approach the project of

connecting probability and causation from a different direction

than most other authors, for in the philosophical literature

connections between probability and causation have usually taken

the form of explicit definitions. Sometimes these definitions

attempt to reduce causation to probabilistic concepts (e.g.

Suppes 1970), in other approaches the procedure is to build up

complex causal relations from simple causal relations taken as

primitive, using probabilistic invariance as the defining

characteristic of causal relations (e.g. Humphreys 1989).

The axiomatic approach has, in contrast to these

philosophical efforts, been used to good effect in Judea Pearl's

earlier work on probabilistic inference in artificial

intelligence (Pearl 1988). Spirtes, Glymour, and Scheines'

(hereafter referred to as SGS) interest is also in making

inferences, the idea being that on the basis of statistical

independence and dependence relations between variables perhaps

supplemented by auxiliary information, we can infer the presence

or absence of causal relations between those variables and

represent the causal relations by edges in a directed graph. I

shall address later the important question of the exact sense in

which these graphs are actually representations of causal

relations, rather than of probabilistic dependency relations,

but for now it bears emphasizing that Glymour's account is not a

probabilistic theory of causation in the sense usually

understood by philosophers -- the relations between variables

are deterministic for Glymour, and increases in probability play

no role for him in identifying causal relations.1

A second goal is to provide computationally feasible

algorithms based on those axioms. These algorithms are designed

to discover at least some of the causal relations between

specified variables

of scientific interest and, importantly, SGS claim to be able to

sometimes make these discoveries without using theoretical

background knowledge about which variables can and cannot be

3

causally connected. In doing this, these algorithms are said to

have an advantage over existing methods of inferring causal

structure, such as structural equation models, which do

explicitly have to rely on non-statistical assumptions about

what are plausible and implausible candidates for causal

relations between variables. Such assumptions are often called,

very misleadingly in my view, `a priori' assumptions.2 It is

better to view them as essentially using background knowledge

about causal relations, where this background knowledge does not

come from the data. This background knowledge may be based on

explicit theory or it may be tacit practitioner's knowledge or

it may be just `common sense' but it will rarely be a priori in

the traditional philosophical sense of knowable independently of

any particular empirical experience.

The extent to which the SGS algorithms require background

knowledge varies, and their presentation of their case sometimes

makes the data driven nature of their enterprise seem purer than

it actually is. But as far as I can tell, they never make the

extreme claim that data analysis alone can give a unique causal

graph. Sometimes, a time ordering on the vertices, together

with other assumptions, will do (SGS 1993, p. 133: more on this

in section II below) and indeed all of their algorithms employ

that information (ibid, p.127); in other examples background

4

causal knowledge is incorporated into the programs (ibid,

p.127). Yet they also claim that their programs recovered

almost all of a 37 variable model of an emergency medical system

without any prior information about the ordering of the

variables (ibid., pp.11, 146). It would be more accurate, I

think, to portray their primary interest as an investigation of

to what extent nonexperimental methods can arrive at causal

conclusions, and under what assumptions (e.g. ibid, p.vii, 3,

10, 22).

A third goal is to develop methods for predicting the

results of interventions in existing causal structures. I shall

not have much to say here about this third goal, for it has been

discussed in admirable detail by James Woodward and David

Freedman in their contributions to this volume (Woodward 1995,

Freedman 1995). Fourthly, the research program suggests that

most existing methods of representing causal structure,

including regression, factor analysis, structural equation

models and so on--essentially the whole spectrum of contemporary

econometric and sociometric causal methods--are seriously

inadequate to their tasks. Finally, and perhaps most

surprisingly, SGS's approach does all this in a way that is

supposedly free from the need to say explicitly what causation

is or to provide a definition of causation. To a lesser extent

5

SGS are inexplicit about what is the appropriate interpretation

of probability to use in their approach. This ability to work

with (explicitly) undefined terms within an axiomatic framework

is, of course, one of the great methodological advantages of the

axiomatic approach but it also has its well known drawbacks, the

worst being that it results in too broad a class of structures

that satisfy the axioms. We shall see that this is a defect

that is clearly present in SGS's framework, and it undermines

their claim to have given algorithms that discover causal

structure; to eliminate noncausal epistemic relations, the

supplementary information used in the discovery process plays a

crucial role.

By building on the results of Judea Pearl and his research

group, SGS have provided a powerful constructive approach to

some extremely complex and difficult problems. There is no

doubt that their contributions have moved discussions of causal

structures a giant step forward. And I agree with the spirit of

their remarks that an unhealthy strain of scepticism on the part

of philosophers is often an excuse for avoiding hard

constructive thinking about what can be done in this area. It

must be said, though, that this scepticism is by no means

confined to philosophers, for statisticians have tended to be at

least as sceptical about these matters. (See e.g. Freedman 1983,

6

1995). Yet I find their belief that an intuitive understanding

of causation can carry us a long way in using their methods

quite wrong. In fact, when we come to closely examine the SGS

methods, we shall find that not only is it hard to put a

consistent interpretation on the axioms, but it is by no means

obvious that they have any real causal content at all. As I

shall argue, the most plausible construal of the graphs is that

they are not representations of causal relations understood in

some robustly realistic sense of causation, but are devices for

representing epistemic dependency relations.

I

Let us begin with the relation between graphs and

probability, which will be the principal focus of my remarks.

Some straightforward and obvious comments about the assumptions

that SGS use are in order because it is easy for elementary

points to get lost once one becomes immersed in the complex

details of the SGS programme. In essence, SGS's idea is to get

at the notoriously elusive concept of causation by showing that

certain causal relations, represented in graph-theoretic form,

entail certain probabilistic relations, primarily stochastic

independence and dependence. Because there is a sense that

direct contact between probabilistic relations and empirical

data is possible through relative frequency values or subjective

7

probability values in a way that direct contact between causal

relations and empirical data is not, the SGS methodology

hypothesizes that certain causal relations entail specific

probabilistic dependency relations. Using those entailment

relations, one can at the very least eliminate the possibility

of a causal connection between two variables by virtue of

certain observed statistical independencies. There are many

questions that arise from this rich apparatus. Here I shall

focus on the following three:

1. Do the assumptions on which the representational apparatus is

based contain or presuppose specifically causal content?

2. What is the correct interpretation of causation to use with

these methods?

3. What is the status of the representational apparatus that SGS

use to capture causal relations?

I begin with the first of these but I want to stress that

these three questions are intimately linked and one cannot

properly answer the first without a clear answer to the second

and third. Some of the assumptions on which the algorithms are

based are explicitly cited in Glymour (1995), others are to be

found in SGS (1993).

1.The Markov Assumption. This has a number of different

formulations but one of the most general is:

8

Given a directed acyclic graph (DAG) G over a set of

vertices V representing random variables, and a joint

probability distribution P(V) over V, then <G,P> satisfies the

Markov condition iff for every vertex W in V, then given Parents

(W), W is probabilistically independent of any other vertex X ε

V - {Descendants (W)} U {Parents (W)}.

2. The Faithfulness Assumption. The probability distribution P

that is associated with a DAG G representing the causal

structure of a system is such that every conditional

probabilistic independence relation in P follows from the Markov

condition applied to G. Roughly, there are no "accidental"

independencies.

3. Working Assumption. The distribution of observed variables in

a population is the marginal of a distribution satisfying the

Markov and faithfulness conditions.

4. Causal Sufficiency. The set of vertices of the DAG used to

represent causal structure is closed under common causes; i.e.

every common cause of measured variables is explicitly included

in the model.

5. No Side Affects Assumption. An ideal intervention that

directly manipulates T may alter the distribution of T

conditional on the parents of T, but the distribution of every

other variable conditional on its parents is unchanged.

9

I have here noted only those assumptions used in

determining the causal structure when the model is correctly

specified--i.e. when all the relevant variables have been

identified and included in the model and the goal is to discover

which variables are causes of which others. One exception here

is that error variables are not explicitly included in the

model, but are always assumed to be present for every variable

in the system. The reason for this is that if one variable is a

(deterministic) function of some others then the discovery

procedure can give the wrong results.3 So to avoid this, some

source of purely exogenous stochastic variation is attached to

every variable in the model, and these error variables are

mutually independent. When this condition is satisfied the

graph is called pseudo-indeterministic. We add this as an

explicit assumption.

6. Pseudo-Indeterminism. Each vertex of the DAG has an

unrepresented single parent (an `error variable') which is

independent of every other such error variable and which has a

non-degenerate distribution. All DAGs considered are pseudo-

indeterministic.

(I note here that the independence of the error terms

follows from the Causal Sufficiency assumption, and that the

Causal Sufficiency assumption is dropped for partially oriented

10

inducing path graphs.)

In addition to these assumptions, there is one definition

that lies at the core of the SGS algorithms. This is the

definition of d-connectedness, and it is here that I find SGS's

position on causation difficult to understand.

Definition. Two vertices X, Y in a DAG G, where (X ≠ Y) and X,

Y do not belong to Z, are d-connected by a set of vertices Z if

and only if there is an undirected path U between X and Y of the

form (1) every collider in U has a descendant in Z (including

the null descendant) and (2) every other vertex on U is outside

Z. X, Y are d-separated by Z just in case they are not d-

connected by Z.

The concept of d-connectedness is not easy to grasp, but

what it means is this: Suppose we have the case:

Then X,Y are d-connected by {V} but d-separated by {V,A}. Thus,

if we knew V, then from a knowledge of X, we could reliably

infer what value Y would have.

11

This concept of d-separation plays a key role in both the

SGS algorithm and the PC algorithm used by SGS because it is

the principal device by means of which directions are given to

edges in a DAG on the basis of statistical dependency relations.

As SGS put it "d-separation in fact characterizes all and only

the conditional independence relations that follow from

satisfying the Markov condition for a directed acyclic graph"

((1993), p.72).4 The definition itself is in purely graph

theoretic terms--there are no causal concepts involved. The

first question I want to raise here is this: Is there a

consistent causal interpretation of d-separation that can

motivate its use in the algorithms used by SGS to arrive at

causal structures?

Before I discuss this, let's be clear about what the

perceived difficulty is. At the start of his (1995) paper,

Glymour asserts "While I will rely on some intuitions about

relevant features of the notion of causation, I avoid throughout

any jejune attempt at definition." That approach is fine, and

fits well with the axiomatically based apparatus he has

developed. The method seems to be this: We begin with a

preformal set of intuitions about causation, and about the

relations between causal connectedness and probabilistic

dependence. Then the Working Assumption (number 3 above) is used

12

as the basis of justifying the claim that every observed

statistical independence relation is a result of the causal

relations embodied in the DAG and the Markov condition. That is,

the claim is that whenever one has a purported example of

statistical independence that does not result from a causal

connection, it falls into one of a set of recognizable

misapplications of the method. These include, for example,

nonhomogeneity of the population to which the model is applied

or underspecification of node variables.

But SGS also claim (1993, p. 41) in their chapter on axioms

and their interpretations: "We advocate no definition of

causation, but in this chapter we try to make our usage

systematic, and to make explicit our assumptions connecting

causal structure with probability, counterfactuals, and

manipulations. With suitable metaphysical gyrations the

assumptions could be endorsed from any of these points of view,

perhaps including even the [view that prefers not to talk of

causation at all]." I do not think that SGS have been successful

in keeping their use systematic, and I shall show that only an

epistemic interpretation fits their apparatus as presented.

When Judea Pearl introduced the concept of d-

connectedness, it was within a framework where inferential

connections were the primary focus. In fact, one of the

13

principal reasons for using directed graphs rather than

undirected graphs is because of the latter's inability to

correctly represent cases of common effects. Pearl uses an

example in which two coins are independently flipped and a bell

rings if and only if both coins are the same. When we know that

the bell rang (and when we know it did not) we can infer with

certainty from the outcome of one coin toss what the other coin

outcome was. This induced dependency between the coin outcomes

is clearly epistemic -- it is certainly not causal. As Pearl

says (1988, p.93) "This weakness in the expressive power of

undirected graphs severely limits their ability to represent

informational dependencies". Within what Pearl calls a Bayesian

network, a distinction is made between a joint effect of two

causes (a collider) on the one hand, and a common cause or an

intermediate cause between X and Y on the other. If you examine

the definition of d-connectedness, clause (1) requires that on a

d-connecting path all such colliders be in the conditioning set

or a descendant of a collider be in it. On an epistemic

interpretation, this makes sense. As we saw in the bell

example, learning about the common effect gives us information

about the causes, and by an inference that goes against the

causal direction, a descendant of the common effect can do the

same thing (suppose a dog salivates if and only if the bell

14

rings, for example). Clause (2) requires that one cannot

condition on a common cause or an intermediate cause. So with

d-connectedness, by conditioning on Z, we can learn something

about events further back in the causal network. It is this

asymmetry between common effects and common causes or

intermediate causes that allows the SGS and PC algorithms to

discover colliders in the case where conditioning on the middle

element Z in a sequence X -Z - Y fails to render X and Y

independent. Thus far, we have discussed d-connectedness in

terms of information. But as we have seen, SGS want a much

wider class of interpretations for causal networks than that.

In his paper included in this volume, Glymour is not very

forthcoming about what interpretation to give to d-

separatedness. He says "There is even an intuition to [d-

connectedness] which I will not detail here". (1995, p. ) An

intuition is in fact detailed in the SGS book on pp. 72-73, but

it is not consistent. We are first given an analogy to causal

flow in a pipe, where a collider corresponds to a closed valve

but a common cause and an intermediate cause correspond to an

open valve. This causal flow analogy makes sense along the

lines of an interpretation of causation involving transfer of

energy or transmission of a mark or manipulability, for example.

Next, we are told that conditioning on a open (active) mode

15

converts it to a closed (inactive) mode because we know that

keeping a variable of that type fixed renders X and Y

independent. This makes sense on a manipulability criterion,

and we know that this conversion from one status to another is a

familiar fact about common causes and "screening off". It makes

sense on an inferential view too: once we know the common cause,

knowledge of one effect does not provide further information

about the other, and knowledge of proximate causes renders

knowledge of remote causes irrelevant. But when it comes to

common effects (colliders), SGS (1993, p.72) say merely "that

conditioning on a collider makes it active was noted in section

3.5.2 above".

When we turn to section 3.5.2 we find two types of cases

where this can happen. One involves a "Bayesian example" drawn

from Pearl which involves inference from the known facts that

your car will not start and that your battery is not dead to the

fact that the fuel tank is empty. The other involves a case

related to Simpson's paradox whereby factors independent in the

whole population become independent when conditioned upon a

joint effect factor that subdivides the population frequencies.

Yet the second of these gives us no insight into how causal flow

can be directed through a collider, and since SGS are opposed to

specifically probabilistic accounts of causations, either of the

16

type that relies solely on statistics to characterize causal

relations or that uses probability increases and decreases as

the key to causation, the onus must lie on some other

interpretation of the graph. Certainly a manipulability view of

causation will not account for why a collider becomes active

when it or a descendant is conditionalized upon; nor does it

make sense on a causal flow view. The only view on which it

makes sense is an inferential view where it is as legitimate to

infer from an effect to a cause as vice versa, and in certain

circumstances from one joint cause to another, conditional on

their common effects.

This has two consequences. If the only consistent

interpretation that can be laid on the DAGs is one of causal

inference and not of causation itself, then the liberality of

interpretation suggested by SGS is misplaced. Given the

centrality of d-separation in their discovery procedure, there

is one and only one thoroughgoing interpretation of these

"causal" graphs and that is an epistemic one. Second, when SGS

turn to prediction, they focus almost exclusively on

manipulation as the basis of the No Side Effects Assumption and

its consequences. This must mean that we have to switch causal

horses when moving from one part of the research program to

another.

17

II

It might be said that there is no need to have an

interpretation of d-separation. It could be viewed as an

instrumental device that in some hard to understand way gets us

the right causal graphs in conjunction with the other

assumptions. All that is important is arriving at the correct

causal graphs and the evidence SGS provide in terms of

comparative performance against existing studies does that.

This is a reasonable response, and one that revolves around two

different uses of the axiomatic method: one where all the axioms

are interpreted and truth flows from those axioms to the

theorems and the structures that result from applying the

theorems, and the other where all the truth lies in the

consequences and the axioms pick up whatever interpretation, if

any, that is consistent with those consequences. Suppose we

adopted this second view (which I am not personally in favour

of). An immediate question then arises: how do we know that the

causal structures which result from applying these axioms are

true? Now sometimes SGS declare a successful application

because their result almost conforms to the structure of an

existing model arrived at by other means (e.g. the Rodgers and

Maranto example (Glymour 1995, p. ; SGS 1993, pp. 134 ff).

But on other occasions, such as with the Rindfuss, Bumpass, St.

18

John example (Glymour 1995, p. ; SGS 1993, pp. 139-40) the PC

algorithm is declared to have superior results to existing

methods, in this case because Rindfuss et al initially included

a causal connection from the variable representing the age at

which a first child is born to the variable representing the

education level of women at time of marriage, but they later

found the regression coefficient to be zero, whereas the PC

algorithms deleted that edge in the course of the discovery

process and it was not included in the final model. Now since

causal connections are not directly observable, if we are

employing the second axiomatic strategy where content flows from

the consequences to the axioms, we cannot rely on directly

knowing the truth of the axioms but instead we must rely on our

intuitive or theoretical judgements about causation in

particular cases to validate specific models. (The same point

holds with other cases, such as the Blau and Duncan example, but

I have not discussed those because they employ partially

oriented inducing path graphs rather than the basic apparatus

discussed here). This, of course, is one of the original

questions we raised about these methods: What kind of causal

knowledge, if any, must we have in order to use the SGS methods?

There are (at least) two kinds of causal knowledge that might be

involved. The first is knowledge required to understand the

19

assumptions--this will be in general knowledge about what kind

of causation and probability is involved. The other is

knowledge required to apply the assumptions in particular cases.

Consider in this regard the Markov assumption, upon which much

of the SGS method is built. Within the SGS approach, the Markov

assumption is taken as an axiom, and there are two versions of

it. The first Markov assumption (SGS 1993, p.33) which we

earlier listed as the first assumption, is a purely formal

criterion linking graph structures with probabilistic relations,

and has no causal content. The Causal Markov Condition (ibid,

p.54) is slightly less formal, but not much. It simply asserts:

Let G be a causal graph with vertex set V and P be a probability

distribution over the vertices in V generated by the causal

structure represented by G. G and P satisfy the Causal Markov

Condition if and only if for every W in V, W is independent of

V/(Descendants (W) U Parents (W)) given Parents (W).

But this reference to the causal interpretation of the

graph is unnecessary, because the (formal) Markov condition is a

consequence of a completely acausal condition. To see this,

compare the SGS approach with the one taken by Pearl, within

which the axiomatic formulation was qualitative probabilistic

independence and independency maps. Take the following

definition:

20

Definition. A DAG G is a directed independence graph of P(V) for

an ordering > of the vertices of G if and only if A -> B occurs

in G if and only if ¬(A ∐ B ∣ K(B)), where K(B) is the set of

all vertices V such that V ≠ A and V > B, where A B│Z means

that the joint probability density of A and B, given Z, factors

into the conditional densities of A given Z and of B given Z.

Pearl then proved this result:

If P(V) is a positive distribution, then for any ordering

of the variables in V, P satisfies the Markov and Minimality

conditions for the directed independence graph of P(V) for that

ordering.6

So, my point is this: since, given 1) some ordering of the

vertices (which could be merely temporal ordering) a strictly

positive distribution over the vertices guarantees that the

formal Markov condition will be satisfied, 2) the fact that we

can check whether there are any null probabilistic dependencies

in a purely numerical way, and 3) the fact that SGS can often

eschew prior causal knowledge in favour of temporal ordering or

no ordering at all (SGS 1993, p.112) to arrive at their graphs,

what role does the causal interpretation of the Causal Markov

Condition play in their algorithmic procedures? This question

is particularly pressing in the case of DAGs that are also

21

directed independence graphs. To put the point another way, we

could eliminate the formal Markov condition for those DAG's that

are also directed independence graphs if we had a prior ordering

on the vertices and a strictly positive distribution over those

vertices. SGS state (1993, p. 111) that, aside from

computational inefficiencies of algorithms based on directed

independence graphs (which is not a causal issue), they want to

eliminate the need for a prior ordering on the variables,

presumably because at least in some cases, a time ordering is

inappropriate or unavailable. But then if the causal

interpretation of the graphs in the Causal Markov Condition is

not being used to impose a prior ordering on the vertices, what

role does it play?

III

I turn now to the question of whether the assumptions of

the SGS approach require specific causal information to be

correctly applied. SGS are quite explicit that in general the

SGS and PC algorithms give an equivalence class of causal graphs

rather than a unique causal graph. The algorithms first give an

undirected graph that is usually less than complete (i.e. it is

derived from a completely connected graph by deleting edges) and

then they identify unshielded colliders. (Unshielded colliders

have the form

22

where A is not directly connected to B and vice versa.) This

partially orients the undirected graph, but leaves open a number

of different way to complete the DAG. For example, the graph

could be completed as

Again, in the Rodgers and Maranto example, the graph in

Figure 6 (of Glymour 1995) is but one of a number consistent

with the initial output of the PC algorithm. In the case of

this example, an appeal to time order of the variables is used

23

to get the final graph and to eliminate other members of the

statistically indistinguishable equivalence class. But this

strategy is not always possible. To see this, consider another

case that SGS describe.

Because the assumptions used are graph theoretic and

probabilistic, we can ask whether these methods can model

relationships that are not causal in kind. And so they can. In

the example involving the AFQT test (Glymour 1995, p. ), the

relations involved are classificational rather than causal. The

variables AR (arithmetical reasoning), NO (Numerical

Operations), and WK (Work Knowledge) are, as SGS note,

components of the whole AFQT test and the PC algorithm picks

them out as variables adjacent to AFQT. But AR, NO, WK no more

cause AFQT than 2,3,4 cause 9. They are constitutive components

of it. (Note that if there is at least one other component of

AFQT that is not included in the graph, and this is independent

of AR, NO, WK, then the system can be pseudo-indeterministic.)

It will not do to merely extend our concept of causation to

include constitutive relations, for a standard objection to

counterfactual analyses of causation carries over to the

manipulability view too. It is true that if X's score on the

arithmetical reasoning test had been different, X's AFQT score

would have been different too (assuming the No Side Effects

24

principle and no fortuitous cancelling by the other components)

but the AR score is not a cause of X's AFQT, any more than an

atom having two electrons is a cause of its being a helium atom

-- it is simply part of the defining conditions for what it is

to be a helium atom.8

Here we can perhaps draw a sharp distinction between the

use of these methods as prediction and manipulation devices, and

their use for causal purposes. There is no doubt that from the

AR scores we can partially predict the AFQT score and that by

manipulating the AR score we can change the AFQT score. But

let's not confuse that with discovering what the causes of the

AFQT score are. So if we wish to isolate the causal, as opposed

to the classificational, connections in a graph, we will have to

bring in additional information, and that information will not

always be common sense or temporal order. For example, if the

individual took the AR test first and the NO and WK tests later,

the AFQT score would only be available later than the AR score.

And we might need a significant amount of theoretical knowledge

rather than mere common sense to know which variables are

components of others. (I note here that the Working Assumption

explicitly allows that classificational relations can be

included in a DAG.).

IV

25

Finally, there is an issue that is not discussed by Glymour

in his contribution to this volume that is important and bears

bringing out explicitly. The discovery procedure is designed to

find independence relations and to eliminate the corresponding

edges in the graph, leaving directed edges as representations of

causal connections. But we ought to remember that there are two

kinds of dependencies, positive and negative--in the linear case

these will be represented by positive and negative correlation

coefficients. This difference is obviously important when one

is interested in interventions, as SGS are. Unless we know

whether Y is positively or negatively connected with X we do not

know whether increasing X will increase Y or decrease it, and

without that knowledge, interventions can be counterproductive,

decreasing SAT scores by paying school administrators higher

salaries for example. Of course, the SGS algorithms are

supplemented by estimation procedures to arrive at numerical

values for parameters in the graph, and this will provide the

sign of the coefficients. The point I want to bring out here is

this: SGS discuss the kind of situations within which the

Faithfulness Condition can fail (1993, pp. 64-69), focussing on

contexts within which Simpson's Paradox occurs. Correctly

noting that Simpson's original example involved statistics for

which positive associations within two subpopulations give

26

independence when the subpopulations are combined, SGS prove a

powerful theorem9 (for the continuous case) showing that the set

of parameter values in linear models for which the Faithfulness

Condition fails has measure zero, and hence for pseudo-

indeterministic linear models, we can proceed comfortably with

the knowledge that the Faithfulness Condition will fail only

under probabilistically extraordinary conditions. But the

original Simpson's Paradox is not all SGS need to worry about.

It will be as bad for them if their algorithms, supplemented by

standard statistical methods, erroneously tell us that a pair of

variables are negatively causally connected when they are also

positively connected in subpopulations. Moreover, these kinds

of problems are not limited to Simpson's Paradox. There is a

related and well-known problem in sociological sampling known as

the ecological fallacy10, wherein associations at the group

level, e.g. between mean educational level and racial

composition within counties, are transferred to associations at

the level of individuals. Positive associations at the group

level can arise from negative associations at the individual

level. So it is not enough for SGS to avoid causal structure

ambiguities caused by independence relations in the data not

being properly represented in the graph - they must also, like

every other causal modeller, be concerned with the right

27

population on which to impose their model. That this problem is

not peculiar to their method does not mean it does not need to

be addressed. And then the question is: can we apply the graph

to the correct population without using prior causal knowledge

of which type of association, positive or negative, is the

correct one? Or to be able to rule out as meaningless variables

attached to populations that properly belong only to

individuals? It is implausible that this could be done using

only common sense and hence the problem of signs is one that

seems to require prior causal knowledge to resolve.

V

Let us summarize the project we have been discussing. It

has two aims. One is to provide a computationally feasible

algorithm for identifying the correct causal relations between

variables in models of social phenomena. The other is to

provide an explicit set of assumptions on which correct causal

inference can (in principle) be based. This second part of the

project is of more philosophical interest than the first, but

because that applied aspect is an essential part of the project,

it is perfectly reasonable to ask whether the assumptions on

which these algorithms are based are applicable in real cases

rather than in principle only. It seems clear that some of them

are not. The frequency version of the Markov condition requires

28

that the probability distribution arises from a homogeneous

population i.e. one in which all the units share the same causal

relations (i.e. they all have the same graph). But in the

nonexperimental contexts within which these algorithms are

supposed to be applied, this will rarely if ever be satisfied.

Social variation is far too great for that. Second, there is

extensive use of "prior information" that some variables are not

causes of others. Thus, for example in the use of the PC

algorithm (Glymour 1995, p. ) it is said that ED (the level

of education at the time of marriage) does not cause the other

variables. One of these other variables is REL (the

individual's religious affiliation). Why is it supposed to be

obvious that the level of education at marriage is not a cause

of one's religious affiliation? One knows many women whose

agnosticism is a direct causal effect of their high level of

education. Similarly, one's level of education is frequently a

cause of where in the United States one chooses to live (REGN),

another link supposedly excluded by `prior information'. The

same point can be made with YCIG (cigarette smoking). It might

well be that the associations between these variables and

educational level are due to some relationship other than a

direct causal link in the direction I have suggested, but to

rule such links out a priori is implausible and methodologically

29

suspect.

VI

So, to conclude: the apparatus that Spirtes, Glymour, and

Scheines have developed is one of impressive power and detail.

It moves discussion of causal inference in nonexperimental

contexts to a richly rewarding new terrain and it has the

surpassing virtue of being remarkably explicit about what it is

up to. It is thus a pity that the genuine achievements in their

research are accompanied by such an inexplicit account of the

exact sense in which their graphs are causal. Until this is

remedied, I remain sceptical that their algorithms are genuine

causal discovery engines. Moreover, movement onto new ground

inevitably brings with it some old questions as well as new

ones. So here again are the questions that future work on their

approach needs to answer:

1. In what sense are the DAGs representations of causal

relationships, rather than of conditional probabilistic

dependency relations?

2. Is there a consistent non-epistemic interpretation to the d-

separation condition?

3. (a) How much causal knowledge is actually needed to apply the

algorithms?

(b) In particular, will aggregation problems require

30

background causal knowledge to select the correct level of

causal analysis?

4. Can these methods separate causal graphs from classificatory

graphs?11

31 NOTES

1. Glymour 1995, footnote 8.

2. An explicit example of this can be found in section 3 of

Simon 1954.

3. See Glymour 1995, p. .

4. See also clauses B) and C) of the SGS algorithm (SGS 1993,

p.114), and clauses B) and C) of the PC algorithm (ibid, p.

117).

5. See the reference in note 1 above.

6. This result is cited on p.35 of SGS 1993.

7. Glymour 1995 (p.12-13 ms); SGS 1993, p.96.

8. A similar error can be found in Sosa 1980, sections 1-4.

9. SGS 1993, p.68-9.

10. See e.g. Robinson 1950.

11. I should like to thank David Freedman, Richard Scheines,

Glenn Shafer, Peter Spirtes, and James Woodward for helpful

conversations and correspondence. Not all of them agree

with the positions taken in this paper.

32 REFERENCES

Freedman, D. 1983. "Structural Equation Models: A Case Study".

Technical Report,

Department of Statistics, University of California at

Berkeley.

Freedman, D. 1995. "From Association to Causation Via

Regression" <add citation for this

volume>

Glymour, C. 1995. "A Review of Recent Work on the Foundations

of Causal Inference" <add

citation material for this volume>

Humphreys, P. 1989. The Chances of Explanation. Princeton,

Princeton University Press.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems.

San Mateo, Morgan-Kaufman.

Robinson, W.S., 1950. "Ecological Correlations and the Behavior

of Individuals". American

Sociological Review 15, pp. 351-357.

Simon, H. 1954. "Spurious Correlation: A Causal

Interpretation", Journal of the American

Statistical Association 49, pp. 467-479.

Sosa, E. 1980. "Varieties of Causation", Grazer Philosophische

Studien, 11, pp. 93-103.

Reprinted in Causation, M. Tooley and E. Sosa, Oxford,

Oxford University Press, 1993, pp. 234-242.

Spirtes, P., G. Glymour, and R. Scheines 1993. Causation,

Prediction, and Search. Lecture

Notes in Statistics, 81. New York, Springer-Verlag.

Suppes, P. 1970. A Probabilistic Theory of Causality. Acta

Philosophica Fennica, XXIV.

Amsterdam, North-Holland Publishing Company.

Woodward, James 1995. "Causal Models, Probabilities and

Invariance" <add citation for this

volume>

department of philosophypages.shanti.virginia.edu/paul_humphreys_home_page/... · a critical...

Documents