Samantha Kleinberg, Causality, Probability, and Time, Cambridge University Press. Chapter 5: Inferring Causality, pp. 111–141. Book DOI: http://dx.doi.org/10.1017/CBO9781139207799. Chapter DOI: http://dx.doi.org/10.1017/CBO9781139207799.005.

5

Inferring Causality

Thus far, I have discussed the types of causes that will be identified, how they can be represented as logical formulas, and how the definitions hold up to common counterexamples. This chapter addresses how these relationships can be inferred from a set of data. I begin by examining the set of hypotheses to be tested, the types of data one may make inferences from, and how to determine whether formulas are satisfied directly in this data (without first inferring a model). Next, I discuss how to calculate the causal significance measure introduced in the previous chapter (εavg) in data, and how to determine which values of this measure are statistically significant. I then address inference of relationships and their timing without prior knowledge of either. The chapter concludes by examining theoretical issues including the computational complexity of the testing procedures.

5.1. Testing Prima Facie Causality

Chapter 4 introduced a measure for causal significance and showed how probabilistic causal relationships can be represented using probabilistic temporal logic formulas. This representation allows efficient testing of arbitrarily complex relationships. In this chapter, I adapt standard PCTL model checking procedures to validate formulas directly in a set of time series data without first inferring a model (as this can be computationally complex or infeasible in many cases).

5.1.1. The set of hypotheses

The initial hypotheses are first tested to determine which meet the conditions for prima facie causality, being earlier than and raising the probability of their effects. These are a set of formulas of the form:

c ↝≥r,≤s e,    (5.1)


where c and e are PCTL state formulas, 1 ≤ r ≤ s ≤ ∞, and r ≠ ∞. To form this set, the simplest case is when we have some knowledge of the system and either explicitly state the formulas that may be interesting or use background information to generate the set. Data on risk factors for disease may include gender, race, and age group, but we can avoid generating formulas for scenarios that are mutually exclusive or which cannot change, so that while a person may be of multiple races, these cannot change over time, and there are also constraints such that a person cannot be simultaneously elderly and a child. Similarly, we may not know the exact connection between neurons in a particular scenario, but may have background knowledge on the timing between one firing and triggering another to fire. Here we could choose to generate increasingly large formulas and stop at some predefined size. Another approach is to determine this threshold based on whether the quality of causal relationships is continuing to increase using the associated εavg measures (as, assuming the scores are the same, there is generally a preference for simpler explanations). With limited data, we could begin by determining what types of formulas may be found in them (at satisfactory levels of significance) based on formula size, length of the time series, and the number of variables. However, efficient hypothesis generation in general remains an open problem. The initial hypotheses include time windows, but these need not be known a priori, and section 5.3 discusses in depth how to infer the timing of relationships in a way that converges toward the true timings. When timings are unknown, one can use this approach to test formulas with a set of associated time windows, iteratively perturbing these for the significant relationships to determine their true boundaries.
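To make the combinatorics concrete, the following minimal sketch (my own illustration; the variable names and windows are hypothetical, and the book does not prescribe an implementation) enumerates leads-to hypotheses for every ordered pair of distinct variables and every candidate window:

```python
from itertools import permutations

def generate_hypotheses(variables, windows):
    """Enumerate candidate hypotheses c ~>[r,s] e for every ordered
    pair of distinct variables and every candidate time window."""
    return [(c, e, r, s)
            for c, e in permutations(variables, 2)
            for (r, s) in windows]

# Hypothetical risk factors tested at a few coarse windows (in days):
variables = ["smoking", "hypertension", "heart_failure"]
windows = [(1, 7), (8, 14), (15, 30)]
hypotheses = generate_hypotheses(variables, windows)
print(len(hypotheses))  # 6 ordered pairs x 3 windows = 18 hypotheses
```

Domain constraints of the kind described above (mutually exclusive or unchangeable factors) would simply filter this list before testing.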

5.1.2. The set of data

Taking the set of hypotheses and determining which members are prima facie causes means testing first whether the relationships are satisfied in the data and second whether this probability is higher than the marginal probability of the effects. Assume that the data consist of a series of timepoints with measurements of variables or the occurrence of events at each. A subset of one dataset (which may have any number of timepoints and variables) might look like:

        t1   t2   t3
   a     1    0    1
   b     0    0    1
   c     1    1    0


Here there are observations of three variables at three timepoints. At each timepoint, each variable is measured and is either true or false, but in other cases we will need to choose how to handle non-observations. For example, when analyzing electronic health records, when there is no mention of diabetes in a patient's record we would generally assume that the patient is not diabetic even though this is never explicitly stated.1 On the other hand, questions about a patient's social history (e.g., alcohol use and smoking) are not always asked or answered, so we cannot infer that a patient with no mentions of past or present smoking has never smoked (leaving aside whether the answers given are truthful). Thus this distinction is a methodological choice that depends not only on the domain but also on the characteristics of the particular variable. In this book, the default assumption will be that non-occurrence is interpreted as false, as this is appropriate in the types of cases studied here. In the preceding example, the set of atomic propositions in the system is {a, b, c}. Here a proposition occurring (being true) is denoted by 1, and not occurring (being false) by 0. At t1, a and c are true. Another way of describing this is to say that the system is in a state where a and c are true.2 Each observation yields a state the system can occupy, and the temporal order of these observations shows possible transitions between states. We assume there is some underlying structure, which may be very complex and include thousands of states, and we are observing its behavior over time.
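As a concrete data model, a trace can be represented as a labeling function from timepoints to the propositions true at each; the following minimal sketch (my own, not code from the book) encodes the table above under the default assumption that non-occurrence means false:

```python
# labels[t] holds the atomic propositions true at timepoint t;
# any proposition absent from the set is treated as false.
labels = [
    {"a", "c"},  # t1: a and c are true
    {"c"},       # t2: only c is true
    {"a", "b"},  # t3: a and b are true
]

def holds(proposition, t):
    """True iff the proposition is in L(t), the labels of timepoint t."""
    return proposition in labels[t]

print(holds("a", 0), holds("b", 0))  # True False
```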

There are two main types of data. The first is a long sequence of times, which is one of many partial runs of the system. The second is a group of (usually shorter) observation sequences (also called traces in model checking). Cases like this can arise in medical domains, where there are sets of patients observed over time. While one long run may initially seem equivalent to many shorter runs, there are some important distinctions. To understand these, consider the structure shown in figure 5.1. Say we then observe the sequence P, Q, S, T, . . . , T, S, . . . and do not know this underlying model (as is normally the case). With only this one trace (beginning from the start state, s1), we will never see the transition from s1 to s3 (i.e., P, R) and will not know that it is possible. However, with a large set of short traces, as the size of this set increases (assuming no bias in the sample), we will get closer to observing the actual transition probabilities, so half the traces will begin with P, Q and the other half with P, R.

1 In practice this is more complicated, as true diagnoses are omitted for a variety of reasons.

2 This work assumes that variables are either discrete or have been binned so they relate to propositions that are true or false, in order to represent the relationships using PCTL. In other work, I have extended PCTL to continuous variables with the logic PCTLc and developed methods for assessing the impact of causes on continuous-valued effects (Kleinberg, 2011).

[Figure: a seven-state probabilistic structure with start state s1 (labeled P) and states s2 (Q), s3 (R), s4 (S), s5 (T), s6, and s7; s1 branches to s2 and s3 with probability 0.5 each, subsequent transitions have probabilities 0.6/0.4 and 0.4/0.6, and the final transitions have probability 1.]

Figure 5.1. Example of a probabilistic structure that might be observed.

In practice many systems, such as biological systems, have cyclic patterns that will repeat over a long trace. While the true start state may only be observed once, other states and transitions will be repeated multiple times. In such a system, we can then infer properties from one long trace. However, when the system is nonrecurrent, inference may require a set of traces sampled from a population. If there are properties related to the initial states of the system and they do not occur again, they cannot be inferred from a single trace. When one has control over the data collected, it is worth noting the differences between the two types.

5.1.3. Intuition behind procedure

Before discussing how to test logical formulas in data, let us discuss an example that illustrates the general idea. Say we are testing c ↝≥1,≤2≥p e for some p, and we observe the sequence c, a, cd, f, e, ac, eb. This can be represented as the following sequence of states and transitions:

{c} → {a} → {c, d} → {f} → {e} → {a, c} → {e, b}

Now we must determine whether the probability of the formula, given this observation, is at least p. Thus, the part of this sequence that we are interested in is:

c    c    e    c    e


Since the underlying structure is not known and will not be inferred here (consider that for a set of 1,000 stocks that only go up or down, there are 2^1000 possible unique states, and there may be multiple states with the same labels), we view all instances of c as possibly leading to e, regardless of what the current underlying state is (there may be no path, or there could even be a deterministic path). At this point, any time where c is true seems identical. That means that the previous sequence looks like the following set of paths:

c
c → e
c → e

and would seem to be generated by the following (partial) structure:

c → e

The probability of the leads-to formula being tested is the probability of the set of paths leading from c to e (within the specified time limit), which is defined by how frequently those paths from c are observed. Thus, with a trace of times labeled with c and e, and a formula c ↝≥1,≤2 e, the probability of this formula is the number of timepoints labeled with c where e also holds in at least one and at most two time units, divided by the number of timepoints labeled with c. In this example, the probability is estimated to be 2/3. The trace is then said to satisfy the formula c ↝≥1,≤2≥p e if p ≤ 2/3.

Alternatively, one could begin by inferring a model. Then, satisfaction of formulas and algorithms for model checking are exactly those of Hansson and Jonsson (1994). However, model inference is a difficult task and may include development of phenomenological models and abstractions of the true structure. When inferring a model, there is again the problem of a nonrecurrent start state. The probabilities inferred will also not necessarily correspond to the true probabilities of the structure. Further, it is unknown whether, as the number of observations tends toward infinity, the inferred model approaches the true structure. Since we are generally interested in a set of properties that is small relative to the size of the underlying structure, we focus on inferring the correctness of those properties. In other cases, for a relatively small structure, one may wish to begin by inferring a model.

5.1.4. Satisfaction of a formula

Testing the set of hypotheses for prima facie causality means testing whether (and with what probability) each relationship in a set of formulas is satisfied by the data, which may be either one long time series or a set of shorter ones. (For an introduction to the problem of runtime verification, see Leucker and Schallhart, 2009.) The approach to testing formulas in data developed in this book differs from others built on PCTL (Chan et al., 2005), as we must deal with (1) long traces that cannot be broken up into shorter traces based on knowledge of the start state (as it is not assumed that there is a model) and (2) short traces whose first observations vary and are not indicative of the start state. Thus, we cannot use the usual approach of computing the probability of a formula by finding the proportion of traces that satisfy it. Here frequencies refer to the number of timepoints where formulas hold, and for a set of traces, the frequencies are those in the combined set of timepoints.

The satisfaction and probability of PCTL formulas relative to a trace consisting of a sequence of ordered timepoints is as follows. Measurements may be made at every point in time, for some granularity of measurement, or there may be time indices of the measurements such that we can compute the temporal distance between pairs. Each timepoint has a set of propositions true at that timepoint. These may be events that either occur or do not, or whose truth value can otherwise be determined at every timepoint along the trace. As mentioned before, one may only observe positive instances and may have to determine whether not observing a value should be treated as it being false or as an instance where its truth cannot be determined. Each timepoint is initially labeled with the atomic propositions true at that time.3

From these propositions, one can construct more complex state and path formulas, which describe properties true at a particular instant (a state) or for a sequence of times (a path). In the following, t denotes a time instant in the observed trace. PCTL formulas are defined recursively, so I enumerate the types of formula constructions and how each is satisfied in a trace below.

3 In the event that a structure is given, this procedure is unnecessary and one may proceed with the algorithms of Hansson and Jonsson (1994), using the modified version of leads-to. Note that it is unlikely that we will begin with a structure, and attempting to infer one may introduce errors.

1. Each atomic proposition is a state formula.

An atomic proposition is true at t if it is in L(t) (the labels of t).

2. If g and h are state formulas, so are ¬g, g ∧ h, g ∨ h, and g → h.

If a timepoint t does not satisfy g, then ¬g is true at t. If both g and h are true at t, then g ∧ h is true at t. If g is true or h is true at t, then g ∨ h is true at t, and if ¬g is true at t or h is true at t, then g → h is true at t.

3. If f and g are state formulas, and 0 ≤ r ≤ s ≤ ∞ with r ≠ ∞, then f U≥r,≤s g and f W≥r,≤s g are path formulas.

The “until” path formula f U≥r,≤s g is true for a sequence of times beginning at time t if there is a time i, where r ≤ i ≤ s, such that g is true at t + i and ∀j : 0 ≤ j < i, f is true at t + j. The “unless” path formula f W≥r,≤s g is true for a sequence of times beginning at time t if either f U≥r,≤s g is true beginning at time t, or ∀j : 0 ≤ j ≤ s, f is true at t + j.

4. If f and g are state formulas, then f ↝≥r,≤s g, where 0 ≤ r ≤ s ≤ ∞ and r ≠ ∞, is a path formula.

Leads-to formulas must now be treated separately in order for their probabilities to be correctly calculated from data. Recall that leads-to was originally defined using F≥r,≤s e, where the associated probability of the leads-to is that of the F (“finally”) part of the formula. Thus, the calculated probability would be that of e occurring within the window r–s after any timepoint, while we actually want the probability of e in the window r–s after c. When checking formulas in a structure, there is no such difficulty, as the probabilities are calculated relative to particular states. However, when checking formulas in traces, we do not know which state a timepoint corresponds to and as a result can only calculate the probability relative to the trace. Thus, the formula f ↝≥r,≤s g is true for a sequence of times beginning at time t if f is true at t and there is a time i, where r ≤ i ≤ s, such that g is true at t + i. When r = 0, this reduces to the usual case of leads-to with no lower bound.

5. If f is a path formula and 0 ≤ p ≤ 1, [f]≥p and [f]>p are state formulas.


The probabilities here are in fact conditional probabilities. For [f U≥r,≤s g]≥p, the probability p′ associated with the data is estimated as the number of timepoints that begin paths satisfying f U≥r,≤s g divided by the number of timepoints labeled with f ∨ g. The formula [f U≥r,≤s g]≥p is satisfied by the trace or set of traces if p′ ≥ p. For a W formula, the probability is estimated the same way as for the preceding case, except that we consider the timepoints beginning paths satisfying f W≥r,≤s g (which includes paths where f holds for s time units, without g later holding). For a leads-to formula, h = f ↝≥r,≤s g, the probability is estimated as the number of timepoints that begin sequences of times labeled with h, divided by the number of timepoints labeled with f. Thus, the probability of f ↝≥r,≤s g is the probability, given that f is true, that g will be true between r and s units of time later.

Let us see that this formulation yields the desired result. Let p = P(g_t′ | f_t), where t + r ≤ t′ ≤ t + s. Dropping the time subscripts for the moment, by definition:

P(g|f) = P(g ∧ f) / P(f).

Since the probabilities come from frequencies of occurrence in the data, P(x) for some formula x is the number of timepoints labeled with x divided by the total number of timepoints. Using #x to denote the number of timepoints with some label x, and T to denote the total number of timepoints, we find:

P(g|f) = (#(g ∧ f)/T) / (#f/T) = #(g ∧ f) / #f.

The probability of such a formula is the number of states beginning paths satisfying the leads-to formula, divided by the number of states satisfying f.

Let us summarize the syntax and semantics of PCTL relative to traces using a minimal set of operators (all others can be defined in terms of these). With a set of boolean-valued atomic propositions a ∈ A, the PCTL trace syntax is:

State formulas: ϕ ::= true | a | ¬ϕ | ϕ1 ∧ ϕ2 | [ψ]≥p | [ψ]>p

Path formulas: ψ ::= ϕ1 U≥r,≤s ϕ2 | ϕ1 ↝≥r,≤s ϕ2


where 0 ≤ r ≤ s ≤ ∞, r ≠ ∞, and 0 ≤ p ≤ 1. W can be defined in terms of U, and ∨ and → can also be derived from the operators above.

Now we can recap the PCTL trace semantics. There is a labeling function L(t) that maps timepoints to the atomic propositions true at them. We represent the satisfaction of formula f by a timepoint t in trace T as t |=T f. We denote a path (sequence of timepoints) by π, and the subset of π beginning at time i by π^i. A particular time t_i in π is written as π[i]. Thus, we might have the sequence π = a, ab, b, ac, where a, b ∈ L(π[1]).

The probability of an until formula ϕ1 U≥r,≤s ϕ2 is:

|{t ∈ T : π^t |=T ϕ1 U≥r,≤s ϕ2}| / |{t ∈ T : t |=T ϕ1 ∨ ϕ2}|    (5.2)

and the probability of a leads-to formula ϕ1 ↝≥r,≤s ϕ2 is:

|{t ∈ T : π^t |=T ϕ1 ↝≥r,≤s ϕ2}| / |{t ∈ T : t |=T ϕ1}|    (5.3)

This gives the fraction of timepoints beginning paths satisfying the leads-to formula out of all of those satisfying ϕ1.
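As a sketch, equation (5.3) translates directly into code over the list-of-label-sets trace representation used earlier (my illustration, with inclusive window bounds assumed):

```python
def leadsto_probability(labels, c, e, r, s):
    """Estimate the probability of c ~>[r,s] e in a trace: the fraction
    of timepoints labeled with c that are followed by e within r..s
    time units (equation 5.3)."""
    c_times = [t for t, props in enumerate(labels) if c in props]
    if not c_times:
        return 0.0
    satisfied = sum(
        1 for t in c_times
        if any(t + i < len(labels) and e in labels[t + i]
               for i in range(r, s + 1))
    )
    return satisfied / len(c_times)

# The worked example from section 5.1.3: c, a, cd, f, e, ac, eb
trace = [{"c"}, {"a"}, {"c", "d"}, {"f"}, {"e"}, {"a", "c"}, {"e", "b"}]
print(leadsto_probability(trace, "c", "e", 1, 2))  # 2/3
```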

The satisfaction relation (|=T) is then:4

t |=T true    ∀t ∈ T
t |=T a    if a ∈ L(t)
t |=T ¬ϕ    if not t |=T ϕ
t |=T ϕ1 ∧ ϕ2    if t |=T ϕ1 and t |=T ϕ2
π |=T ϕ1 U≥r,≤s ϕ2    if there exists a j ∈ [r, s] such that π[j] |=T ϕ2 and π[i] |=T ϕ1, ∀i ∈ [0, j)
π |=T ϕ1 ↝≥r,≤s ϕ2    if π[0] |=T ϕ1 and there exists a j ∈ [r, s] such that π[j] |=T ϕ2
T |=T [ψ]≥p    if the probability of ψ in T is ≥ p
T |=T [ψ]>p    if the probability of ψ in T is > p

Prima facie causes are those in the set of hypotheses where the associated probability, calculated from the data using this approach, is greater than the probability of the effect alone (its marginal probability) and where the relationship satisfies our other conditions – that c has a nonzero probability and is prior to e.

4 In some works, such as that of Hansson and Jonsson (1994), the satisfaction relation for paths is distinguished from that of states using a symbol with three horizontal lines rather than two (as in |≡), but this is not universal and the symbol can be difficult to produce.

5.2. Testing for Causal Significance

The previous chapter introduced εavg, a new measure for causal significance. This section describes how it can be calculated from data and discusses methods for determining which values are both causally and statistically significant. This measure tells us, on average, how much of a difference a cause makes to the probability of an effect, holding fixed other possible explanations. It can be nonzero even in the absence of causal relationships, and in fact will be normally distributed in the absence of causal relationships when many hypotheses are tested. Thresholds must be chosen in all methods (such as with conditional independence tests), but in the interest of making this book self-contained, one statistically rigorous strategy for choosing them is discussed (though there are multiple methods for doing this).

5.2.1. Computing εavg

Let us recall the definition for εavg. With X being the set of prima facie causes of e, we assess the impact of a particular c on a particular e using:

εavg(c, e) = ( Σx∈X\c εx(c, e) ) / |X \ c|,    (5.4)

where:

εx(c, e) = P(e|c ∧ x) − P(e|¬c ∧ x).    (5.5)

Here we are interested in c and x, where the relationships are represented by c ↝≥s,≤t e and x ↝≥s′,≤t′ e. While the temporal subscripts have been omitted for ease, c ∧ x refers to c and x being true such that e could be caused in the appropriate intervals. P(e|c ∧ x) is defined as P(e_A|c_B ∧ x_C), where this is the probability of e occurring at any such A, and the time subscripts are not specific times but rather denote the constraints on their relationship. That is,

B + s ≤ A ≤ B + t, and
C + s′ ≤ A ≤ C + t′.


This probability is calculated with respect to a set of data (or, when given, a probabilistic structure) using the satisfaction rules described earlier for determining when c and e are true, and using the same approach for frequency-based probabilities calculated from data. If part of the observed sequence is c at time 0 and x at time 15, where s = s′ = 20 and t = t′ = 40, then e must occur in the overlap of these windows, shown in the following in solid gray.

[Timeline: c occurs at time 0, giving it the window [20, 40]; x occurs at time 15, giving it the window [35, 55]; their overlap [35, 40] is shown in solid gray.]

This will be considered an instance of (c ∧ x) ↝ e if there is an observation e_A such that 20 ≤ A ≤ 40 and 35 ≤ A ≤ 55. If e were true at A = 10, then only c would have been true before e, while if e were true at A = 50, then c's time window to cause e would be over.
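The window arithmetic in this example can be checked mechanically; a small sketch (illustrative only, with inclusive bounds):

```python
def effect_interval(tc, tx, r, s, r2, s2):
    """Interval in which e must fall for both c (at time tc, window
    [r, s]) and x (at time tx, window [r2, s2]) to be candidate causes,
    or None if the windows do not intersect."""
    lo = max(tc + r, tx + r2)
    hi = min(tc + s, tx + s2)
    return (lo, hi) if lo <= hi else None

# c at time 0 and x at time 15, each with window [20, 40] after it:
print(effect_interval(0, 15, 20, 40, 20, 40))  # (35, 40)
```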

The probability calculation is exactly as described for leads-to formulas in the previous section. Dropping the time subscripts for the moment, we have:

P(e|c ∧ x) = #(e ∧ c ∧ x) / #(c ∧ x),    (5.6)

and

P(e|¬c ∧ x) = #(e ∧ ¬c ∧ x) / #(¬c ∧ x),    (5.7)

where these refer to the number of paths where e holds after c ∧ x (or ¬c ∧ x) holds, in the appropriate time window, divided by the number of paths where c ∧ x (or ¬c ∧ x) holds. These paths are subsequences of the traces (observations), and there may be multiple occurrences of c ∧ x ∧ e in each trace.

There are a variety of methods that can be used to efficiently calculate this causal impact for a set of relationships. The calculation of each individual εavg is independent of all the other calculations, so this can be easily parallelized. Let us look at one straightforward method for calculating εx(c, e) relative to a trace T (shown in algorithm 5.1), where c and x have corresponding relationships c ↝≥r,≤s e and x ↝≥r′,≤s′ e. Assume that all times satisfying c, x, and e are already labeled with these formulas. Then, c ∧ x refers to c and x holding such that either could be a cause of e.


Algorithm 5.1 εx(c, e)
1. cT = {t : c ∈ labels(t)}
   xT = {t : x ∈ labels(t)}
   eT = {t : e ∈ labels(t)}
2. W = W′ = ∅
3. E = E′ = 0
   {Get times satisfying c ∧ x}
4. for all t ∈ cT do
5.   if ∃t′ ∈ xT : [t + r..t + s] ∩ [t′ + r′..t′ + s′] ≠ ∅ then
6.     W = W ∪ {(t, t′)}
7.   end if
8. end for
   {Get times satisfying ¬c ∧ x}
9. for all t′ ∈ xT do
10.  if ∄t ∈ cT : [t + r..t + s] ∩ [t′ + r′..t′ + s′] ≠ ∅ then
11.    W′ = W′ ∪ {t′}
12.  end if
13. end for
   {Get times satisfying c ∧ x ∧ e}
14. for all (t, t′) ∈ W do
15.  if ∃t′′ ∈ eT : t′′ ∈ [t + r..t + s] ∩ [t′ + r′..t′ + s′] then
16.    E = E + 1
17.  end if
18. end for
   {Get times satisfying ¬c ∧ x ∧ e}
19. for all t′ ∈ W′ do
20.  if ∃t′′ ∈ eT : t′′ ∈ [t′ + r′..t′ + s′] then
21.    E′ = E′ + 1
22.  end if
23. end for
24. return E/|W| − E′/|W′|

The primary task of the algorithm is to identify instances of c ∧ x that fit these criteria, and then to identify instances of e that fall in the overlap of the time windows from these instances. Similarly, for ¬c ∧ x, we find instances of x where there is no overlapping window with an instance of c.
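A Python rendering of algorithm 5.1, as a sketch (it assumes the sets of labeled timepoints have already been computed and treats window bounds as inclusive; the behavior when a conditional probability is undefined is my own policy choice):

```python
def epsilon_x(cT, xT, eT, r, s, r2, s2):
    """eps_x(c, e) from labeled times, following algorithm 5.1.
    cT, xT, eT: sets of timepoints labeled with c, x, and e;
    [r, s] and [r2, s2]: windows for c ~> e and x ~> e."""
    def overlap(t, t2):
        lo, hi = max(t + r, t2 + r2), min(t + s, t2 + s2)
        return (lo, hi) if lo <= hi else None

    # Times satisfying c AND x: overlapping cause windows
    W = [(t, t2) for t in cT for t2 in xT if overlap(t, t2)]
    # Times satisfying (NOT c) AND x: x with no overlapping c window
    W2 = [t2 for t2 in xT if not any(overlap(t, t2) for t in cT)]
    if not W or not W2:
        return 0.0  # a conditional probability is undefined (policy choice)

    # Count instances where e falls inside the relevant window
    E = sum(1 for t, t2 in W
            if any(overlap(t, t2)[0] <= te <= overlap(t, t2)[1]
                   for te in eT))
    E2 = sum(1 for t2 in W2
             if any(t2 + r2 <= te <= t2 + s2 for te in eT))
    return E / len(W) - E2 / len(W2)

# Tiny illustration with hypothetical times:
print(epsilon_x(cT={0, 10}, xT={2, 50}, eT={5, 12},
                r=1, s=6, r2=1, s2=6))  # 1.0
```

Averaging epsilon_x over all other prima facie causes x of e then yields εavg per equation (5.4).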

In summary, we begin with a set of prima facie causes (identified by generating or otherwise specifying some set of potential relationships and determining which of these satisfy the conditions for prima facie causality relative to the given data), and then compute the average causal significance for each of these, yielding a set of εavg's.

5.2.2. Choice of ε

Once we have calculated the average causal significance for each relationship, we must determine a threshold at which a relationship is causally significant, in a statistically significant way. All methods require a choice of thresholds, such as determining when two variables should be considered conditionally independent (as one cannot expect to find exact independence), or whether including lagged values of one variable significantly improves prediction of another in the case of Granger causality. One could potentially determine appropriate thresholds through simulation (creating data with a structure similar to that of the real data of interest), or by examining the hypotheses manually. However, when testing many hypotheses simultaneously, we can use the properties of εavg to do this in a more rigorous way. In the absence of causal relationships, these values will be normally distributed, so we can use the large number of tests to our advantage by making one more assumption: that even if there are many genuine causes in the set tested, these are still relatively few compared with the total number of hypotheses tested. Then, we can treat the data as coming from a mixture of two distributions, one of the noncausal relationships (with normally distributed significance scores) and a smaller number of causal relationships with scores distributed according to some other function.

All thresholds have tradeoffs: if ε is too low, too many causes will be called significant (making false discoveries), while if ε is too high we will call too many causes insignificant (leading to false negatives). In this work, I have concentrated on controlling the first case, so that while some causes may be missed, we will be confident in those identified. The priorities of users may vary, though, and some may wish to focus on identifying the full set of causes (at the expense of some of those identified being spurious). Many statistical methods exist for both purposes. This work focuses on controlling the false discovery rate (FDR), which is the number of false discoveries as a proportion of all discoveries. Here the FDR is the fraction of non-causes called significant as a proportion of all causes deemed significant. The key point is that when doing many tests, it is likely that seemingly significant results will be observed by chance alone (just as when flipping a coin many times in a row, some long runs of heads or tails are to be expected). To control for this, we generally compute some statistic (such as a p-value) for each hypothesis, and compare these against the distribution expected under the null hypothesis (here, that would be that a relationship is not causal). For a particular value of this statistic, we accept a hypothesis (rejecting the null hypothesis) if this value is significant when compared with the null hypothesis after accounting for the number of tests being conducted. To define the distribution of the null hypothesis, here we would need to know how the εavg's would be distributed if there were no genuine causal relationships. One could assume that these would follow a normal distribution with mean zero and standard deviation one (as this is often the case), but as an alternative, methods using empirical nulls allow one to estimate the null directly from the data.

For a more technical introduction to multiple hypothesis testing and false discovery rate control, see appendix A. It is assumed that the reader is familiar with the goals and procedures of these methods, so only the empirical null is discussed in this section.

Calculating the fdr

We assume the results mostly fit a null model when there are no causal relationships (shown experimentally to be the case in chapter 7), with deviations from this distribution indicating true causal relationships. In some cases, the εavg's follow a standard normal distribution, with the z-values calculated from these εavg's having a mean of zero and a standard deviation of one. These εavg's (even with no true causal relationships in the system) are not all equal to zero due to correlations from hidden common causes and other factors influencing the distributions, such as noise. The distribution of εavg tends toward a normal due to the large number of hypotheses tested.

When there are causal relationships in the system, then there are two classes of εavg's: those corresponding to insignificant causes (which may be spurious or too small to detect) and those corresponding to significant causes (which may be genuine or just so causes), with the observed distribution being a mixture of these classes. Since the insignificant class is assumed to be much larger than the significant class, and normally distributed, we can identify significant causes by finding these deviations from the normal distribution. This can be observed for even a seemingly small number of variables: when testing pairwise relationships between, say, 20 variables that are only true or false, that means 400 to 1,600 hypotheses are being tested. Depending on how the statistical significance is evaluated and which thresholds are chosen, some of these insignificant causes may appear statistically significant. However, it is shown experimentally in chapter 7 that in datasets where there is no embedded causality, none is inferred using the approach developed in this book.

One method for accounting for the large number of tests is by using local false discovery rate (fdr) calculations. Instead of computing p-values for each test and then determining where in the tail the cutoff should be after correcting for the many tests conducted, as is done when controlling the false discovery rate (FDR), this method instead uses z-values and their densities to identify whether, for a particular value of z, the results are statistically significant after taking into account the many tests (Efron, 2004). When using tail-area false discovery rates (the rate when rejecting the null for all hypotheses with z greater than some threshold), the values close to the threshold do not actually have the same likelihood of being false discoveries as do those further out into the tail. Two methodological choices are discussed here: how to calculate the false discovery rate, and how to choose the null hypothesis. The local false discovery rate method can be used with an empirical null (inferred from the data) or the standard theoretical null distribution. Similarly, the method of finding the null from the data can also be used in conjunction with standard tail-area FDR (Efron, 2007). (See Efron, 2010, for an in-depth introduction to large-scale testing.)

When testing N causal relationships, each can be considered as a hypothesis test, where we can accept the null hypothesis, that the relationship is not causal, or can reject it. For each of the εavg values, we calculate its z-value (also called the standard score), which is the number of standard deviations a result is from the mean. The z-value for a particular εavg is defined as z = (εavg − μ)/σ, where μ is the mean and σ the standard deviation of the set of εavg values. The N results correspond to two classes, those where the null hypothesis is true (there is no causal relationship) and those where it is false (and the relationship is causal). When using this method for determining which relationships are statistically significant, it is assumed that the proportion of non-null cases, which are referred to as significant (or "interesting"), is small relative to N. A common assumption is that these are, say, 10% of N. This value is simply a convention and it is not required that this is the case. Even with a seemingly densely connected set of variables such as a gene network, this still may hold: with approximately 3,000 genes, we would be testing 3000² relationships, and ten percent of this is still 900,000 relationships. Then p0 and p1 are the prior probabilities of a case (here a causal hypothesis) being in the insignificant and significant classes respectively. These correspond to acceptance and rejection of the null hypothesis with prior probabilities p0 and p1 = 1 − p0. The probabilities are distributed according to density functions f0(z) and f1(z). When using the usual theoretical null hypothesis, f0(z) is the standard N(0, 1) density. We do not need to know f1(z), and because this class is much smaller than the null class, it will not perturb the observed mixture of the two distributions significantly. The observed z-values are defined by the mixture of the null and non-null distributions:

f(z) = p0 f0(z) + p1 f1(z),    (5.8)

and the posterior probability of a case being insignificant given its z-value z is:

P(null|z) = p0 f0(z)/f(z).    (5.9)

The local false discovery rate is:

fdr(z) ≡ f0(z)/f(z).    (5.10)

In this formulation by Efron (2004), the p0 factor is not estimated, so this gives an upper bound on fdr(z). Assuming that p0 is large (close to 1), this simplification does not lead to massive overestimation of fdr(z). One may instead choose to estimate p0 and thus include it in the FDR calculation, making fdr(z) = P(null|z). To estimate f0(z), most methods work by locating and fitting to the central peak of the data. Since it is assumed that the underlying distribution is normal, one need only find its mean and standard deviation (which amounts to finding the center of the peak and its width). To find f(z), one may use a spline estimation method, fitting the results. The procedure after testing which relationships meet the criteria for prima facie causality and calculating their εavg values is then:

1. Calculate z-values from the εavg's.
2. Estimate f(z) from the observed z-values.
3. Define the null density f0(z) from either the data or using the theoretical null.
4. Calculate fdr(z) using equation (5.10).

For each prima facie cause where the z-value associated with its εavg has fdr(z) less than a small threshold, such as 0.01, we label it as a just so, or significant, cause. With a threshold of 0.01, we expect 1% of such causes to be insignificant, despite their test scores, but now the threshold can be chosen based on how acceptable a false discovery is rather than an arbitrary value at which a relationship is significant.
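The four-step procedure can be sketched with standard scientific-Python tools; this illustration (mine, not the book's code) uses a Gaussian kernel density estimate for f(z) in place of the spline fit mentioned above, together with the theoretical N(0, 1) null:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

def just_so_causes(eps_avg, threshold=0.01):
    """Flag prima facie causes whose local fdr falls below threshold.
    eps_avg: one eps_avg score per prima facie cause."""
    eps_avg = np.asarray(eps_avg, dtype=float)
    # Step 1: z-values (standard scores) of the eps_avg's
    z = (eps_avg - eps_avg.mean()) / eps_avg.std()
    # Step 2: estimate the mixture density f(z) from the observed z's
    f = gaussian_kde(z)(z)
    # Step 3: theoretical null f0(z), the standard normal density
    f0 = norm.pdf(z)
    # Step 4: local fdr(z) = f0(z)/f(z) (an upper bound when p0 ~ 1)
    return f0 / f < threshold

# Mostly-null scores plus a few genuinely large ones (simulated):
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 0.05, 1000), [0.6, 0.65, 0.7]])
print(just_so_causes(scores).sum())  # flags only the three outliers
```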


5.3. Inference with Unknown Times

I have so far discussed how to find whether a relationship is causally significant, but this only involved accepting or rejecting this hypothesis for a relationship between two factors with a particular window of time between them. However, we will not usually know these timings prior to inference and want to ensure both that the correct timings will be found and that an incorrect initial choice will not lead to incorrect inferences. Instead of only accepting or rejecting hypotheses, we want to refine them. What is needed is a framework for automatically finding the timing of relationships as part of the inference process, unconstrained by the initial set of hypotheses tested. We want to potentially begin by searching for relationships between lab values changing 1–2 weeks before congestive heart failure and ultimately make inferences such as "extremely high AST predicts heart failure in 3–10 days." The goal is to make some initial suggestions of the general range of possible times, with the inference procedure taking this as a starting point, inferring the timing of relationships in a way that is not constrained by this initial proposal. The solution is that we can generate a set of candidate windows that cover the whole time series (with multiple observation sequences) or a section of it relating to the times of interest (maybe constrained by some background information on what relationships are possible) and alter these during the inference process to recover the actual windows.5 We initially proceed as described in the previous section, finding the relationships that are significant within each candidate window. We then iterate over the significant relationships, perturbing the timing associated with each and attempting to maximize the associated significance scores. Iterating over only the significant relationships allows the procedure to remain computationally efficient and enables inference of temporal relationships without background knowledge. We first examine the intuition behind the method along with the primary assumptions made before discussing the algorithm. Its correctness and complexity are discussed in section 5.4.

5 With data from ICU patients who are monitored at the timescale of seconds, it is unlikely that one second of data is informative about another second a few weeks later, so it is likely not necessary to generate such hypotheses. One may also use windows that increase in size as temporal distance from the effect increases.

5.3.1. Intuition and assumptions

Before examining the approach in depth, let us discuss some basic observations about εavg. If we can identify a significant relationship between some c and e with a time window that intersects the correct one, this can be used to find the actual time window. There are three main ways a window can intersect the true one: (1) it can contain it, (2) it can be shifted (and contain at least half the window), or (3) it can be contained by the window. If there is a window overlapping less than half the true window, there must be an adjacent window that covers the rest or is fully contained in the true window.

[Diagram: three candidate windows relative to the actual window: (1) containing it, (2) shifted relative to it, (3) contained within it.]

We use changes in εavg to assess new windows created by modifying the original one. The main operations are contracting, shifting, and expanding the window. Each of these will (barring issues such as missing data, discussed below) increase the εavg associated with the cases above, respectively. Remember that εavg is defined by:

P(e|c ∧ x) − P(e|¬c ∧ x)    (5.11)

averaged over all of the x's in X (the set of prima facie causes of e). (εavg is discussed in depth in section 4.2.2 and its calculation from data is discussed in section 5.2.1.) However, there are time windows associated with the relationships and, as noted before, instances of c ∧ x are those where c and x occur such that their windows overlap and either could cause e. Then, c ∧ x ∧ e is when e occurs in that time window. So if we have the following case, where [r, s] is wider than the true window (shown in solid grey):

[Diagram: c with candidate window [r, s] extending beyond the true window (solid grey); x with its window [r′, s′].]

then we are considering too many cases as being c ∧ x. We should only consider instances where c occurs and e follows in the grey window (and this overlaps the window for x), but instead this overly wide window considers more occurrences of c and x (where e is unlikely to occur) as being instances of this relationship. Since e will not occur, this increases the denominator of P(e|c ∧ x), which is defined as #(c ∧ x ∧ e)/#(c ∧ x), with no corresponding increase in the numerator, lowering this value and decreasing the difference. If the window's width is correct but its start and end points are shifted, then we exclude the same number of correct times as we include incorrect times. Finally, when the window is too narrow, we make the mistake of characterizing instances of c ∧ x as instances of ¬c ∧ x. Both overly small and overly large windows will reduce the value of εavg for a significant relationship. This is why the windows do not simply expand without bound. If one instead tried to maximize the conditional probability P(e|c), then the probability would continue to increase as the window size increased and there would be no reason for it to contract. By evaluating the relationships using the average difference in conditional probability, an overly large window reduces the value of this difference by providing more chances for a cause to be spurious.6

Now let us look at the assumptions that this rests on.

Assumption 1. A significant relationship with associated time window w = [ws, we] will be found to be significant in at least one time window that intersects w.

We assume that relationships will be identified during the initial testing across many windows. This depends primarily on the windows being at a similar timescale as the true relationships (not orders of magnitude wider) and the observations of the system being representative of the true distribution (not systematically missing). Note that it is not necessary for a cause to raise the probability of its effect uniformly throughout the time window, though.7 In some cases the probability may be raised uniformly throughout the window, while in other cases the distribution may look Gaussian, with the probability being significantly increased near a central peak, and less so by the ends of the time period. This is not an impediment to inference, assuming we sample frequently enough from the peak. If the windows are orders of magnitude wider, though, the relationship will be so diluted that we cannot expect to identify it. If the observations are systematically missing or nonuniform, and the probability is non-uniform, we may also be unable to identify the true relationships.

6 A similar method could also be applied to discretization of continuous-valued variables such as lab measurements, by allowing us to propose an initial range based on standardly used normal values and refine this after inference. For example, we may have some initial idea of how to partition values of glucose based on prior knowledge about usual low/normal/high ranges, but we may not know that for a particular result a value ≥ 140 is required, even though in general a value above 120 is considered high. Relationships would be represented using a logic that allows expression of constraints on continuous-valued variables (Kleinberg, 2011). Instead of discretizing the variables prior to inference, constraints on their values would be included as part of the logical formulas representing relationships. For example, we may have:

[(c ≥ v) ∧ (c ≤ w)] ↝≥r,≤s e.    (5.12)

Then, v and w would be altered in each step instead of r and s.

7 For example, a cause c may raise the probability of an effect e at every time in 8–14 days (i.e., at each time in this window P(e|c) > P(e)). However, if the cause perhaps acts through two different mechanisms, there could be peaks around 10 and 12 days so that P(e|c10) > P(e|c9) > P(e), for example. The probability is raised at each time, but the increase is larger at particular times.

While this approach may seem similar to that of dynamic Bayesian networks (DBNs), there are some fundamental differences. With DBNs, a user proposes a range of times to search over, and then relationships are found between pairs of variables at each of the lags in that set of times. This assumes that if one searches over the range [2, 20] and one of the relationships is "a causes b in 3–8 time lags," we will find edges between a at time t and b at times t + 3, t + 4, . . . , t + 8. This is unlikely given the noisiness and sparseness of actual data and computational issues (since the full set of graphs is not explored), and one may only infer that a leads to b in 3, 6, and 7 time units. The key distinction is that DBNs assume that an edge between a and b will be inferred for every time in the actual window, while we assume only that we will find a relationship between a and b in at least one time range intersecting the true window (and from that time range can find the true window).

We make one more assumption that pertains primarily to cases of large-scale testing, when a significant number of hypotheses are explored. If the overall set is not so large, this is not needed.

Assumption 2. The significant relationships are a small proportion of the overall set tested.

This assumption is for the sake of computational feasibility. Since we are only refining the relationships found to be ε-significant, the procedure is done on a much smaller group of relationships than the full set tested, which we assume is quite large. If we can identify significant relationships in at least one window that intersects their true timing, and this set of significant relationships is not enormous, then we can limit the number of hypotheses that need to be refined. If instead we conducted the iterative procedure on the full set of hypotheses tested (significant and insignificant), it would be quite computationally intensive.

5.3.2. Algorithms

Let us now look at the details of the algorithms for inferring the timing of relationships without prior knowledge. For each significant relationship, we iteratively reevaluate its εavg with a new set of windows created by expanding, shrinking, and shifting the current window. At each step, the maximum value of εavg is compared against the previous value. If the score improves, the highest scoring window is the new starting point and iteration continues. Otherwise, if there is no improvement in the score or no further steps are possible (the window's lower bound must be no less than one, and the upper bound cannot be greater than the length of the time series), the current window is returned.

Where 1 ≤ ws ≤ we ≤ T, we_i = ws_{i+1}, and T is the length of the time series, the procedure is as follows.

1. W ← [[ws1, we1], [ws2, we2], . . . , [wsn, wen]].
   A ← set of atomic propositions or formulas.
2. Generate hypotheses c ↝≥ws,≤we e for each pair c, e ∈ A and [ws, we] ∈ W.
3. For each [ws, we] ∈ W, test the associated hypotheses, finding those that are ε-significant.
4. For each ε-significant relationship c ↝≥ws,≤we e:
   (c, e, ws, we) ← refine-ε(c, e, ws, we).
5. Where Sw is the set of all ε-significant relationships in w ∈ W, and S = Sw1 ∪ Sw2 ∪ . . . ∪ Swn, recompute εavg for all hypotheses with X = X ∪ S.
6. Reassess the statistical significance of all hypotheses using the newly calculated εavg.

Steps 1–3 are as described before, with the only difference being that this is repeated for a set of adjacent windows, and the significance of relationships within each window is evaluated separately.

In step 4, we iterate over the candidate windows (w ∈ W) and the relationships found to be significant. We perturb and refine the window associated with each relationship using the refine-ε procedure of algorithm 5.2 until no alterations improve the associated εavg for the relationship. Note that the significant relationships are assumed to be a small fraction (usually around 1%) of the set tested, so this is done on a much smaller set of relationships than for the initial testing. For each relationship, using algorithm 5.2 we repeatedly recompute the significance the cause makes to its effect as we explore new potential timings (expanding, shrinking, or shifting the initially proposed window). This algorithm expands each window by half (evenly on each side), splits it evenly down the middle, and shifts it by one time unit left and right. This is a heuristic, and many others may be investigated for efficiently exploring the search space. With sufficient computing power one may choose to expand/contract by one time unit on each end of the candidate window (in separate steps). This is still not intractable if one caches results intelligently. While in theory this procedure has the same worst-case behavior as the heuristic proposed, in practice the heuristic will converge more quickly, though it may only approach the fixpoint. However, in many cases variables are not sampled at every time unit or at a scale representative of the true timing, and measurements may be sparse. Thus, an approach that makes smaller steps can get stuck in local maxima in these cases, and the heuristic is more suitable.

Algorithm 5.2 refine-ε(c, e, ws, we)
1. εnew ← εavg(c, e, ws, we)
2. repeat
3.   t ← we − ws
4.   εmax ← εnew
5.   εnew ← max of:
       εavg(c, e, ws − ⌊t/4⌋, we + ⌊t/4⌋)
       εavg(c, e, ws, ws + ⌊t/2⌋)
       εavg(c, e, ws + ⌊t/2⌋, we)
       εavg(c, e, ws − 1, we − 1)
       εavg(c, e, ws + 1, we + 1)
6.   update ws, we with the values from the new max
7. until εnew ≤ εmax
8. return (c, e, ws, we)
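A sketch of the refine-ε procedure in Python (eps_avg is assumed to be a function scoring a relationship for a given window, e.g., built from the epsilon_x routine above; the floor choices follow the pseudocode as reconstructed here):

```python
import math

def refine_epsilon(eps_avg, ws, we, t_max):
    """Hill-climb over candidate windows [ws, we] using the
    expand/split/shift heuristic of algorithm 5.2.
    eps_avg(ws, we) -> score; t_max is the time series length."""
    best = eps_avg(ws, we)
    while True:
        t = we - ws
        candidates = [
            (ws - math.floor(t / 4), we + math.floor(t / 4)),  # expand
            (ws, ws + math.floor(t / 2)),                      # first half
            (ws + math.floor(t / 2), we),                      # second half
            (ws - 1, we - 1),                                  # shift left
            (ws + 1, we + 1),                                  # shift right
        ]
        # Keep the window within the bounds of the time series
        scored = [(eps_avg(a, b), a, b) for a, b in candidates
                  if 1 <= a < b <= t_max]
        if not scored:
            return ws, we
        new, a, b = max(scored)
        if new <= best:
            return ws, we  # no improvement: current window is returned
        best, ws, we = new, a, b

# Toy score peaked at the true window [20, 40]:
score = lambda a, b: -abs(a - 20) - abs(b - 40)
print(refine_epsilon(score, 25, 45, 100))  # (20, 40)
```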

Finally, in step 5, the significance scores are recalculated in light of the inferred relationships. Remember that the significance is the average difference a cause makes to its effect holding fixed (pairwise) all other causes of the effect. This final step was not necessary before since the background was unchanged – some things compared against may turn out to be spurious, but a single set of relationships would be tested, and thus each cause would have been compared against each other thing that could have made it spurious. Now we test relationships separately, so that if A causes B and B causes C (both in 20–40 time units), and we use windows 20–40 and 40–60, A may seem to cause C when looking at the later window alone, since the relationship between B and C is not fully taken into account. Thus, we now ensure that B will be included in the background when assessing A's significance during the final recalculation of εavg. This step also allows us to identify repeated observations of cycles as such.


5.4. Correctness and Complexity

This section shows that the procedures for verifying formulas over traces and inferring the timing of relationships are correct and analyzes their computational complexity.

5.4.1. Correctness

The correctness of methods for labeling timepoints with non-probabilistic state formulas is trivial, so this section focuses on path formulas, state formulas formed by adding probability constraints to path formulas, and methods for inferring the timing of relationships.

Correctness of procedure for checking until formulas in traces

Theorem 5.4.1. The satisfaction by a path beginning at timepoint t of the until formula f U≥r,≤s g, where 0 ≤ r ≤ s < ∞, is given by:

satU(t, r, s) =
    true                         if (g ∈ labels(t)) ∧ (r ≤ 0),
    false                        if (f ∉ labels(t)) ∨ (t = |T|) ∨ (s = 0),
    satU(t + 1, r − 1, s − 1)    otherwise.                        (5.13)

Proof. Assume trace T, where the times t ∈ T satisfying f and those satisfying g have been labeled. Then, it can be shown by induction that any time t will be correctly labeled by equation (5.13). By definition, a timepoint t begins a sequence of times satisfying f U≥r,≤s g if there is some r ≤ i ≤ s such that g is true at t + i and ∀ j : 0 ≤ j < i, f is true at t + j.

Base cases:

satU(t, r, 0) =
    true     if g ∈ labels(t),
    false    otherwise.                                            (5.14)

satU(|T|, r, s) =
    true     if (g ∈ labels(|T|)) ∧ (r ≤ 0),
    false    otherwise.                                            (5.15)

In the first base case, since we have already stipulated that r ≤ s, we know that if s = 0, then r ≤ 0. In the second base case, however, we must add the condition on r, to ensure it is less than or equal to zero. If s = 0, the only way the formula can be satisfied is if t is labeled with g. Similarly, if t = |T|, then this is the last timepoint in the trace and t can only satisfy the formula if it is labeled with g.

Inductive step: Assume we have satU(n, r, s). Then, for s > 0 and n + 1 ≤ |T|:

satU(n − 1, r + 1, s + 1) =
    true              if (g ∈ labels(n − 1)) ∧ (r ≤ 0),
    false             if f ∉ labels(n − 1),
    satU(n, r, s)     otherwise.                                   (5.16)

Timepoint n − 1 satisfies the formula if it satisfies g, or if it satisfies f and the next timepoint, n, satisfies the formula. However, we assumed that we can correctly label timepoints with f and g, as well as with satU(n, r, s).

Corollary. The satisfaction by a sequence of times beginning at timepoint t of the until formula f U≥r,≤∞ g, where r ≠ ∞, is given by satU(t, r, |T|).

Corollary. The probability of the formula f U≥r,≤s g in a trace of times T, where 0 ≤ r ≤ s ≤ ∞ and r ≠ ∞, is given by:

|{t ∈ T : satU(t, r, s)}| / |{t′ ∈ T : (f ∨ g) ∈ labels(t′)}|.     (5.17)
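Equation (5.13) and the estimate (5.17) translate almost directly into code. In the following sketch, labels is assumed to be a list, indexed by timepoint, of sets of satisfied subformulas, with 'f' and 'g' as placeholder names and T the index of the last timepoint; since the recursion depth is bounded by s, an iterative version would be preferable for large windows.

    def sat_u(t: int, r: int, s: int, labels: list, T: int) -> bool:
        """Equation (5.13): does the path starting at t satisfy f U[>=r,<=s] g?"""
        if 'g' in labels[t] and r <= 0:
            return True
        if 'f' not in labels[t] or t == T or s == 0:
            return False
        return sat_u(t + 1, r - 1, s - 1, labels, T)

    def prob_until(labels: list, T: int, r: int, s: int) -> float:
        """Equation (5.17): satisfying starts over timepoints labeled f or g."""
        starts = sum(sat_u(t, r, s, labels, T) for t in range(T + 1))
        support = sum(bool(labels[t] & {'f', 'g'}) for t in range(T + 1))
        return starts / support if support else 0.0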

Correctness of procedure for checking unless formulas in traces

Claim. The satisfaction by a timepoint t of the unless formula f W≥r,≤s g, where 0 ≤ r ≤ s < ∞, is given by:

satW(t, r, s) =
    true                         if (g ∈ labels(t) ∧ r ≤ 0)
                                    ∨ (f ∈ labels(t) ∧ s = 0),
    false                        if (f ∉ labels(t)) ∨ (t = |T|) ∨ (s = 0),
    satW(t + 1, r − 1, s − 1)    otherwise.                        (5.18)


Proof. Assume trace T, where the times t ∈ T satisfying f and those satisfying g have been labeled. Then, we will show by induction that any time t will be correctly labeled by equation (5.18). By definition, a timepoint t begins a path satisfying f W≥r,≤s g if there is some r ≤ i ≤ s such that g is true at t + i and ∀ j : 0 ≤ j < i, f is true at t + j, or if ∀ j : 0 ≤ j ≤ s, f is true at t + j.

Base cases:

satW(t, r, 0) =
    true     if (g ∈ labels(t)) ∨ (f ∈ labels(t)),
    false    otherwise.                                            (5.19)

satW(|T|, r, s) =
    true     if (g ∈ labels(|T|) ∧ r ≤ 0) ∨ (f ∈ labels(|T|) ∧ s = 0),
    false    otherwise.                                            (5.20)

If s = 0, the only way the formula can be satisfied is if t is labeled with either f or g. Similarly, if t = |T|, then this is the last timepoint in the trace and t can only satisfy the formula if it is labeled with f or g in the appropriate time window.

Inductive step: Assume we have satW(n, r, s). Then, for s > 0 and n + 1 ≤ |T|:

satW(n − 1, r + 1, s + 1) =
    true              if (g ∈ labels(n − 1)) ∧ (r ≤ 0),
    false             if f ∉ labels(n − 1),
    satW(n, r, s)     otherwise.                                   (5.21)

Timepoint n − 1 satisfies the formula if it satisfies g, or if it satisfies f and the next timepoint, n, satisfies the formula. Note that we assume s > 0, and thus the formula cannot be satisfied by f alone being true. However, we assumed that we can correctly label timepoints with f and g, as well as with satW(n, r, s).

Corollary. The satisfaction by a path beginning at timepoint t of the unless formula f W≥r,≤∞ g, where r ≠ ∞, is given by satW(t, r, |T|).


Corollary. The probability of the formula f W≥r,≤s g in a trace of times T, where 0 ≤ r ≤ s ≤ ∞ and r ≠ ∞, is given by:

|{t ∈ T : satW(t, r, s)}| / |{t′ ∈ T : (f ∨ g) ∈ labels(t′)}|.     (5.22)
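Under the same assumed labeling scheme, the unless check differs from sat_u above only in the extra way of succeeding given by (5.18): f holding as the window closes.

    def sat_w(t: int, r: int, s: int, labels: list, T: int) -> bool:
        """Equation (5.18): does the path starting at t satisfy f W[>=r,<=s] g?"""
        if ('g' in labels[t] and r <= 0) or ('f' in labels[t] and s == 0):
            return True
        if 'f' not in labels[t] or t == T or s == 0:
            return False
        return sat_w(t + 1, r - 1, s - 1, labels, T)

The probability estimate (5.22) is then computed exactly as in prob_until, with sat_w in place of sat_u.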

Correctness of procedure for checking leads-to formulas in traces

Claim. The satisfaction by a path beginning at timepoint t of the leads-to formula f ↝≥r,≤s g is given by:

satL(t, r, s) =
    true     if f ∈ labels(t) ∧ (true U≥t+r,≤t+s g) ∈ labels(t),
    false    otherwise.                                            (5.23)

Proof. Assume that there is a trace T, where the times t ∈ T satisfying f and those satisfying g have been labeled. We have already shown that we can correctly label the times that begin sequences where until formulas are true, and thus we can correctly label whether a state t satisfies true U≥t+r,≤t+s g. We have also assumed that states satisfying f are already correctly labeled with f, and that we can label timepoints with conjunctions, so we can label states with the conjunction of these formulas. By the definition of leads-to – that g holds in the window [r, s] after f – we can correctly label times with such formulas.

Corollary. The satisfaction by a path beginning at timepoint t of the leads-to formula f ↝≥r,≤∞ g or f ↝≥r,<∞ g, where r ≠ ∞, is given by satL(t, r, |T|).

Corollary. The probability of the formula f ↝≥r,≤s g in a trace of times T, where 0 ≤ r ≤ s ≤ ∞ and r ≠ ∞, is given by:

|{t ∈ T : satL(t, r, s)}| / |{t′ ∈ T : f ∈ labels(t′)}|.           (5.24)

This case is similar to the until and unless cases, with the exception that the denominator consists of the set of states satisfying f, instead of f ∨ g, since the probability is interpreted as the conditional probability of g in the window r–s after f.
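Continuing with the same hypothetical labeling scheme, leads-to combines the f label with a scan of the window for g, and the estimate (5.24) conditions only on the f-timepoints:

    def sat_l(t: int, r: int, s: int, labels: list, T: int) -> bool:
        """Equation (5.23): f holds at t and g holds at some t + i, r <= i <= s."""
        if 'f' not in labels[t]:
            return False
        return any('g' in labels[t + i] for i in range(r, min(s, T - t) + 1))

    def prob_leads_to(labels: list, T: int, r: int, s: int) -> float:
        """Equation (5.24): conditional probability of g in [r, s] after f."""
        f_times = [t for t in range(T + 1) if 'f' in labels[t]]
        if not f_times:
            return 0.0
        return sum(sat_l(t, r, s, labels, T) for t in f_times) / len(f_times)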


[Figure 5.2. Varying distributions of the probability of an effect, e, given a cause, c. Panels (a) and (b) plot P(e|c) against the marginal P(e) over a time window bounded by r and s.]

Correctness of procedure for inferring time windows

I now discuss why the procedure for inferring the time windows associated with relationships will converge toward the true times. Note that the false discoveries are unchanged from the case when the timings are known, since these are primarily due to chance or unmeasured common causes, so we focus on showing that the algorithm converges toward the actual timing. Assumption 1 states that a significant relationship will be found to be significant in at least one window intersecting the true window. Implicit is the assumption that if a cause leads to an effect at two disjoint timescales, the window does not overlap both. Note, however, that this is an unlikely case, and that a case where the probability of the effect is increased in a bimodal manner after the cause (where there are two peaks and a dip in between that is still greater than the marginal probability of the effect) can be handled by the approach without issue. The approach will not fit to only one peak, as doing so would significantly decrease the value of εavg (as there would be many seeming cases of ¬c ∧ x leading to e). The only difficult case is where there are two disconnected time windows, where at some point in between the probability is less than or equal to the marginal probability of the effect. These two cases can be illustrated as shown in figure 5.2. When the testing uses small windows we may be able to find each window individually, while a large initial window may lump both time periods together. However, it is unlikely that such a case can occur without there being another factor that determines the timing (for example, a cause may operate through different mechanisms). Some scenarios, such as repeated observations of a cycle, may appear to have this structure, though they do not, and they can be correctly handled by the approach.

Finally, this method converges to the most causally significant timing given the data, that is, the timing with the most evidence in the data. Due to the granularity of measurement and the nature of the relationship, some cases may have a sharp transition between when it holds and when it does not (such as the efficacy of an emergency contraceptive). In other, more challenging, cases we are still able to find the relative timescale of the relationship, but the window boundaries may be less exact. While smoking may have the capacity to be carcinogenic immediately, the usual timing is likely much longer, and we will be able to infer this from the data, determining the significance of the various possible timings. This information is important when planning interventions and developing methods for verifying hypotheses experimentally. The next chapter discusses how to use these relationships to explain particular (token) occurrences, and does so in a way that allows uncertainty about timing information. This is partly because inferring a particular timing does not imply that it is impossible for the relationship to occur outside of it.

We now show that if assumption 1 is met, we can find the true timing of the relationship. As noted, the algorithm described earlier in this chapter is a heuristic that approaches the exhaustive procedure. We show the correctness of that method, and then show experimentally in chapter 7 that the heuristic described here behaves similarly (though this is one of many possible approaches, and future work should involve investigating other strategies).

Claim. Where refine-ε defines εnew as in (5.25), the true relationship is c ↝≥ws,≤we e, and [r, s] ∩ [ws, we] ≠ ∅, the function refine-ε(c, e, r, s) converges to (c, e, ws, we).

εnew = max {
    εavg(c, e, (r + 1), s),
    εavg(c, e, (r − 1), s),
    εavg(c, e, r, (s + 1)),
    εavg(c, e, r, (s − 1))
}                                                                  (5.25)

Proof. We begin by showing that [ws, we] is a fixpoint of the algorithm. Once the function reaches the window [ws, we], no change can increase εavg, and thus refine-ε(c, e, ws, we) = (c, e, ws, we). Implicit is the assumption that the relationships are stationary (in that their timing does not change during the observation period) and that the time series is ergodic.

At each iteration, εavg is recalculated with each of r ± 1 and s ± 1. If r = ws, lowering it adds incorrect instances of c when calculating P(e|c ∧ x), lowering this value (which is defined as #(c ∧ x ∧ e)/#(c ∧ x)). This is because it is assumed that more instances of c will lead to e, but outside the true timing that will not happen, so the denominator increases with no corresponding increase in the numerator. Further, this removes cases of ¬c that would not have led to e from the right side of the difference, increasing that value (and further reducing the difference). Similarly, increasing r shifts instances of c ∧ x that result in e to being cases of ¬c ∧ x, decreasing the probability difference. The case is exactly the same for the endpoint s = we. Thus, since iterations continue only while perturbing the window results in an increase in εavg, and no modification can increase this value for [r, s] = [ws, we], refine-ε(c, e, ws, we) = (c, e, ws, we), and it is a fixpoint of this function.

Next, it is necessary to show that this is in fact the only fixpoint for (c, e, r, s). Assume that there are two fixpoints, (c, e, ws, we) and (c, e, ws′, we′), where [ws, we] ≠ [ws′, we′]. As previously assumed, there is a significant relationship between c and e, and both windows intersect [ws, we]. If [ws′, we′] is a fixpoint, it means the associated εavg can no longer be improved. However, if ws > ws′ then we can increase εavg by increasing ws′, since ws is the actual time and if ws > ws′ instances of ¬c ∧ x are misclassified as c ∧ x. If ws′ > ws then εavg can be increased by decreasing ws′, since we are missing positive instances of c, which instead look like cases of e occurring with ¬c. The argument where we differs from we′ is identical. Thus, since the εavg associated with (c, e, ws′, we′) can be improved, the algorithm will continue to iterate, and this is not a fixpoint. Then the only fixpoint for (c, e, r, s) is (c, e, ws, we).

5.4.2. Complexity

We now analyze the time complexity of each of the algorithms and procedures discussed. Note that each procedure (aside from the model checking ones) assumes that all timepoints have already been labeled with the formulas of interest. The complexity of that task is not included in that of the other procedures, since it is assumed to be performed once, with the results saved for use in the later tasks.

Complexity of model checking over traces

The complexity of labeling the times along a trace, T, with a proposition is proportional to the length of the time series, which is also denoted by T, making this O(T). Assuming states are labeled with f and g, labeling the sequence with each of ¬f, f ∨ g, f ∧ g, and f → g is also of time complexity O(T).
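With the list-of-sets labeling assumed in the earlier sketches, for example, each connective is one linear pass over the trace (the label strings are placeholders):

    def label_connectives(labels: list) -> None:
        """Add labels for NOT f, f OR g, f AND g, and f -> g in one O(T) pass,
        assuming each timepoint's set already records f and g."""
        for lab in labels:
            has_f, has_g = 'f' in lab, 'g' in lab
            if not has_f:
                lab.add('not f')
            if has_f or has_g:
                lab.add('f or g')
            if has_f and has_g:
                lab.add('f and g')
            if (not has_f) or has_g:
                lab.add('f -> g')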


Next, we have until, unless, and leads-to path formulas, and finally the calculation of the probabilities of these formulas. For an until or unless formula, such as f U≥r,≤s g, the worst case for a single timepoint is when r = 0, and involves checking the subsequent s timepoints. For s ≠ ∞, the worst-case complexity for the entire sequence is O(Ts), while for s = ∞ it is O(T²). However, these bounds naively assume all timepoints are labeled with f, and thus that all t ∈ T are candidates for starting such a path. Instead of T, they should use T′, the number of states labeled with f (which may be significantly fewer than the total number of timepoints). For a leads-to formula, f ↝≥r,≤s g, the complexity of labeling a single timepoint is O(|s − r|) where s ≠ ∞, and O(T) where s = ∞. As in the until/unless case, assuming all timepoints are labeled with f, the complexity for a trace is O(T × |s − r|) or O(T²), though in practice most times will not be labeled with f, and thus these bounds will be significantly reduced.

Once states have been labeled as the start of path formulas or with the appropriate state formulas, calculating the probability of a state formula is O(T).

For any formula f, the worst-case complexity of testing f in a trace T, assuming that the subformulas of f have not already been tested, is thus O(|f| × T²), where |f| is the length of the formula and T is the length of the trace.

Complexity of testing prima facie causality

For a single relationship, f ↝≥r,≤s g, again assuming the times satisfying f and g are labeled as such, we simply calculate the probability of this formula along the trace (O(T)) and compare it with the probability of F≤∞ g (also O(T)). Thus, for M relationships the complexity is O(MT). With N possible causes of N effects, this is O(N²T).
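A sketch of this test under the same labeling assumptions ('c' and 'e' are placeholder names): the conditional probability of e in the window after c is compared with the probability of F≤∞ e, read here as the fraction of timepoints from which e eventually occurs.

    def prima_facie(labels: list, T: int, r: int, s: int) -> bool:
        """Is c a prima facie cause of e with window [r, s]?"""
        c_times = [t for t in range(T + 1) if 'c' in labels[t]]
        if not c_times:
            return False
        # P(c leads to e in [r, s]): fraction of c-timepoints whose window hits e.
        hits = sum(any('e' in labels[t + i] for i in range(r, min(s, T - t) + 1))
                   for t in c_times)
        p_cond = hits / len(c_times)
        # P(F<=inf e): one backward pass marks timepoints from which e occurs.
        flags, eventually = [False] * (T + 1), False
        for t in range(T, -1, -1):
            eventually = eventually or ('e' in labels[t])
            flags[t] = eventually
        p_marg = sum(flags) / (T + 1)
        return p_cond > p_marg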

Complexity of computing εavg

Assuming timepoints are already labeled with c, e, and x, the computation of εx(c, e) has complexity O(T). Thus, in the worst case, the computation of one εavg(c, e) is O(NT), where there are N causes and all N causes are prima facie causes of an effect e. To compute the significance of each cause of e, this is repeated N times, so the complexity is O(N²T). Finally, repeating this for all M effects, the complexity is O(MN²T). In the case where the causes and effects are the same (say, when testing relationships between pairs of genes), N = M and the worst-case complexity is O(N³T).


Complexity of inferring time windows

Calculating εavg for a single cause of an effect e with a single associated time window is O(NT), where T is the length of the time series and N, the number of variables, is an upper bound on the number of potential causes of e. At each iteration, we recompute εavg a fixed number of times, and there are at most T iterations. In each of the W windows there are at most N effects with M significant causes each. The worst case for the whole procedure is thus O(WMN²T²). Note that M is assumed to be a small fraction of N (∼1%) and W is generally much smaller than N, so the main component is N²T². One factor of T is an upper bound on the number of iterations, so in practice this is significantly lower (as shown in the experimental section). Recall that the complexity of the initial testing procedure where windows are known is O(N³T), so the refinement procedure is comparable and usually much faster. This is shown experimentally in chapter 7.
