TRANSCRIPT
Causality Basics
(Based on Peters et al., 2017, "Elements of Causal Inference: Foundations and Learning Algorithms")
Chang Liu
2020.12.16
Content
• Causality and Structural Causal Models.
• Causal Inference.
• Causal Discovery.
Causality
• Formal definition of causality: "two variables have a causal relation if intervening on the cause may change the effect, but not vice versa" [Pearl, 2009; Peters et al., 2017].
• Intervention: change the value of a variable by leveraging mechanisms and changing variables outside the considered system.
• Example: for the Altitude and average Temperature of a city, A → T.
  • Running a huge heater (intv. T) does not lower A.
  • Raising the city with a huge elevator (intv. A) lowers T.
• Causality contains more information than observation (static data).
  • E.g., both p(A) p(T|A) (i.e., A → T) and p(T) p(A|T) (i.e., T → A) can describe the observed relation p(A, T).
Causality
The Independent Mechanism Principle.
• Principle 2.1 (Independent mechanisms): The causal generative process of a system's variables is composed of autonomous modules that do not inform or influence each other. In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other conditional distributions.
• It is possible to conduct a localized intervention: we can change one mechanism without affecting the others.
Structural Causal Models
For describing causal relations.
• Definition 6.2 (Structural causal models): An SCM ℭ ≔ (𝐒, P_𝐍) consists of a collection 𝐒 of d (structural) assignments
  X_j ≔ f_j(pa_j, N_j), j = 1, …, d,
  where pa_j are called the parents of X_j, and a joint distribution P_𝐍 = P_{N_1, …, N_d} over the noise variables, which we require to be jointly independent.
• The joint independence requirement on 𝐍 comes from the Independent Mechanism Principle.
• Causal graph 𝒢: the Directed Acyclic Graph constructed from the assignments (an edge from each parent to its child).
• Proposition 6.3 (Entailed distributions): An SCM ℭ defines a unique distribution over the variables 𝐗 = (X_1, …, X_d) such that X_j = f_j(pa_j, N_j), j = 1, …, d, in distribution, called the entailed distribution P_𝐗^ℭ and sometimes written P_𝐗.
• Implications of an SCM: the entailed observational distribution, intervention distributions, and counterfactuals (detailed in the following; a minimal sampling sketch is given below).
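To make Definition 6.2 concrete, here is a minimal Python sketch (not from the slides; the assignments and noise choices are invented for illustration) of a three-variable SCM and a sampler for its entailed distribution per Proposition 6.3:

```python
# Hypothetical SCM:  X1 := N1,  X2 := 2*X1 + N2,  X3 := X1*X2 + N3,
# with jointly independent standard-normal noises (Definition 6.2).
import numpy as np

rng = np.random.default_rng(0)

def sample_entailed(n):
    """Sample the entailed distribution P_X (Proposition 6.3) by evaluating
    the assignments in a topological order of the causal graph."""
    n1, n2, n3 = rng.normal(size=(3, n))  # jointly independent noises
    x1 = n1                               # X1 := f1(N1)
    x2 = 2.0 * x1 + n2                    # X2 := f2(X1, N2)
    x3 = x1 * x2 + n3                     # X3 := f3(X1, X2, N3)
    return x1, x2, x3

x1, x2, x3 = sample_entailed(10_000)
```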
Structural Causal Models
Describing interventions using SCMs.
• "What if X was set to x?" "What if patients took the treatment?"
• Definition 6.8 (Intervention distribution): Consider an SCM ℭ ≔ (𝐒, P_𝐍) and its entailed distribution P_𝐗^ℭ. We replace one (or several) of the structural assignments to obtain a new SCM ℭ̃; assume that we replace the assignment for X_k by
  X_k ≔ f̃(p̃a_k, Ñ_k),
  where all the new and old noises are required to be jointly independent.
• Atomic/ideal, structural [Eberhardt and Scheines, 2007] / surgical [Pearl, 2009] / independent/deterministic [Korb et al., 2004] intervention: when f̃(p̃a_k, Ñ_k) puts a point mass on a value a; simply written P_𝐗^{ℭ; do(X_k ≔ a)}.
• Soft intervention: intervene with a distribution (replace the noise distribution instead of fixing a value).
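Continuing the hypothetical sketch above, an atomic intervention per Definition 6.8 simply replaces one assignment and leaves the other mechanisms untouched:

```python
# Atomic intervention do(X2 := a) on the hypothetical SCM above: the
# assignment for X2 is replaced by a point mass at a; by modularity, all
# other mechanisms are kept as they are.
import numpy as np

rng = np.random.default_rng(0)

def sample_do_x2(n, a=4.0):
    n1, n3 = rng.normal(size=(2, n))  # the old noise N2 is discarded
    x1 = n1                           # upstream mechanism unchanged
    x2 = np.full(n, a)                # X2 := a (point mass)
    x3 = x1 * x2 + n3                 # downstream mechanism unchanged
    return x1, x2, x3
```

A soft intervention would instead draw x2 from a new noise distribution Ñ2.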
Structural Causal Models
Describing interventions using SCMs.
• Definition 6.12 (Total causal effect): Given an SCM ℭ, there is a total causal effect from X to Y if X and Y are dependent in P^{ℭ; do(X ≔ Ñ_X)} for some random variable Ñ_X.
• Proposition 6.13 (Total causal effects): The following statements are equivalent, given an SCM ℭ:
  1. There is a total causal effect from X to Y.
  2. There are x¹ and x² such that P_Y^{ℭ; do(X ≔ x¹)} ≠ P_Y^{ℭ; do(X ≔ x²)}.
  3. There is an x¹ such that P_Y^{ℭ; do(X ≔ x¹)} ≠ P_Y^ℭ.
  4. X and Y are dependent in P_{X,Y}^{ℭ; do(X ≔ Ñ_X)} for any Ñ_X with full support.
• Proposition 6.14 (Graphical criteria for total causal effects): Consider an SCM ℭ with corresponding graph 𝒢.
  • If there is no directed path from X to Y, then there is no total causal effect.
  • Sometimes there is a directed path but no total causal effect (e.g., the effects along different paths may cancel out).
Structural Causal Models
Describing counterfactuals using SCMs.
• "What if I had taken the treatment?" "What would the outcome have been had I acted differently in that moment/situation?"
• Intervention in a given situation/context, or for a given individual/unit: an intervention combined with conditioning.
• Definition 6.17 (Counterfactuals): Consider an SCM ℭ ≔ (𝐒, P_𝐍). Given some observations 𝐱, define a counterfactual SCM by replacing the distribution of the noise variables: ℭ_{𝐗=𝐱} ≔ (𝐒, P_𝐍^{ℭ|𝐗=𝐱}), where P_𝐍^{ℭ|𝐗=𝐱} ≔ P_{𝐍|𝐗=𝐱} (the noises need not be jointly independent anymore).
• "The probability that Y would have been 7 (say) had X been set to 2, for observation 𝐱": P^{ℭ_{𝐗=𝐱}; do(X ≔ 2)}(Y = 7).
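For additive noise and a complete observation, the noise posterior in Definition 6.17 is a point mass, so a counterfactual can be computed by abduction, action, and prediction. A minimal sketch for a hypothetical SCM X ≔ N_X, Y ≔ 2X + N_Y:

```python
# Counterfactual "what would Y have been, had X been x_new" in the
# hypothetical SCM  X := N_X,  Y := 2*X + N_Y.
def counterfactual_y(x_obs, y_obs, x_new):
    n_y = y_obs - 2.0 * x_obs  # abduction: P(N_Y | X=x, Y=y) is a point mass here
    return 2.0 * x_new + n_y   # action do(X := x_new) + prediction with that noise

print(counterfactual_y(x_obs=1.0, y_obs=3.5, x_new=2.0))  # 5.5
```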
Structural Causal Models
Observational implications (SCM as a BayesNet); causality and correlation.
• Definition 6.1 (d-separation): A path p is d-separated by a set of nodes 𝐙 iff p either has an emitter ("→ n →" or "← n →") that is in 𝐙, or a collider ("→ n ←") such that neither it nor any of its descendants is in 𝐙.
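d-separation can be checked mechanically. A quick sketch with networkx on an invented graph (the function is named is_d_separator in networkx >= 3.3 and d_separated in older versions):

```python
import networkx as nx

# X <- Z -> Y (fork at Z) and X -> W <- Y (collider at W)
g = nx.DiGraph([("Z", "X"), ("Z", "Y"), ("X", "W"), ("Y", "W")])
print(nx.is_d_separator(g, {"X"}, {"Y"}, {"Z"}))       # True: Z blocks the fork
print(nx.is_d_separator(g, {"X"}, {"Y"}, {"Z", "W"}))  # False: W opens the collider
```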
• Theorem 6.22 (Equivalence of Markov properties): When P_𝐗 has a density, the following are equivalent: (i) the global Markov property (d-separation in 𝒢 implies conditional independence), (ii) the local Markov property (each variable is independent of its non-descendants given its parents), and (iii) the Markov factorization p(𝐱) = ∏_j p(x_j | x_{pa_j}).
• (iii) is what we mean by using a DAG/BayesNet to define a joint distribution (i.e., the distribution entailed by the SCM). Its equivalence to (i) means that we can read off conditional independence assertions of this joint distribution from the graph.
Structural Causal Models
Observational implications (SCM as a BayesNet); causality and correlation.
• Confounding bias.
  • X and Y are correlated if Z is not given.
  • E.g., X = chocolate consumption, Y = number of Nobel prizes, Z = development status.
• Simpson's paradox (a numeric sketch is given below).
  • T: treatment. R: recovery. Z: socio-economic status.
  • Is the treatment effective for recovery? p(R=1 | T=1) > p(R=1 | T=0), but p^{do(T ≔ 1)}(R=1) < p^{do(T ≔ 0)}(R=1).
  • p(r | t) = Σ_z p(r | z, t) p(z | t): conditioning on T = 1 infers a good Z and thus a higher probability of recovery.
[Figures: the confounder graph X ← Z → Y, and the treatment graph T ← Z → R with T → R.]
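A hypothetical numeric instance of Simpson's paradox (all probabilities invented): the treatment is harmful within each stratum of Z, yet looks beneficial observationally because well-off patients take it more often.

```python
# Z confounds T and R: Z -> T, Z -> R, T -> R.
p_z = {0: 0.5, 1: 0.5}              # socio-economic status
p_t1_given_z = {0: 0.1, 1: 0.9}     # p(T=1 | Z=z)
p_r1 = {(1, 0): 0.35, (0, 0): 0.40, # p(R=1 | T=t, Z=z): treatment is
        (1, 1): 0.85, (0, 1): 0.90} # slightly harmful in every stratum

def p_t_given_z(t, z):
    return p_t1_given_z[z] if t == 1 else 1 - p_t1_given_z[z]

def obs(t):   # observational p(R=1 | T=t): conditioning tilts the Z posterior
    p_t = sum(p_z[z] * p_t_given_z(t, z) for z in (0, 1))
    return sum(p_r1[(t, z)] * p_z[z] * p_t_given_z(t, z) for z in (0, 1)) / p_t

def do(t):    # interventional p^{do(T:=t)}(R=1) via the adjustment formula
    return sum(p_r1[(t, z)] * p_z[z] for z in (0, 1))

print(obs(1), obs(0))  # 0.80 > 0.45: looks helpful
print(do(1), do(0))    # 0.60 < 0.65: actually hurts
```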
Structural Causal Models
Observational implications (SCM as a BayesNet); causality and correlation.
• Selection bias.
  • H and F are correlated if secretly/unconsciously conditioned on S.
• Berkson's paradox.
  • "Why are handsome men such jerks?" [Ellenberg, 2014]
  • H = handsome, F = friendly, S = in a relationship.
  • Single women often date men with S = 0; such men tend to be either not H or not F (a spurious correlation).
[Figure: the collider graph H → S ← F.]
Structural Causal Models
Levels of model according to prediction ability:
• Probabilistic/observational model (machine learning models);
• + interventional distributions → interventional model (causal graphical models);
• + counterfactual statements → counterfactual model (SCMs).
• An interventional model has the utility of a probabilistic/observational model: the same observed p(X, Y) can be described by any of the three graphs below,
  p(x, y) = p(x) p(y|x) = ∫ p(z) p(x|z) p(y|z) dz = ∫ p(z) p(x|z) p(y|x, z) dz,
  but the observational distribution alone cannot determine the interventional distribution, which differs across the three graphs:
  p^{do(X ≔ x)}(y) = p(y|x), ∫ p(y|z) p(z) dz, or ∫ p(y|x, z) p(z) dz, respectively.
  [Figure: the graphs X → Y; X ← Z → Y; and X ← Z → Y together with X → Y.]
• Example 6.19: Two SCMs that induce the same graph, observational distributions, and intervention distributions can still entail different counterfactual statements ("probabilistically and interventionally equivalent" but not "counterfactually equivalent").
Causal Inference
• Given an SCM, infer the distribution/outcome of a variable under an intervention, or under a unit-level intervention (counterfactual).
• Estimate causal effects:
  • Average treatment effect: E^{do(T ≔ 1)}[Y] − E^{do(T ≔ 0)}[Y].
  • Individual treatment effect: E^{do(T ≔ 1)}[Y | x] − E^{do(T ≔ 0)}[Y | x].
• Estimating the interventional distribution is the key task (a counterfactual is the conditional in an intervened SCM).
Causal Inference
• Truncated factorization [Pearl, 1993] / G-computation formula [Robins, 1986] / manipulation theorem [Spirtes et al., 2000]:
  p^{ℭ; do(X_k ≔ Ñ_k)}(x_1, …, x_d) = p̃(x_k) ∏_{j≠k} p^ℭ(x_j | x_{pa_j}),
  where p̃ is the density of Ñ_k.
• Almost evident from the interventional SCM.
• Relies on the autonomy/modularity of causality (the principle of independent mechanisms): one causal mechanism does not influence or inform other causal mechanisms.
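A worked sketch on a hypothetical binary chain X1 → X2 → X3 (parameters invented): under do(X2 ≔ a), the factor p(x2 | x1) is replaced by a point mass while all other factors are kept.

```python
from itertools import product

p_x1 = {0: 0.7, 1: 0.3}
p_x2_given_x1 = {(0, 0): 0.9, (1, 0): 0.1,  # p(x2 | x1), keyed by (x2, x1):
                 (0, 1): 0.4, (1, 1): 0.6}  # the factor that gets replaced
p_x3_given_x2 = {(0, 0): 0.8, (1, 0): 0.2,  # p(x3 | x2), keyed by (x3, x2)
                 (0, 1): 0.3, (1, 1): 0.7}

def p_do_x2(a):
    """Interventional joint p^{do(X2:=a)}(x1, x2, x3): the factor for X2 is
    truncated to a point mass; the factors for X1 and X3 are untouched."""
    return {(x1, x2, x3): p_x1[x1] * (1.0 if x2 == a else 0.0) * p_x3_given_x2[(x3, x2)]
            for x1, x2, x3 in product((0, 1), repeat=3)}

print(sum(p_do_x2(1).values()))  # 1.0: still a normalized distribution
```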
Causal Inference
• Definition 6.38 (Valid adjustment set, v.a.s.): Consider an SCM ℭ over nodes 𝐗, and let X, Y ∈ 𝐗 with Y ∉ pa_X. We call a set 𝐙 ⊆ 𝐗 \ {X, Y} a valid adjustment set for (X, Y) if
  p^{ℭ; do(X ≔ x)}(y) = Σ_𝐳 p^ℭ(y | x, 𝐳) p^ℭ(𝐳).
• Importance: we can then use observations of the variables in a v.a.s. to estimate an intervention distribution (a simulation sketch is given below).
• By the previous result (truncated factorization), pa_X is a v.a.s.
• Are there any other v.a.s.'s?
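A minimal simulation of Definition 6.38 on an invented linear SCM Z ≔ N_Z, X ≔ Z + N_X, Y ≔ 2X + 3Z + N_Y: {Z} = pa_X is a valid adjustment set, and adjusting for it recovers the causal coefficient 2, while the unadjusted regression is confounded.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                  # Z := N_Z
x = z + rng.normal(size=n)              # X := Z + N_X
y = 2 * x + 3 * z + rng.normal(size=n)  # Y := 2*X + 3*Z + N_Y

naive = np.linalg.lstsq(np.c_[x, np.ones(n)], y, rcond=None)[0][0]
adjusted = np.linalg.lstsq(np.c_[x, z, np.ones(n)], y, rcond=None)[0][0]
print(naive, adjusted)  # ~3.5 (confounded) vs ~2.0 (the causal effect)
```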
Causal Inference
• Proposition 6.41 (Valid adjustment sets): The following 𝐙's are v.a.s.'s for (X, Y), where X, Y ∈ 𝐗 and Y ∉ pa_X:
  1. "Parent adjustment": 𝐙 ≔ pa_X;
  2. "Backdoor criterion" (Pearl's admissible/sufficient set): any 𝐙 ⊆ 𝐗 \ {X, Y} such that
     • 𝐙 contains no descendant of X, and
     • 𝐙 blocks all paths from X to Y that enter X through the backdoor (X ← ⋯);
  3. "Toward necessity" (characterizing all v.a.s.'s) [Shpitser et al., 2010; Perkovic et al., 2015]: any 𝐙 ⊆ 𝐗 \ {X, Y} such that
     • 𝐙 contains no node (X excluded) on a directed path from X to Y, nor any of their descendants, and
     • 𝐙 blocks all non-directed paths from X to Y.
Causal Inference
Do-calculus: an alternative way to estimate an interventional distribution from observations.
• The three rules of do-calculus: consider disjoint subsets 𝐗, 𝐘, 𝐙, 𝐖 of the nodes of a DAG 𝒢.
  1. "Insertion/deletion of observations": p^{ℭ; do(𝐗 ≔ 𝐱)}(𝐲 | 𝐳, 𝐰) = p^{ℭ; do(𝐗 ≔ 𝐱)}(𝐲 | 𝐰), if 𝐘 and 𝐙 are d-separated by 𝐗, 𝐖 in a graph where incoming edges into 𝐗 have been removed.
  2. "Action/observation exchange": p^{ℭ; do(𝐗 ≔ 𝐱, 𝐙 ≔ 𝐳)}(𝐲 | 𝐰) = p^{ℭ; do(𝐗 ≔ 𝐱)}(𝐲 | 𝐳, 𝐰), if 𝐘 and 𝐙 are d-separated by 𝐗, 𝐖 in a graph where incoming edges into 𝐗 and outgoing edges from 𝐙 have been removed.
  3. "Insertion/deletion of actions": p^{ℭ; do(𝐗 ≔ 𝐱, 𝐙 ≔ 𝐳)}(𝐲 | 𝐰) = p^{ℭ; do(𝐗 ≔ 𝐱)}(𝐲 | 𝐰), if 𝐘 and 𝐙 are d-separated by 𝐗, 𝐖 in a graph where incoming edges into 𝐗 and into 𝐙(𝐖) have been removed. Here, 𝐙(𝐖) is the subset of nodes in 𝐙 that are not ancestors of any node in 𝐖 in the graph obtained from 𝒢 after removing all edges into 𝐗.
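As a worked instance (not from the slides), the backdoor adjustment for the graph Z → X → Y with Z → Y follows from rules 2 and 3 alone:

```latex
p^{do(X := x)}(y)
  = \textstyle\sum_z p^{do(X := x)}(y \mid z)\, p^{do(X := x)}(z)
  = \textstyle\sum_z p(y \mid x, z)\, p^{do(X := x)}(z)
    % rule 2: Y and X are d-separated by Z once edges out of X are removed
  = \textstyle\sum_z p(y \mid x, z)\, p(z)
    % rule 3: Z and X are d-separated once edges into X are removed
```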
Causal Inference
Do-calculus: an alternative way to estimate an interventional distribution from observations.
• Theorem 6.45 (Do-calculus):
  1. The three rules are complete: all identifiable interventional distributions can be computed by an iterative application of these three rules [Huang and Valtorta, 2006; Shpitser and Pearl, 2006] (more general than valid adjustment sets).
  2. There is an algorithm [Tian, 2002] that is guaranteed [Huang and Valtorta, 2006; Shpitser and Pearl, 2006] to find all identifiable interventional distributions.
  3. There is a necessary and sufficient graphical criterion for the identifiability of interventional distributions [Shpitser and Pearl, 2006, Corollary 3], based on so-called hedges [see also Huang and Valtorta, 2006].
Causal Inference
Do-calculus: an alternative way to estimate an interventional distribution from observations.
• Example 6.46 (Front-door adjustment): X → Z → Y, with a hidden common cause U of X and Y.
  • There is no v.a.s. if we do not observe U.
  • But p^{ℭ; do(X ≔ x)}(y) is identifiable using the do-calculus if p^ℭ(z, x) > 0:
    p^{ℭ; do(X ≔ x)}(y) = Σ_z p^ℭ(z | x) Σ_{x′} p^ℭ(y | x′, z) p^ℭ(x′).
  • Observing Z in addition to X and Y nicely reveals causal information: causal relations can be explored by observing the "channel" Z that carries the "signal" from X to Y.
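A numeric check of the front-door formula (all parameters invented) on a binary SCM U → X, U → Y, X → Z → Y with U hidden: the adjustment, computed from observational quantities only, matches the ground-truth interventional probability computed from the full SCM.

```python
from itertools import product

B = (0, 1)
p_u = {0: 0.4, 1: 0.6}
p_x = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}   # p(x | u)
p_z = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}   # p(z | x)
p_y1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # p(y=1 | z, u)

joint = {(u, x, z): p_u[u] * p_x[(x, u)] * p_z[(z, x)] for u, x, z in product(B, B, B)}

def do_truth(xs):  # ground truth from the SCM: sum_{u,z} p(u) p(z|xs) p(y=1|z,u)
    return sum(p_u[u] * p_z[(z, xs)] * p_y1[(z, u)] for u in B for z in B)

def frontdoor(xs):  # uses only observational quantities (U never appears)
    p_obs_x = {x: sum(joint[(u, x, z)] for u in B for z in B) for x in B}
    def p_z_x(z, x): return sum(joint[(u, x, z)] for u in B) / p_obs_x[x]
    def p_y1_xz(x, z):
        mass = sum(joint[(u, x, z)] for u in B)
        return sum(joint[(u, x, z)] * p_y1[(z, u)] for u in B) / mass
    return sum(p_z_x(z, xs) * sum(p_y1_xz(xp, z) * p_obs_x[xp] for xp in B) for z in B)

print(do_truth(1), frontdoor(1))  # both ~0.692: the formula identifies p^{do}(y)
```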
Causal Discovery
Learn a causal model (the causal graph of an SCM) from data.
• In most cases, only observational data is available: observational causal discovery.
Two mainstream methods:
• Conditional independence (CI) / constraint-based methods (e.g., the PC algorithm).
• Directional methods (e.g., additive noise models).
Causal Discovery
CI-based methods.
• Assumption: P_𝐗 is Markovian w.r.t. the graph (it makes the CI assertions that the graph does, in one of the three equivalent forms) and faithful to it (it makes no more CI assertions than the graph does).
• The Markov equivalence class is then identifiable (Lemma 7.2; Spirtes et al., 2000).
• Definition 6.24 (Markov equivalence of graphs): Two DAGs are said to be Markov equivalent if they have the same set of distributions that are Markovian w.r.t. them; equivalently, they entail the same set of d-separations.
• Lemma 6.25 (Graphical criteria for Markov equivalence; Verma and Pearl [1991]; Frydenberg [1990]): Two DAGs are Markov equivalent iff they have the same skeleton and the same immoralities (v-structures).
Causal Discovery
CI-based methods.
• Instances:
  • Inductive Causation (IC) algorithm [Pearl, 2009].
  • SGS algorithm (Spirtes, Glymour, Scheines) [Spirtes et al., 2000].
  • PC algorithm (Peter Spirtes & Clark Glymour) [Spirtes et al., 2000].
Causal Discovery
CI-based methods.
• Algorithm: estimation of the skeleton (a sketch is given below).
  • Lemma 7.8 [Verma and Pearl, 1991, Lemma 1]: (i) Two nodes X, Y in a DAG are adjacent iff they cannot be d-separated by any subset 𝐒 ⊆ 𝐗 \ {X, Y}. (ii) If two nodes X, Y in a DAG are not adjacent, then they are d-separated by either pa_X or pa_Y.
  • IC/SGS: for each pair of nodes (X, Y), search through all such 𝐒's; X and Y are adjacent iff they are conditionally dependent given every such 𝐒 (by (i)).
  • PC: efficient search for 𝐒. Start with a fully connected undirected graph and traverse the 𝐒's in order of increasing size. For size k, one only has to go through 𝐒's that are subsets of the current neighbors of X or of the current neighbors of Y (the contrapositive of (ii)).
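A sketch of skeleton estimation given a CI oracle (all names invented; a real implementation replaces ci_test with a statistical test on data, and the full PC algorithm further restricts the search to the current neighbors):

```python
from itertools import combinations

def skeleton(nodes, ci_test):
    """Lemma 7.8 (i): X, Y stay adjacent iff no conditioning set separates them."""
    adj = {frozenset(p) for p in combinations(nodes, 2)}  # start fully connected
    for size in range(len(nodes) - 1):                    # PC-style: growing |S|
        for pair in list(adj):
            x, y = tuple(pair)
            rest = [n for n in nodes if n not in pair]
            for s in combinations(rest, size):
                if ci_test(x, y, set(s)):                 # X indep. Y given S: drop edge
                    adj.discard(pair)
                    break
    return adj

# toy oracle for the chain A -> B -> C (the only CI is A indep. C given B)
oracle = lambda x, y, s: {x, y} == {"A", "C"} and "B" in s
print(skeleton(["A", "B", "C"], oracle))  # edges A-B and B-C remain
```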
Causal Discovery
CI-based methods.
• Algorithm: orientation of edges.
  • For a structure X − Z − Y with no edge between X and Y in the skeleton, let 𝐀 be a set that d-separates X and Y. The structure can be oriented as X → Z ← Y iff Z ∉ 𝐀.
  • More edges can then be oriented in order to, e.g., avoid cycles.
  • Meek's orientation rules [Meek, 1995]: a complete set of orientation rules.
Causal Discovery
Additive Noise Models (ANMs).
• Philosophy: define a family of "particularly natural" conditionals, such that if a conditional can be described using this family, then the conditional in the opposite direction cannot be covered by this family.
  • This determines a causal direction even for two variables with only observational data.
• Definition 7.3 (ANMs): We call an SCM ℭ an ANM if the structural assignments are of the form X_j ≔ f_j(pa_j) + N_j, j = 1, …, d. We further assume that the f_j's are differentiable and the N_j's have strictly positive densities.
• Recall that in the definition of an SCM, the N_j's are required to be jointly independent, in accordance with the independent mechanism principle.
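A bivariate direction-finding sketch in the ANM spirit, on an invented linear non-Gaussian (LiNGAM-style) example; the correlation-of-squares score is a crude stand-in for a proper independence test such as HSIC:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
y = 2 * x + rng.uniform(-1, 1, 100_000)  # ground-truth ANM: X -> Y

def sq_dep(a, b):
    """Crude dependence score: regression already removes correlation, so
    check the squares; near 0 iff residual and regressor look independent."""
    return abs(np.corrcoef(a ** 2, b ** 2)[0, 1])

res_fwd = y - x * (np.cov(x, y)[0, 1] / x.var())  # residual of y ~ x
res_bwd = x - y * (np.cov(x, y)[0, 1] / y.var())  # residual of x ~ y
print(sq_dep(res_fwd, x))  # ~0: forward residual is independent of the cause
print(sq_dep(res_bwd, y))  # clearly > 0: no backward additive model fits
```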
Causal Discovery
Additive Noise Models (ANMs).
• All the non-identifiable cases of ANMs [Zhang & Hyvärinen, 2009, Thm. 8; Peters et al., 2014, Prop. 23]: Consider the bivariate ANM X → Y, Y ≔ f(X) + N, where N ⊥ X and f is three times differentiable. Assume that f′(x) (log p_N)″(y) = 0 only at finitely many points (x, y). If there is a backward ANM Y → X, then one of the following must hold:

  Case | p_N | p_X | f
  (i) | Gaussian | Gaussian | linear
  (ii) | log-mix-lin-exp | log-mix-lin-exp | linear
  (iii), (iv) | log-mix-lin-exp | one-sided asymptotically exponential, or a generalized mixture of two exponentials | strictly monotonic with f′(x) → 0 as x → +∞ or x → −∞
  (v) | generalized mixture of two exponentials | two-sided asymptotically exponential | strictly monotonic with f′(x) → 0 as x → +∞ or x → −∞
Causal Discovery
Common choices of ANMs:
• In the multivariate case, even the linear Gaussian model with equal noise variances is identifiable [Peters and Bühlmann, 2014].
• Linear Non-Gaussian Acyclic Models (LiNGAMs) [Thm. 7.6; Shimizu et al., 2006].
• Nonlinear Gaussian Additive Noise Models [Thm. 7.7; Peters et al., 2014, Cor. 31].
Causal Discovery
Learning a causal graph with ANMs.
• Score-based methods (a sketch is given below).
  • Maximize the likelihood using a greedy search technique (for the optimization over the graph space).
  • For nonlinear Gaussian ANMs [Bühlmann et al., 2014]: given a current graph 𝒢, regress each variable on its parents and obtain the score
    log p(𝐗 | 𝒢) ≈ log p(𝐗 | 𝒢, f̂) = −Σ_{j=1}^d log v̂ar[R_j],
    where R_j ≔ X_j − f̂_j(pa_j) is the residual.
  • When computing the score of a neighboring graph, utilize the decomposition over nodes and update only the altered summand.
  • The method can be shown to be consistent [Bühlmann et al., 2014].
  • If the noise cannot be assumed to be Gaussian, one can estimate the noise distribution [Nowzohour and Bühlmann, 2016] and obtain an entropy-like score.
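A sketch of the score computation (linear least squares as a stand-in for the nonlinear regressions of Bühlmann et al., 2014; graph and data invented). The node-wise decomposition is what makes greedy neighbor moves cheap:

```python
import numpy as np

def score(data, parents):
    """data: (n, d) array; parents: dict node -> list of parent columns.
    Returns -sum_j log var(R_j); higher means a better fit."""
    total = 0.0
    for j, pa in parents.items():
        if pa:
            X = np.c_[data[:, pa], np.ones(len(data))]
            beta = np.linalg.lstsq(X, data[:, j], rcond=None)[0]
            resid = data[:, j] - X @ beta       # R_j = X_j - f_j(pa_j)
        else:
            resid = data[:, j] - data[:, j].mean()
        total -= np.log(resid.var())
    return total

# toy check on X0 -> X1: the true graph scores higher than the empty graph
rng = np.random.default_rng(0)
d = rng.normal(size=(10_000, 2))
d[:, 1] += 2 * d[:, 0]
print(score(d, {0: [], 1: [0]}), score(d, {0: [], 1: []}))
```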
Causal Discovery
Learning a causal graph with ANMs.
• RESIT (regression with subsequent independence test) [Mooij et al., 2009; Peters et al., 2014] (a sketch is given below):
  • Phase 1: based on the fact that for each node X_j, the corresponding N_j is independent of all non-descendants of X_j; in particular, for each sink node X_j, N_j ⊥ 𝐗 \ {X_j}. Iteratively identify a sink node, detach it, and recurse, which yields a causal order.
  • Phase 2: visit every node and eliminate incoming edges as long as the residuals remain independent.
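A sketch of RESIT phase 1 (linear regressions and a crude dependence measure stand in for the nonlinear regression and HSIC test of the real algorithm; data invented): repeatedly detach the node whose regression residual looks most independent of the remaining variables.

```python
import numpy as np

def dep(a, b):  # crude stand-in dependence measure (not a real HSIC)
    return abs(np.corrcoef(a, b)[0, 1]) + abs(np.corrcoef(a ** 2, b ** 2)[0, 1])

def causal_order(data):
    active = list(range(data.shape[1]))
    order = []
    while len(active) > 1:
        scores = {}
        for j in active:
            rest = [k for k in active if k != j]
            X = np.c_[data[:, rest], np.ones(len(data))]
            r = data[:, j] - X @ np.linalg.lstsq(X, data[:, j], rcond=None)[0]
            scores[j] = max(dep(r, data[:, k]) for k in rest)
        sink = min(scores, key=scores.get)  # a sink: residual indep. of the rest
        order.insert(0, sink)               # sinks go to the back of the order
        active.remove(sink)
    return active + order

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, 50_000)
x1 = 2 * x0 + rng.uniform(-1, 1, 50_000)  # ground truth: X0 -> X1
print(causal_order(np.c_[x0, x1]))        # [0, 1]
```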
Causal Discovery
Learning a causal graph with ANMs.
• Independence-based score [Peters et al., 2014]:
  • RESIT is asymmetric, so mistakes propagate and accumulate over the iterations.
  • Intuition: (i) under the true graph 𝒢⁰, all the residuals are independent of each other; (ii) causal minimality helps identify a unique DAG.
  • res_j^{𝒢,RM} denotes the residuals of node X_j when it is regressed on its parents specified in 𝒢 using regression method RM.
  • DM denotes a dependence measure; the second term of the score encourages causal minimality.
• ICA-based methods for LiNGAMs [Shimizu et al., 2006; 2011].
Causal Discovery
• CI-based methods.
  • More faithful to the data.
  • Identifiable only up to a Markov equivalence class.
  • CI test methods may be tricky.
• ANMs.
  • A relatively strong assumption/belief (that a certain family is particularly natural for modeling conditionals).
  • Identification within a Markov equivalence class.
• Leveraging intervened datasets.
  • Based on the invariance principle of causality.
  • Identification within a Markov equivalence class.
  • Stronger requirements on data.
  • [Peters et al., 2016; Romeijn & Williamson, 2018; Bengio et al., 2019].
Thanks!