Causal Modeling with Generative Neural Networks
Michèle Sebag, TAO, CNRS − INRIA − LRI − Université Paris-Sud
Joint work: D. Kalainathan, O. Goudet, I. Guyon, M. Hajaiej, A. Decelle, C. Furtlehner
https://arxiv.org/abs/1709.05321
Credit for slides: Yann LeCun
Leiden − Sept. 2017
1 / 27
Motivation
State of the art
Causal Generative Neural Nets
Naive ML Approach to SW
2 / 27
ML: discriminative or generative modelling
Given a training set, usually i.i.d. samples ∼ P(X, Y):
E = {(x_i, y_i), x_i ∈ ℝ^d, i = 1 . . . n}
Find
▶ Supervised learning: h : X → Y, or P(Y|X)
▶ Generative model: P(X, Y)
Predictive modelling might be based on correlations:
If umbrellas in the street, Then it rains
3 / 27
The big data promise: ML models are expected to support interventions:
▶ health and nutrition
▶ education
▶ economics/management
▶ climate
Intervention (Pearl 2009)
An intervention do(X = x) forces variable X to value x
Direct cause X_i → X_j:
P(X_j | do(X_i = x, X_{∖ij} = c)) ≠ P(X_j | do(X_i = x′, X_{∖ij} = c))
Example: C: Cancer, S: Smoking, G: Genetic factors
P(C | do{S = 0, G = 0}) ≠ P(C | do{S = 1, G = 0})
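A minimal sketch of what an intervention computes, on a hypothetical three-variable SCM for (G, S, C); all mechanisms and coefficients below are made-up assumptions for illustration, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample_cancer(do_S=None, do_G=None):
    """Sample C from a toy SCM G -> S, (S, G) -> C, under optional do()."""
    G = rng.binomial(1, 0.3, n) if do_G is None else np.full(n, do_G)
    S = rng.binomial(1, 0.2 + 0.5 * G) if do_S is None else np.full(n, do_S)
    C = rng.binomial(1, 0.05 + 0.30 * S + 0.10 * G)
    return C

# S is a direct cause of C: clamping G and switching do(S) shifts P(C)
print(sample_cancer(do_S=1, do_G=0).mean())  # ≈ 0.35
print(sample_cancer(do_S=0, do_G=0).mean())  # ≈ 0.05
```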
4 / 27
Correlations do not support interventions
Causal models are needed to support interventions
5 / 27
Why is this relevant to space weather?
Causal models support understanding
Causal models are more robust, e.g., to concept drift:
▶ Given observations drawn from P(X), P(Y|X), find P̂(Y|X) minimizing
  𝔼_{x∼P(X)} [ argmax_y P(y|x) − argmax_y P̂(y|x) ]
▶ But P(X) in production might differ from P(X) in training
6 / 27
Causal modelling, how?
Historically, based on interventions. However, interventions are often
▶ impossible (climate)
▶ unethical (making people smoke)
▶ too expensive (e.g., in economics)
Machine Learning alternatives
▶ Observational data
▶ Statistical tests
▶ Learned models
▶ Prior knowledge / Assumptions / Constraints
7 / 27
Motivation
State of the art
Causal Generative Neural Nets
Naive ML Approach to SW
8 / 27
Functional Causal Models, a.k.a. Structural Equation Models
X_i = f_i(Pa(X_i), E_i)
Pa(X_i): the direct causes of X_i; the noise variables E_i gather all unobserved influences
X_1 = f_1(E_1)
X_2 = f_2(X_1, E_2)
X_3 = f_3(X_1, E_3)
X_4 = f_4(E_4)
X_5 = f_5(X_3, X_4, E_5)
Tasks
▶ Finding the structure of the graph (no cycles)
▶ Finding the functions (f_i)
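Such an FCM is a sampler: draw the noises, then evaluate each X_i in topological order. A minimal sketch of the five-variable graph above, where the concrete f_i and noise distributions are illustrative assumptions (the slide leaves them unspecified):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_fcm(n):
    """Sample n points from the 5-variable FCM; the f_i are assumed forms."""
    E1, E2, E3, E4, E5 = rng.normal(size=(5, n))  # noise variables
    X1 = E1                        # X1 = f1(E1)
    X2 = np.tanh(X1) + 0.3 * E2    # X2 = f2(X1, E2)
    X3 = X1 ** 2 + 0.3 * E3        # X3 = f3(X1, E3)
    X4 = E4                        # X4 = f4(E4)
    X5 = X3 - X4 + 0.3 * E5        # X5 = f5(X3, X4, E5)
    return np.stack([X1, X2, X3, X4, X5], axis=1)

data = sample_fcm(1000)            # one row per observation
```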
9 / 27
Conducting a causal modelling study
Milestones
▶ Testing bivariate independence (statistical tests): find the edges X − Y, Y − Z
▶ Conditional independence: prune the edges (X ⊥⊥ Z | Y)
▶ Full causal graph modelling: orient the edges (X → Y → Z)
Challenges
▶ Computational complexity: tractable approximations
▶ Conditional independence: data-hungry tests
▶ Assuming causal sufficiency (can be relaxed)
10 / 27
X − Y independence
P(X, Y) =? P(X) · P(Y)
Categorical variables
▶ Entropy H(X) = −Σ_x p(x) log p(x), with x a value taken by X and p(x) its frequency
▶ Mutual information M(X, Y) = H(X) + H(Y) − H(X, Y)
▶ Others: χ², G-test
Continuous variables
▶ t-test, z-test
▶ Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al. 05)
  Given f : X → ℝ and g : Y → ℝ,
  Cov(f, g) = 𝔼_{x,y}[f(x) g(y)] − 𝔼_x[f(x)] 𝔼_y[g(y)]
  Cov(f, g) = 0 for all f, g iff X and Y are independent
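HSIC estimates this criterion in a reproducing kernel Hilbert space, where checking all f, g reduces to a trace of centered kernel matrices. A minimal sketch of the biased empirical estimator with Gaussian kernels (bandwidth and test threshold are assumptions; Gretton et al. 05 give the full statistic and its null distribution):

```python
import numpy as np

def hsic(x, y, gamma=1.0):
    """Biased empirical HSIC with Gaussian kernels on two 1-d samples."""
    n = len(x)
    K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)  # kernel on X
    L = np.exp(-gamma * (y[:, None] - y[None, :]) ** 2)  # kernel on Y
    H = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(hsic(x, x**2 + 0.1 * rng.normal(size=500)))  # dependent: clearly > 0
print(hsic(x, rng.normal(size=500)))               # independent: ≈ 0
```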
11 / 27
An ML approach (Guyon et al. 2014-2015)
E = {(A_i, B_i, ℓ_i), ℓ_i ∈ {→, ←, ⊥⊥}}
Learn a classifier that maps a sample of a variable pair (A, B) to its causal label ℓ
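This casts cause-effect discovery as supervised learning over pairs. A sketch of that view, with a few hand-picked summary statistics standing in for the rich feature sets used in the challenges (the features, model, and label encoding here are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def featurize(a, b):
    """Summary statistics of one (A, B) sample; deliberately simplistic."""
    return [np.corrcoef(a, b)[0, 1],
            abs(np.mean(a ** 3)), abs(np.mean(b ** 3)),  # skewness proxies
            a.std(), b.std()]

def fit_pair_classifier(pairs, labels):
    """pairs: list of (a, b) arrays; labels: 0 '->', 1 '<-', 2 'indep'."""
    X = np.array([featurize(a, b) for a, b in pairs])
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
```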
12 / 27
Exploiting the distribution asymmetry
Hoyer et al. 09; Mooij et al. 2016
[Figure: scatter plot of the data with the two fits Y = f(X) and X = g(Y), and the residual plots f(X) − Y and g(Y) − X]
True model, with noise ε independent of X:
Y = X + ε
Learn Y = f(X) and plot the residual Y − f(X); learn X = g(Y) and plot the residual X − g(Y)
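Only in the causal direction is the residual independent of the input; the reverse fit leaves structure in its residual. A self-contained sketch of this check, where polynomial regression stands in for a generic regressor and HSIC (as in the earlier sketch) scores independence:

```python
import numpy as np

def hsic(x, y, gamma=1.0):
    # biased HSIC estimate, as in the earlier sketch
    n = len(x)
    K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)
    L = np.exp(-gamma * (y[:, None] - y[None, :]) ** 2)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
y = x + 0.3 * rng.normal(size=500)   # true model: Y = X + eps

def residual(a, b, deg=5):
    # polynomial regression stands in for a generic nonparametric fit
    return b - np.polyval(np.polyfit(a, b, deg), a)

print(hsic(x, residual(x, y)))   # causal direction: residual ⊥⊥ X, ≈ 0
print(hsic(y, residual(y, x)))   # anticausal direction: residual depends on Y, larger
```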
13 / 27
Exploiting the asymmetry, 2
Given A, B, learn
▶ A = f(B)
▶ B = g(A)
Retain the model with the best fit: A → B
A: altitude of a city, B: temperature
15 / 27
Find V-structures: A ⊥⊥ C and A ⊥̸⊥ C | B
Explaining away causes
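A numerical illustration of explaining away, on a made-up example where A and C are independent causes of a common effect B = A OR C; conditioning on B makes the causes dependent:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.binomial(1, 0.5, 100_000)
C = rng.binomial(1, 0.5, 100_000)
B = A | C                                   # common effect of A and C

print(np.corrcoef(A, C)[0, 1])              # ≈ 0: A ⊥⊥ C
mask = B == 1
print(np.corrcoef(A[mask], C[mask])[0, 1])  # < 0: A ⊥̸⊥ C | B (explaining away)
```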
16 / 27
Motivation
State of the art
Causal Generative Neural Nets
Naive ML Approach to SW
17 / 27
Auto-Encoders
Training set: E = {x_i, x_i ∈ ℝ^d, i = 1 . . . n}
Structure of the Auto-Encoder
Minimization of the Mean Squared Error (MSE):
Minimize Σ_i ‖x_i − x′_i‖²
Output: z, a compressed representation of x
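A minimal sketch of such an auto-encoder and its MSE training loop (the framework, layer sizes, and data below are assumptions, not the talk's setup):

```python
import torch
import torch.nn as nn

d, k = 784, 32                          # input dim, bottleneck dim (assumed)
model = nn.Sequential(
    nn.Linear(d, k), nn.ReLU(),         # encoder: x -> z
    nn.Linear(k, d),                    # decoder: z -> x'
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(256, d)                 # stand-in training batch
for _ in range(100):
    opt.zero_grad()
    loss = ((model(x) - x) ** 2).sum(dim=1).mean()  # MSE ||x - x'||^2
    loss.backward()
    opt.step()
```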
18 / 27
Stacked Auto-Encoders
E = {x_i, x_i ∈ ℝ^d, i = 1 . . . n}
Differences
▶ Several hidden layers
▶ Minimize the MSE or a cross-entropy loss:
Minimize −Σ_{i,j} [ x_{i,j} log x̂_{i,j} + (1 − x_{i,j}) log(1 − x̂_{i,j}) ], with x̂ the reconstruction
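For inputs scaled to [0, 1], this is the per-pixel binary cross-entropy; a short sketch with stand-in tensors:

```python
import torch
import torch.nn.functional as F

x = torch.rand(256, 784)                      # inputs x_{i,j} in [0, 1]
x_hat = torch.sigmoid(torch.randn(256, 784))  # stand-in reconstruction
loss = F.binary_cross_entropy(x_hat, x)       # mean of -[x log x̂ + (1-x) log(1-x̂)]
```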
19 / 27
Variational Auto-Encoders
Kingma et al. 13
E = {x_i, x_i ∈ ℝ^d, i = 1 . . . n}
Difference
▶ Hidden layer: parameters of a distribution N(μ, σ²)
▶ Distribution used to generate values z = μ + σ × N(0, 1)
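The z = μ + σ × N(0, 1) reparameterization keeps sampling differentiable, so the encoder can be trained by backpropagation. A minimal sketch (sizes, data, and the single-layer networks are assumptions):

```python
import torch
import torch.nn as nn

d, k = 784, 16                        # input dim, latent dim (assumed)
enc = nn.Linear(d, 2 * k)             # outputs (mu, log sigma^2)
dec = nn.Linear(k, d)

x = torch.randn(32, d)                # stand-in batch
mu, logvar = enc(x).chunk(2, dim=1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
x_rec = dec(z)

# ELBO loss: reconstruction + KL(N(mu, sigma^2) || N(0, 1))
kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1).mean()
loss = ((x_rec - x) ** 2).sum(dim=1).mean() + kl
```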
20 / 27
Causal Generative Neural Nets
Goudet et al. 17
E = {x_i, x_i ∈ ℝ^d, i = 1 . . . n}        (observed data)
E′ = {x′_i, x′_i ∈ ℝ^d, i = 1 . . . n′}    (generated data)
▶ Train the generator to minimize the "distance" between the original and the generated data in ℝ^d:
MMD(G) = (1/n²) Σ_{i,j} k(x_i, x_j) + (1/n′²) Σ_{i,j} k(x′_i, x′_j) − (2/(n n′)) Σ_{i,j} k(x_i, x′_j)

k(x, z) = Σ_i exp(−γ_i ‖x − z‖²), γ_i ∈ {10⁻² . . . 10²}
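A numpy sketch of this criterion, the biased MMD² estimate with a multi-bandwidth Gaussian kernel (the exact bandwidth grid below is an assumption within the slide's 10⁻² . . . 10² range). In the CGNN, E′ is resampled from the candidate generator at each step and this quantity is driven down by gradient descent:

```python
import numpy as np

def mmd2(X, Xp, gammas=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Biased MMD^2 between samples X of shape (n, d) and Xp of shape (n', d)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # ||a - b||^2
        return sum(np.exp(-g * d2) for g in gammas)          # multi-bandwidth kernel
    n, m = len(X), len(Xp)
    return (k(X, X).sum() / n**2 + k(Xp, Xp).sum() / m**2
            - 2 * k(X, Xp).sum() / (n * m))
```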
22 / 27
Relaxing the causal sufficiency assumption
Shared noise variables E_{i,j} model hidden common causes of X_i and X_j:
X_2 = f_2(E_2, E_{2,3})
X_3 = f_3(E_3, E_{2,3}, E_{3,5})
X_4 = f_4(E_4, E_{4,5})
X_5 = f_5(X_3, X_4, E_5, E_{3,5}, E_{4,5})
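A sketch of sampling from this confounded FCM; as before the functional forms are illustrative assumptions, and only the noise-sharing pattern follows the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_confounded(n):
    """Sample the slide's FCM with shared noises E_{i,j} (hidden confounders)."""
    E2, E3, E4, E5 = rng.normal(size=(4, n))
    E23, E35, E45 = rng.normal(size=(3, n))   # hidden common causes
    X2 = E2 + E23                             # X2 = f2(E2, E23)
    X3 = E3 + E23 + E35                       # X3 = f3(E3, E23, E35)
    X4 = E4 + E45                             # X4 = f4(E4, E45)
    X5 = X3 - X4 + E5 + E35 + E45             # X5 = f5(X3, X4, E5, E35, E45)
    return np.stack([X2, X3, X4, X5], axis=1)
```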
23 / 27
Graph inference
Results: area under the precision/recall curve

Algorithm        G2               G3               G4
Constraint-based
  PC-Gaussian    82.3 ±4  (87.8)  80.0 ±7  (89.2)  88.1 ±10 (95.7)
  PC-HSIC        93.4 ±3  (78.5)  93.0 ±4  (77.9)  98.9 ±2  (88.0)
Score-based
  GES            75.3 ±7  (81.2)  73.6 ±7  (77.7)  69.3 ±11 (78.6)
Pairwise orientation
  LiNGAM         64.4 ±4  (100)   71.1 ±1  (100)   71.6 ±7  (100)
  ANM            72.9 ±9  (100)   72.5 ±4  (100)   79.9 ±5  (100)
  Jarfo          69.9 ±9  (100)   87.3 ±3  (100)   88.5 ±5  (100)
CGNN-Fourier     94.5 ±2  (100)   84.9 ±9  (100)   93.6 ±3  (100)
CGNN-MMD         96.9 ±1  (100)   96.5 ±3  (100)   97.2 ±3  (100)
Python framework available at:https://github.com/Diviyan-Kalainathan/CausalDiscoveryToolbox
Caveat: up to 50 variables
24 / 27
Motivation
State of the art
Causal Generative Neural Nets
Naive ML Approach to SW
25 / 27
Compact solar state representations
Principle
9
Image preprocessing
10
Autoencoders
• Dimensionality reduction
• Input and output similarity
• Bottleneck
• 256×256 → 512
11
Autoencoders
• 512x512 → 512
12
Autoencoders
• 256x256 → 64
13
Variational Autoencoder
• Assumption on the latent space distribution
• 256x256 → 90
14
Autoencoders training
• Intermediate image size
• Custom loss:
  loss = (y_true − y_pred)² / (y_true + ε)^α + (y_true − y_pred)² / (1 − y_true + ε)^α
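A direct transcription of this loss in numpy (a sketch; the ε and α values are assumptions): errors on pixels near 0 or near 1 are up-weighted by the two denominators.

```python
import numpy as np

def custom_loss(y_true, y_pred, eps=1e-6, alpha=1.0):
    """Reconstruction loss up-weighting errors on extreme pixel values."""
    sq = (y_true - y_pred) ** 2
    return np.mean(sq / (y_true + eps) ** alpha
                   + sq / (1.0 - y_true + eps) ** alpha)
```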
15
Results
Autoencoder       Conv    Conv + Dense   Conv + PCA   Variational
Reduction rate    1/128   1/1024         1/524        1/728
• Visual similarity
• Smoothness over time
• Classification for verification
16
Results
Event             precision   recall   accuracy   F1-score
Coronal hole      0.74        0.36     0.62       0.48
Lepping           0.90        0.51     0.77       0.65
Pseudo streamer   0.66        0.93     0.78       0.77
Strahl            0.55        0.98     0.73       0.70
* Random predictor performances are 0.625 for accuracy and 0.25 for the rest
• Only 8000 labeled images
• Time distribution
• Prediction at L1
• Low performances
• Let’s extract more information
17
Going further
Classification of solar events
▶ More data
▶ Caveat: the train/test split
Predicting data at L1
▶ the propagation time from the Sun to L1
▶ help needed!
26 / 27
Thanks
▶ Olivier Goudet, Diviyan Kalainathan, Isabelle Guyon, Aris Tritas
▶ Mhamed Hajaiej, Cyril Furtlehner, Aurélien Decelle
27 / 27