Causal Modeling with Generative Neural Networks
Michèle Sebag, TAO, CNRS − INRIA − LRI − Université Paris-Sud
Joint work: D. Kalainathan, O. Goudet, I. Guyon, M. Hajaiej, A. Decelle, C. Furtlehner
https://arxiv.org/abs/1709.05321
Credit for slides: Yann LeCun
Leiden − Sept. 2017
1 / 27
Motivation
State of the art
Causal Generative Neural Nets
Naive ML Approach to SW
2 / 27
ML: discriminative or generative modelling
Given a training set, usually i.i.d. samples ∼ P(X, Y):
E = {(x_i, y_i), x_i ∈ ℝ^d, i = 1 . . . n}
Find
▶ Supervised learning: h : X → Y, or P(Y|X)
▶ Generative model: P(X, Y)
Predictive modelling might be based on correlations:
If umbrellas in the street, Then it rains
3 / 27
The big data promise: ML models are expected to support interventions:
▶ health and nutrition
▶ education
▶ economics/management
▶ climate
Intervention (Pearl 2009)
An intervention do(X = x) forces variable X to value x
Direct cause X_i → X_j:
P(X_j | do(X_i = x, X_{∖ij} = c)) ≠ P(X_j | do(X_i = x′, X_{∖ij} = c))
Example: C: Cancer, S: Smoking, G: Genetic factors
P(C | do{S = 0, G = 0}) ≠ P(C | do{S = 1, G = 0})
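A minimal sketch of what an intervention computes, on a hypothetical three-variable SCM for (G, S, C); all mechanisms and coefficients below are made-up assumptions for illustration, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample_cancer(do_S=None, do_G=None):
    """Sample C from a toy SCM G -> S, (S, G) -> C, under optional do()."""
    G = rng.binomial(1, 0.3, n) if do_G is None else np.full(n, do_G)
    S = rng.binomial(1, 0.2 + 0.5 * G) if do_S is None else np.full(n, do_S)
    C = rng.binomial(1, 0.05 + 0.30 * S + 0.10 * G)
    return C

# S is a direct cause of C: clamping G and switching do(S) shifts P(C)
print(sample_cancer(do_S=1, do_G=0).mean())  # ≈ 0.35
print(sample_cancer(do_S=0, do_G=0).mean())  # ≈ 0.05
```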
4 / 27
Correlations do not support interventions
Causal models are needed to support interventions
5 / 27
Why is this relevant to space weather?
Causal models support understanding
Causal models are more robust, e.g., to concept drift:
▶ Given observations drawn from P(X), P(Y|X), find P̂(Y|X) minimizing
  𝔼_{x∼P(X)} [ argmax_y P(y|x) − argmax_y P̂(y|x) ]
▶ But P(X) in production might differ from P(X) in training
6 / 27
Causal modelling, how?
Historically, based on interventions. However, interventions are often
▶ impossible (climate)
▶ unethical (making people smoke)
▶ too expensive (e.g., in economics)
Machine Learning alternatives
▶ Observational data
▶ Statistical tests
▶ Learned models
▶ Prior knowledge / Assumptions / Constraints
7 / 27
Motivation
State of the art
Causal Generative Neural Nets
Naive ML Approach to SW
8 / 27
Functional Causal Models, a.k.a. Structural Equation Models
X_i = f_i(Pa(X_i), E_i)
Pa(X_i): the direct causes of X_i; the noise variables E_i gather all unobserved influences
X_1 = f_1(E_1)
X_2 = f_2(X_1, E_2)
X_3 = f_3(X_1, E_3)
X_4 = f_4(E_4)
X_5 = f_5(X_3, X_4, E_5)
Tasks
▶ Finding the structure of the graph (no cycles)
▶ Finding the functions (f_i)
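Such an FCM is a sampler: draw the noises, then evaluate each X_i in topological order. A minimal sketch of the five-variable graph above, where the concrete f_i and noise distributions are illustrative assumptions (the slide leaves them unspecified):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_fcm(n):
    """Sample n points from the 5-variable FCM; the f_i are assumed forms."""
    E1, E2, E3, E4, E5 = rng.normal(size=(5, n))  # noise variables
    X1 = E1                        # X1 = f1(E1)
    X2 = np.tanh(X1) + 0.3 * E2    # X2 = f2(X1, E2)
    X3 = X1 ** 2 + 0.3 * E3        # X3 = f3(X1, E3)
    X4 = E4                        # X4 = f4(E4)
    X5 = X3 - X4 + 0.3 * E5        # X5 = f5(X3, X4, E5)
    return np.stack([X1, X2, X3, X4, X5], axis=1)

data = sample_fcm(1000)            # one row per observation
```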
9 / 27
Conducting a causal modelling study
Milestones
▶ Testing bivariate independence (statistical tests): find the edges X − Y, Y − Z
▶ Conditional independence: prune the edges (X ⊥⊥ Z | Y)
▶ Full causal graph modelling: orient the edges (X → Y → Z)
Challenges
▶ Computational complexity: tractable approximations
▶ Conditional independence: data-hungry tests
▶ Assuming causal sufficiency (can be relaxed)
10 / 27
X − Y independence
P(X, Y) =? P(X) · P(Y)
Categorical variables
▶ Entropy H(X) = −Σ_x p(x) log p(x), with x a value taken by X and p(x) its frequency
▶ Mutual information M(X, Y) = H(X) + H(Y) − H(X, Y)
▶ Others: χ², G-test
Continuous variables
▶ t-test, z-test
▶ Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al. 05)
  Given f : X → ℝ and g : Y → ℝ,
  Cov(f, g) = 𝔼_{x,y}[f(x) g(y)] − 𝔼_x[f(x)] 𝔼_y[g(y)]
  Cov(f, g) = 0 for all f, g iff X and Y are independent
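HSIC estimates this criterion in a reproducing kernel Hilbert space, where checking all f, g reduces to a trace of centered kernel matrices. A minimal sketch of the biased empirical estimator with Gaussian kernels (bandwidth and test threshold are assumptions; Gretton et al. 05 give the full statistic and its null distribution):

```python
import numpy as np

def hsic(x, y, gamma=1.0):
    """Biased empirical HSIC with Gaussian kernels on two 1-d samples."""
    n = len(x)
    K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)  # kernel on X
    L = np.exp(-gamma * (y[:, None] - y[None, :]) ** 2)  # kernel on Y
    H = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(hsic(x, x**2 + 0.1 * rng.normal(size=500)))  # dependent: clearly > 0
print(hsic(x, rng.normal(size=500)))               # independent: ≈ 0
```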
11 / 27
An ML approach (Guyon et al. 2014-2015)
E = {(A_i, B_i, ℓ_i), ℓ_i ∈ {→, ←, ⊥⊥}}
Learn a classifier that maps a sample of a variable pair (A, B) to its causal label ℓ
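This casts cause-effect discovery as supervised learning over pairs. A sketch of that view, with a few hand-picked summary statistics standing in for the rich feature sets used in the challenges (the features, model, and label encoding here are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def featurize(a, b):
    """Summary statistics of one (A, B) sample; deliberately simplistic."""
    return [np.corrcoef(a, b)[0, 1],
            abs(np.mean(a ** 3)), abs(np.mean(b ** 3)),  # skewness proxies
            a.std(), b.std()]

def fit_pair_classifier(pairs, labels):
    """pairs: list of (a, b) arrays; labels: 0 '->', 1 '<-', 2 'indep'."""
    X = np.array([featurize(a, b) for a, b in pairs])
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
```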
12 / 27
Exploiting the distribution asymmetry
Hoyer et al. 09; Mooij et al. 2016
[Figure: scatter plot of the data with the two fits Y = f(X) and X = g(Y), and the residual plots f(X) − Y and g(Y) − X]
True model, with noise ε independent of X:
Y = X + ε
Learn Y = f(X) and plot the residual Y − f(X); learn X = g(Y) and plot the residual X − g(Y)
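Only in the causal direction is the residual independent of the input; the reverse fit leaves structure in its residual. A self-contained sketch of this check, where polynomial regression stands in for a generic regressor and HSIC (as in the earlier sketch) scores independence:

```python
import numpy as np

def hsic(x, y, gamma=1.0):
    # biased HSIC estimate, as in the earlier sketch
    n = len(x)
    K = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)
    L = np.exp(-gamma * (y[:, None] - y[None, :]) ** 2)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
y = x + 0.3 * rng.normal(size=500)   # true model: Y = X + eps

def residual(a, b, deg=5):
    # polynomial regression stands in for a generic nonparametric fit
    return b - np.polyval(np.polyfit(a, b, deg), a)

print(hsic(x, residual(x, y)))   # causal direction: residual ⊥⊥ X, ≈ 0
print(hsic(y, residual(y, x)))   # anticausal direction: residual depends on Y, larger
```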
13 / 27
Exploiting the asymmetry, 2
Given A, B, learn
▶ A = f(B)
▶ B = g(A)
Retain the model with the best fit: A → B
A: altitude of a city, B: temperature
15 / 27
Find V-structures: A ⊥⊥ C and A ⊥̸⊥ C | B
Explaining away causes
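A numerical illustration of explaining away, on a made-up example where A and C are independent causes of a common effect B = A OR C; conditioning on B makes the causes dependent:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.binomial(1, 0.5, 100_000)
C = rng.binomial(1, 0.5, 100_000)
B = A | C                                   # common effect of A and C

print(np.corrcoef(A, C)[0, 1])              # ≈ 0: A ⊥⊥ C
mask = B == 1
print(np.corrcoef(A[mask], C[mask])[0, 1])  # < 0: A ⊥̸⊥ C | B (explaining away)
```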
16 / 27
Motivation
State of the art
Causal Generative Neural Nets
Naive ML Approach to SW
17 / 27
Auto-Encoders
Training set: E = {x_i, x_i ∈ ℝ^d, i = 1 . . . n}
Structure of the Auto-Encoder
Minimization of the Mean Squared Error (MSE):
Minimize Σ_i ‖x_i − x′_i‖²
Output: z, a compressed representation of x
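A minimal sketch of such an auto-encoder and its MSE training loop (the framework, layer sizes, and data below are assumptions, not the talk's setup):

```python
import torch
import torch.nn as nn

d, k = 784, 32                          # input dim, bottleneck dim (assumed)
model = nn.Sequential(
    nn.Linear(d, k), nn.ReLU(),         # encoder: x -> z
    nn.Linear(k, d),                    # decoder: z -> x'
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(256, d)                 # stand-in training batch
for _ in range(100):
    opt.zero_grad()
    loss = ((model(x) - x) ** 2).sum(dim=1).mean()  # MSE ||x - x'||^2
    loss.backward()
    opt.step()
```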
18 / 27
Stacked Auto-Encoders
E = {x_i, x_i ∈ ℝ^d, i = 1 . . . n}
Differences
▶ Several hidden layers
▶ Minimize the MSE or a cross-entropy loss:
Minimize −Σ_{i,j} [ x_{i,j} log x̂_{i,j} + (1 − x_{i,j}) log(1 − x̂_{i,j}) ], with x̂ the reconstruction
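For inputs scaled to [0, 1], this is the per-pixel binary cross-entropy; a short sketch with stand-in tensors:

```python
import torch
import torch.nn.functional as F

x = torch.rand(256, 784)                      # inputs x_{i,j} in [0, 1]
x_hat = torch.sigmoid(torch.randn(256, 784))  # stand-in reconstruction
loss = F.binary_cross_entropy(x_hat, x)       # mean of -[x log x̂ + (1-x) log(1-x̂)]
```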
19 / 27
Variational Auto-Encoders
Kingma et al. 13
E = {x_i, x_i ∈ ℝ^d, i = 1 . . . n}
Difference
▶ Hidden layer: parameters of a distribution N(μ, σ²)
▶ Distribution used to generate values z = μ + σ × N(0, 1)
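The z = μ + σ × N(0, 1) reparameterization keeps sampling differentiable, so the encoder can be trained by backpropagation. A minimal sketch (sizes, data, and the single-layer networks are assumptions):

```python
import torch
import torch.nn as nn

d, k = 784, 16                        # input dim, latent dim (assumed)
enc = nn.Linear(d, 2 * k)             # outputs (mu, log sigma^2)
dec = nn.Linear(k, d)

x = torch.randn(32, d)                # stand-in batch
mu, logvar = enc(x).chunk(2, dim=1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
x_rec = dec(z)

# ELBO loss: reconstruction + KL(N(mu, sigma^2) || N(0, 1))
kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1).mean()
loss = ((x_rec - x) ** 2).sum(dim=1).mean() + kl
```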
20 / 27
Causal Generative Neural Nets
Goudet et al. 17
E = {x_i, x_i ∈ ℝ^d, i = 1 . . . n}        (observed data)
E′ = {x′_i, x′_i ∈ ℝ^d, i = 1 . . . n′}    (generated data)
▶ Train the generator to minimize the "distance" between the original and the generated data in ℝ^d:
MMD(G) = (1/n²) Σ_{i,j} k(x_i, x_j) + (1/n′²) Σ_{i,j} k(x′_i, x′_j) − (2/(n n′)) Σ_{i,j} k(x_i, x′_j)

k(x, z) = Σ_i exp(−γ_i ‖x − z‖²), γ_i ∈ {10⁻² . . . 10²}
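A numpy sketch of this criterion, the biased MMD² estimate with a multi-bandwidth Gaussian kernel (the exact bandwidth grid below is an assumption within the slide's 10⁻² . . . 10² range). In the CGNN, E′ is resampled from the candidate generator at each step and this quantity is driven down by gradient descent:

```python
import numpy as np

def mmd2(X, Xp, gammas=(0.01, 0.1, 1.0, 10.0, 100.0)):
    """Biased MMD^2 between samples X of shape (n, d) and Xp of shape (n', d)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # ||a - b||^2
        return sum(np.exp(-g * d2) for g in gammas)          # multi-bandwidth kernel
    n, m = len(X), len(Xp)
    return (k(X, X).sum() / n**2 + k(Xp, Xp).sum() / m**2
            - 2 * k(X, Xp).sum() / (n * m))
```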
22 / 27
Relaxing the causal sufficiency assumption
Shared noise variables E_{i,j} model hidden common causes of X_i and X_j:
X_2 = f_2(E_2, E_{2,3})
X_3 = f_3(E_3, E_{2,3}, E_{3,5})
X_4 = f_4(E_4, E_{4,5})
X_5 = f_5(X_3, X_4, E_5, E_{3,5}, E_{4,5})
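A sketch of sampling from this confounded FCM; as before the functional forms are illustrative assumptions, and only the noise-sharing pattern follows the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_confounded(n):
    """Sample the slide's FCM with shared noises E_{i,j} (hidden confounders)."""
    E2, E3, E4, E5 = rng.normal(size=(4, n))
    E23, E35, E45 = rng.normal(size=(3, n))   # hidden common causes
    X2 = E2 + E23                             # X2 = f2(E2, E23)
    X3 = E3 + E23 + E35                       # X3 = f3(E3, E23, E35)
    X4 = E4 + E45                             # X4 = f4(E4, E45)
    X5 = X3 - X4 + E5 + E35 + E45             # X5 = f5(X3, X4, E5, E35, E45)
    return np.stack([X2, X3, X4, X5], axis=1)
```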
23 / 27
Graph inference
Results: area under the precision/recall curve

Algorithm        G2               G3               G4
Constraint-based
  PC-Gaussian    82.3 ±4  (87.8)  80.0 ±7  (89.2)  88.1 ±10 (95.7)
  PC-HSIC        93.4 ±3  (78.5)  93.0 ±4  (77.9)  98.9 ±2  (88.0)
Score-based
  GES            75.3 ±7  (81.2)  73.6 ±7  (77.7)  69.3 ±11 (78.6)
Pairwise orientation
  LiNGAM         64.4 ±4  (100)   71.1 ±1  (100)   71.6 ±7  (100)
  ANM            72.9 ±9  (100)   72.5 ±4  (100)   79.9 ±5  (100)
  Jarfo          69.9 ±9  (100)   87.3 ±3  (100)   88.5 ±5  (100)
CGNN-Fourier     94.5 ±2  (100)   84.9 ±9  (100)   93.6 ±3  (100)
CGNN-MMD         96.9 ±1  (100)   96.5 ±3  (100)   97.2 ±3  (100)
Python framework available at:https://github.com/Diviyan-Kalainathan/CausalDiscoveryToolbox
Caveat: up to 50 variables
24 / 27
Motivation
State of the art
Causal Generative Neural Nets
Naive ML Approach to SW
25 / 27
Compact solar state representations
Principle
9
Image preprocessing
10
Autoencoders
• Dimensionality reduction
• Input and output similarity
• Bottleneck
• 256×256 → 512
11
Autoencoders
• 512x512 → 512
12
Autoencoders
• 256x256 → 64
13
Variational Autoencoder
• Assumption on the latent space distribution
• 256x256 → 90
14
Autoencoders training
• Intermediate image size
• Custom loss:
  loss = (y_true − y_pred)² / (y_true + ε)^α + (y_true − y_pred)² / (1 − y_true + ε)^α
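A direct transcription of this loss in numpy (a sketch; the ε and α values are assumptions): errors on pixels near 0 or near 1 are up-weighted by the two denominators.

```python
import numpy as np

def custom_loss(y_true, y_pred, eps=1e-6, alpha=1.0):
    """Reconstruction loss up-weighting errors on extreme pixel values."""
    sq = (y_true - y_pred) ** 2
    return np.mean(sq / (y_true + eps) ** alpha
                   + sq / (1.0 - y_true + eps) ** alpha)
```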
15
Results
Autoencoder       Conv    Conv + Dense   Conv + PCA   Variational
Reduction rate    1/128   1/1024         1/524        1/728
• Visual similarity
• Smoothness over time
• Classification for verification
16
Results
Event             precision   recall   accuracy   F1-score
Coronal hole      0.74        0.36     0.62       0.48
Lepping           0.90        0.51     0.77       0.65
Pseudo streamer   0.66        0.93     0.78       0.77
Strahl            0.55        0.98     0.73       0.70
* Random predictor performances are 0.625 for accuracy and 0.25 for the rest
• Only 8000 labeled images
• Time distribution
• Prediction at L1
• Low performances
• Let’s extract more information
17
Going further
Classification of solar events
▶ More data
▶ Caveat: the train/test split
Predicting data at L1
▶ the propagation time from the Sun to L1
▶ help needed!
26 / 27
Thanks
▶ Olivier Goudet, Diviyan Kalainathan, Isabelle Guyon, Aris Tritas
▶ Mhamed Hajaiej, Cyril Furtlehner, Aurélien Decelle
27 / 27