What should an active causal learner value?
Neil R. Bramley¹, Jonathan D. Nelson², Maarten Speekenbrink¹, Vincenzo Crupi³, & David A. Lagnado¹
[email protected], [email protected], [email protected], [email protected], [email protected]
¹University College London  ²Max Planck Institute for Human Development  ³University of Turin
What is active causal learning?
• Intervening on a causal system provides important information about causal structure (Pearl, 2000; Spirtes et al., 1993; Steyvers, 2003; Bramley et al., 2014)
• Interventions render intervened-on variables independent of their normal causes: P(A | Do[B], C) ≠ P(A | B, C)
• Which interventions are useful depends on the prior and the hypothesis space
• Goal: maximize the probability of identifying the correct graph
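To make the Do-notation concrete, here is a minimal Monte Carlo sketch (not code from the poster) of a hypothetical two-variable system A → B under an assumed noisy-OR parameterisation: observing B = 1 raises the probability that its cause A is active, while forcing B = 1 by intervention leaves A at its base rate.

```python
import random

random.seed(0)

# Assumed illustrative parameters: spontaneous activation rate and causal power
BASE, POWER = 0.1, 0.8

def sample(do_b=None):
    """Sample (a, b) from the graph A -> B; an intervention on B
    cuts B off from its normal cause A."""
    a = random.random() < BASE
    if do_b is None:
        p_b = BASE + (1 - BASE) * POWER if a else BASE
        b = random.random() < p_b
    else:
        b = do_b
    return a, b

def p_a(samples):
    """Proportion of samples in which A is active."""
    return sum(a for a, _ in samples) / len(samples)

n = 100_000
observed = [s for s in (sample() for _ in range(n)) if s[1]]  # condition on B=1
forced = [sample(do_b=True) for _ in range(n)]                # Do[B=1]

print(round(p_a(observed), 2))  # P(A | B=1): raised well above the base rate
print(round(p_a(forced), 2))    # P(A | Do[B=1]): stays near the base rate .1
```

This is the sense in which P(A | Do[B], C) ≠ P(A | B, C): intervening severs the dependence that conditioning would exploit.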
[Figure: The active learning cycle, shown for two example problems over the three variables A, B, C. Steps: 1. Reach into the system; 2. Choose an intervention; 3. Observe the outcome; 4. Update beliefs (prior and posterior bar plots over the 25 candidate three-variable graphs); 5. Repeat.]
Strategies for choosing interventions myopically
Probability gain: Choose the intervention i ∈ I that maximises the expected maximum probability φ in the posterior over causal graphs G:
E[φ(G | i)] = ∑_{o∈O} φ(G | i, o) p(o | i), where φ(G | i, o) = max_{g∈G} p(g | i, o)
Information gain: Choose the intervention that minimises the expected Shannon entropy:
E[H(G | i)] = ∑_{o∈O} H(G | i, o) p(o | i), where H(G | i, o) = −∑_{g∈G} p(g | i, o) log₂ p(g | i, o)
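Both scoring rules can be sketched on a toy problem. The hypotheses, outcomes, and likelihoods below are made-up illustrative numbers, not the poster's 25-graph space: a single candidate intervention has two possible outcomes, and we score it under each rule.

```python
import math

# Hypothetical setup: 3 candidate graphs, 2 outcomes of an intervention i,
# with assumed likelihoods p(outcome | graph, intervention).
prior = {"g1": 0.5, "g2": 0.3, "g3": 0.2}
lik = {
    "g1": {"on": 0.9, "off": 0.1},
    "g2": {"on": 0.5, "off": 0.5},
    "g3": {"on": 0.2, "off": 0.8},
}
outcomes = ["on", "off"]

def posterior(o):
    """Bayes update after outcome o; also returns p(o | i)."""
    joint = {g: prior[g] * lik[g][o] for g in prior}
    z = sum(joint.values())
    return {g: p / z for g, p in joint.items()}, z

def expected_prob_gain():
    """E[phi(G|i)] = sum_o p(o|i) * max_g p(g|i,o)."""
    return sum(z * max(post.values())
               for post, z in (posterior(o) for o in outcomes))

def expected_entropy():
    """E[H(G|i)] = sum_o p(o|i) * H(posterior after o)."""
    def H(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)
    return sum(z * H(post) for post, z in (posterior(o) for o in outcomes))

print(expected_prob_gain())  # expected max posterior probability
print(expected_entropy())    # expected remaining Shannon entropy (bits)
```

Probability gain prefers the intervention whose first number is highest; information gain prefers the one whose second number is lowest. The two rankings can disagree, which is the question posed next.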
Can information gain beat probability gain at probability gain?
Following Bramley et al. (2014), we test performance learning all 25 of the three-variable graphs in an environment with a spontaneous activation rate of .1 and a causal power of .8:
[Figure: Performance by simulated strategy over 2-12 interventions. Left panel: mean probability correct; right panel: mean Shannon information gained. Lines: probability gain, information gain, random interventions.]
Beyond Shannon: Generalized entropy landscapes
Entropy is expected surprise across possible outcomes. Shannon entropy uses the normal (arithmetic) mean as the expectation (i.e. ∑_x Surprise(x) × p(x)) and log(1/p(x)) as the surprise. Different types of mean, and different types of surprise, can be used (Crupi and Tentori, 2014). Sharma-Mittal entropy (Nielsen and Nock, 2011) provides a unifying framework:
• Sharma-Mittal entropy: H_{α,β}(p) = 1/(1−β) [ (∑_x p(x)^α)^{(1−β)/(1−α)} − 1 ], where α > 0, α ≠ 1, and β ≠ 1
• Rényi (1961) entropy if degree β = 1
• Tsallis (1988) entropy if degree β = order α
• Error entropy (probability gain; Nelson et al., 2010) if degree β = 2, order α = ∞
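A sketch of the family and its limiting cases (natural-log convention in the limits; this is one way to implement it, handling α → 1, β → 1, and α → ∞ explicitly):

```python
import math

def sharma_mittal(p, alpha, beta, eps=1e-9):
    """Sharma-Mittal entropy H_{alpha,beta} of a distribution p,
    with the Shannon, Renyi, Tsallis, and error-entropy limits."""
    if math.isinf(alpha):
        if abs(beta - 1) < eps:
            return -math.log(max(p))          # min-entropy (Renyi, order inf)
        return (max(p) ** (beta - 1) - 1) / (1 - beta)  # beta=2: 1 - max(p)
    s = sum(pi ** alpha for pi in p if pi > 0)
    if abs(alpha - 1) < eps and abs(beta - 1) < eps:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)  # Shannon (nats)
    if abs(beta - 1) < eps:
        return math.log(s) / (1 - alpha)                      # Renyi
    if abs(alpha - 1) < eps:
        h = -sum(pi * math.log(pi) for pi in p if pi > 0)     # alpha -> 1 limit
        return (math.exp((1 - beta) * h) - 1) / (1 - beta)
    return (s ** ((1 - beta) / (1 - alpha)) - 1) / (1 - beta)

p = [0.7, 0.2, 0.1]
print(sharma_mittal(p, 1, 1))           # Shannon entropy in nats
print(sharma_mittal(p, 2, 2))           # Tsallis order 2: 1 - sum p^2 = 0.46
print(sharma_mittal(p, math.inf, 2))    # error entropy: 1 - max(p) = 0.3
```

Setting β = α recovers Tsallis, since the outer exponent becomes 1; setting β = 2, α = ∞ recovers 1 − max(p), whose expected reduction is exactly probability gain.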
[Figure: Generalised entropy landscapes, shown as contour plots over the probability simplex (edges p(x1) = 0, p(x2) = 0, p(x3) = 0) for orders α ∈ {.5, 1, 2, 10, ∞} and degrees β ∈ {.5, 1, 2, 10}, covering the Rényi (β = 1), Tsallis (β = α), Shannon (α = β = 1), and probability-gain (α = ∞, β = 2) special cases. Accompanying line plots show the elementwise contribution to entropy and the atomic surprise values as functions of the probability of an element, for Tsallis degrees 0.1, 0.5, 1 (Shannon surprise), 2 (quadratic surprise), and 10.]
Weird and wonderful (bad) entropy functions
To further broaden the space of entropy functions considered, we first created a variety of atomic surprise functions. From these we derived corresponding entropy functions, always defining entropy as ∑_x Surprise(x) × p(x). The resulting entropy functions vary in which entropy axioms (Csiszar, 2008) they satisfy.
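The recipe is: pick any surprise function of an element's probability, then take its expectation. The piecewise-linear "central kink" below is a hypothetical illustration with assumed breakpoints; the poster's exact kink functions are not specified here.

```python
def central_kink(p):
    """Hypothetical 'central kink' atomic surprise: piecewise linear,
    falling steeply from (0, 1) to (0.5, 0.4), then shallowly to (1, 0)."""
    return 1 - 1.2 * p if p <= 0.5 else 0.8 * (1 - p)

def expected_surprise_entropy(dist, surprise):
    """Entropy defined as expected surprise: sum_x surprise(p(x)) * p(x)."""
    return sum(surprise(p) * p for p in dist)

# A certain outcome carries zero surprise, so its entropy is zero
print(expected_surprise_entropy([1.0], central_kink))        # 0.0
print(expected_surprise_entropy([0.5, 0.5], central_kink))   # 0.4
print(expected_surprise_entropy([0.5, 0.3, 0.2], central_kink))
```

Whether such a function satisfies properties like concavity or additivity depends entirely on the shape of the surprise curve, which is what makes these "bad" entropies useful probes.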
[Figure: 'Bad' atomic surprise functions and the resulting elementwise contributions to entropy (line plots as functions of the probability of an element), together with the entropy landscape over the probability simplex for each function: central kink, early kink, late kink, three random steps, five random steps.]
Simulating many strategies in many environments
• Learners based on the 5 × 5 entropies in Sharma-Mittal space with α, β ∈ {1/10, 1/2, 1, 2, 10}, plus our 5 weird and wonderful entropy functions.
• Test cases: all 25 of the 3-variable causal graphs, through 8 sequentially chosen interventions. Simulations repeated 10 times each in 3 × 3 environments:
  1. Spontaneous activation rates of [.2, .4, .6]
  2. Causal powers of [.4, .6, .8]
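The inner loop of one such simulated learner can be sketched as follows. This is a deliberately tiny stand-in (two hypothetical graphs rather than the full 25, entropy minimisation as the example strategy), not the poster's simulation code.

```python
import math
import random

random.seed(1)

HYPS = ["A->B", "B->A"]
INTERVENTIONS = ["do_A", "do_B"]
BASE, POWER = 0.2, 0.8  # assumed rate/power, matching one environment above

def p_on(g, i):
    """p(the non-intervened node turns on | graph g, intervention i)."""
    caused = (g == "A->B" and i == "do_A") or (g == "B->A" and i == "do_B")
    return BASE + (1 - BASE) * POWER if caused else BASE

def lik(o, g, i):
    return p_on(g, i) if o else 1 - p_on(g, i)

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def update(belief, i, o):
    post = {g: belief[g] * lik(o, g, i) for g in HYPS}
    z = sum(post.values())
    return {g: p / z for g, p in post.items()}

def expected_H(belief, i):
    """Expected posterior entropy, weighting outcomes by p(o | i)."""
    total = 0.0
    for o in (True, False):
        z = sum(belief[g] * lik(o, g, i) for g in HYPS)
        total += z * H(update(belief, i, o))
    return total

belief = {g: 0.5 for g in HYPS}
truth = "A->B"
for _ in range(8):  # 8 sequentially chosen interventions
    i = min(INTERVENTIONS, key=lambda iv: expected_H(belief, iv))
    o = random.random() < p_on(truth, i)  # sample outcome from the true graph
    belief = update(belief, i, o)
print(max(belief, key=belief.get))  # most probable graph after 8 steps
```

Swapping in a different entropy function in `expected_H` (Sharma-Mittal, or one of the kinked functions) yields the other learners compared below.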
[Figure: Probability correct over 8 interventions. Left panel (generalised entropy simulations): α = 1/10, β = 1/10 (Tsallis low); α = 1/10, β = 1 (Rényi low); α = 10, β = 10 (Tsallis high); α = 10, β = 1 (Rényi high); α = 1, β = 1 (Shannon); α = ∞, β = 2 (probability gain). Right panel ('bad' entropy simulations): central kink, early kink, late kink, 3 step, 5 step.]
Results
• Shannon entropy achieves higher accuracy than probability gain
• Many entropy functions achieve accuracy similar to Shannon entropy
• But the interventions chosen vary widely:
[Figure: Intervention choice type proportions (N = 18000), broken down into observation, one-on, one-on one-off, two-on, one-off, and two-off, for: human data, probability gain, Shannon, Tsallis low, Rényi low, Tsallis high, Rényi high, central kink, early kink, late kink, three step, five step.]
Questions:
• In which instances do different strategies strongly contradict each other?
• Which properties of entropy functions are important for particular situations?
• When are particular entropy functions cognitively plausible?
References
Bramley, N. R., Lagnado, D. A., and Speekenbrink, M. (2014). Forgetful conservative scholars: How people learn causal structure through interventions. Journal of Experimental Psychology: Learning, Memory and Cognition, doi:10.1037/xlm0000061.
Crupi, V. and Tentori, K. (2014). State of the field: Measuring information and confirmation. Studies in History and Philosophy of Science Part A, 47:81-90.
Csiszar, I. (2008). Axiomatic characterizations of information measures. Entropy, 10(3):261-273.
Nelson, J. D., McKenzie, C. R., Cottrell, G. W., and Sejnowski, T. J. (2010). Experience matters: Information acquisition optimizes probability gain. Psychological Science, 21(7):960-969.
Nielsen, F. and Nock, R. (2011). A closed-form expression for the Sharma-Mittal entropy of exponential families. arXiv preprint arXiv:1112.4221.
Pearl, J. (2000). Causality. Cambridge University Press, New York.
Renyi, A. (1961). On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 547-561.
Spirtes, P., Glymour, C., and Scheines, R. (1993). Causation, Prediction, and Search. Springer Lecture Notes in Statistics. (2nd edition, 2000, MIT Press, Cambridge, MA.)
Steyvers, M. (2003). Inferring causal networks from observations and interventions. Cognitive Science, 27(3):453-489.
Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1-2):479-487.