What should an active causal learner value?
Neil R. Bramley¹, Jonathan D. Nelson², Maarten Speekenbrink¹, Vincenzo Crupi³, & David A. Lagnado¹
[email protected], [email protected], [email protected], [email protected], [email protected]
¹University College London  ²Max Planck Institute for Human Development  ³University of Turin
What is active causal learning?
• Intervening on a causal system provides important information about causal structure (Pearl, 2000; Spirtes et al., 1993; Steyvers, 2003; Bramley et al., 2014)
• Interventions render intervened-on variables independent of their normal causes: P(A | Do[B], C) ≠ P(A | B, C)
• Which interventions are useful depends on the prior and the hypothesis space
• Goal: maximize the probability of identifying the correct graph
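To make the Do-notation concrete, here is a minimal Monte Carlo sketch (not code from the poster) of a hypothetical two-variable system A → B under an assumed noisy-OR parameterisation: observing B = 1 raises the probability that its cause A is active, while forcing B = 1 by intervention leaves A at its base rate.

```python
import random

random.seed(0)

# Assumed illustrative parameters: spontaneous activation rate and causal power
BASE, POWER = 0.1, 0.8

def sample(do_b=None):
    """Sample (a, b) from the graph A -> B; an intervention on B
    cuts B off from its normal cause A."""
    a = random.random() < BASE
    if do_b is None:
        p_b = BASE + (1 - BASE) * POWER if a else BASE
        b = random.random() < p_b
    else:
        b = do_b
    return a, b

def p_a(samples):
    """Proportion of samples in which A is active."""
    return sum(a for a, _ in samples) / len(samples)

n = 100_000
observed = [s for s in (sample() for _ in range(n)) if s[1]]  # condition on B=1
forced = [sample(do_b=True) for _ in range(n)]                # Do[B=1]

print(round(p_a(observed), 2))  # P(A | B=1): raised well above the base rate
print(round(p_a(forced), 2))    # P(A | Do[B=1]): stays near the base rate .1
```

This is the sense in which P(A | Do[B], C) ≠ P(A | B, C): intervening severs the dependence that conditioning would exploit.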
[Figure: The active learning cycle, shown for two example problems over the three variables A, B, C. Steps: 1. Reach into the system; 2. Choose an intervention; 3. Observe the outcome; 4. Update beliefs (prior and posterior bar plots over the 25 candidate three-variable graphs); 5. Repeat.]
Strategies for choosing interventions myopically
Probability gain: Choose the intervention i ∈ I that maximises the expected maximum probability φ in the posterior over causal graphs G:
E[φ(G | i)] = ∑_{o∈O} φ(G | i, o) p(o | i), where φ(G | i, o) = max_{g∈G} p(g | i, o)
Information gain: Choose the intervention that minimises the expected Shannon entropy:
E[H(G | i)] = ∑_{o∈O} H(G | i, o) p(o | i), where H(G | i, o) = −∑_{g∈G} p(g | i, o) log₂ p(g | i, o)
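Both scoring rules can be sketched on a toy problem. The hypotheses, outcomes, and likelihoods below are made-up illustrative numbers, not the poster's 25-graph space: a single candidate intervention has two possible outcomes, and we score it under each rule.

```python
import math

# Hypothetical setup: 3 candidate graphs, 2 outcomes of an intervention i,
# with assumed likelihoods p(outcome | graph, intervention).
prior = {"g1": 0.5, "g2": 0.3, "g3": 0.2}
lik = {
    "g1": {"on": 0.9, "off": 0.1},
    "g2": {"on": 0.5, "off": 0.5},
    "g3": {"on": 0.2, "off": 0.8},
}
outcomes = ["on", "off"]

def posterior(o):
    """Bayes update after outcome o; also returns p(o | i)."""
    joint = {g: prior[g] * lik[g][o] for g in prior}
    z = sum(joint.values())
    return {g: p / z for g, p in joint.items()}, z

def expected_prob_gain():
    """E[phi(G|i)] = sum_o p(o|i) * max_g p(g|i,o)."""
    return sum(z * max(post.values())
               for post, z in (posterior(o) for o in outcomes))

def expected_entropy():
    """E[H(G|i)] = sum_o p(o|i) * H(posterior after o)."""
    def H(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)
    return sum(z * H(post) for post, z in (posterior(o) for o in outcomes))

print(expected_prob_gain())  # expected max posterior probability
print(expected_entropy())    # expected remaining Shannon entropy (bits)
```

Probability gain prefers the intervention whose first number is highest; information gain prefers the one whose second number is lowest. The two rankings can disagree, which is the question posed next.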
Can information gain beat probability gain at probability gain?
Following Bramley et al. (2014), we test performance learning all 25 of the three-variable graphs in an environment with a spontaneous activation rate of .1 and a causal power of .8:
[Figure: Performance by simulated strategy over 2-12 interventions. Left panel: mean probability correct; right panel: mean Shannon information gained. Lines: probability gain, information gain, random interventions.]
Beyond Shannon: Generalized entropy landscapes
Entropy is expected surprise across possible outcomes. Shannon entropy uses the normal (arithmetic) mean as the expectation (i.e. ∑_x Surprise(x) × p(x)) and log(1/p(x)) as the surprise. Different types of mean, and different types of surprise, can be used (Crupi and Tentori, 2014). Sharma-Mittal entropy (Nielsen and Nock, 2011) provides a unifying framework:
• Sharma-Mittal entropy: H_{α,β}(p) = 1/(1−β) [ (∑_x p(x)^α)^{(1−β)/(1−α)} − 1 ], where α > 0, α ≠ 1, and β ≠ 1
• Rényi (1961) entropy if degree β = 1
• Tsallis (1988) entropy if degree β = order α
• Error entropy (probability gain; Nelson et al., 2010) if degree β = 2, order α = ∞
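A sketch of the family and its limiting cases (natural-log convention in the limits; this is one way to implement it, handling α → 1, β → 1, and α → ∞ explicitly):

```python
import math

def sharma_mittal(p, alpha, beta, eps=1e-9):
    """Sharma-Mittal entropy H_{alpha,beta} of a distribution p,
    with the Shannon, Renyi, Tsallis, and error-entropy limits."""
    if math.isinf(alpha):
        if abs(beta - 1) < eps:
            return -math.log(max(p))          # min-entropy (Renyi, order inf)
        return (max(p) ** (beta - 1) - 1) / (1 - beta)  # beta=2: 1 - max(p)
    s = sum(pi ** alpha for pi in p if pi > 0)
    if abs(alpha - 1) < eps and abs(beta - 1) < eps:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)  # Shannon (nats)
    if abs(beta - 1) < eps:
        return math.log(s) / (1 - alpha)                      # Renyi
    if abs(alpha - 1) < eps:
        h = -sum(pi * math.log(pi) for pi in p if pi > 0)     # alpha -> 1 limit
        return (math.exp((1 - beta) * h) - 1) / (1 - beta)
    return (s ** ((1 - beta) / (1 - alpha)) - 1) / (1 - beta)

p = [0.7, 0.2, 0.1]
print(sharma_mittal(p, 1, 1))           # Shannon entropy in nats
print(sharma_mittal(p, 2, 2))           # Tsallis order 2: 1 - sum p^2 = 0.46
print(sharma_mittal(p, math.inf, 2))    # error entropy: 1 - max(p) = 0.3
```

Setting β = α recovers Tsallis, since the outer exponent becomes 1; setting β = 2, α = ∞ recovers 1 − max(p), whose expected reduction is exactly probability gain.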
[Figure: Generalised entropy landscapes, shown as contour plots over the probability simplex (edges p(x1) = 0, p(x2) = 0, p(x3) = 0) for orders α ∈ {.5, 1, 2, 10, ∞} and degrees β ∈ {.5, 1, 2, 10}, covering the Rényi (β = 1), Tsallis (β = α), Shannon (α = β = 1), and probability-gain (α = ∞, β = 2) special cases. Accompanying line plots show the elementwise contribution to entropy and the atomic surprise values as functions of the probability of an element, for Tsallis degrees 0.1, 0.5, 1 (Shannon surprise), 2 (quadratic surprise), and 10.]
Weird and wonderful (bad) entropy functions
To further broaden the space of entropy functions considered, we first created a variety of atomic surprise functions. From these we derived corresponding entropy functions, always defining entropy as ∑_x Surprise(x) × p(x). The resulting entropy functions vary in which entropy axioms (Csiszar, 2008) they satisfy.
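The recipe is: pick any surprise function of an element's probability, then take its expectation. The piecewise-linear "central kink" below is a hypothetical illustration with assumed breakpoints; the poster's exact kink functions are not specified here.

```python
def central_kink(p):
    """Hypothetical 'central kink' atomic surprise: piecewise linear,
    falling steeply from (0, 1) to (0.5, 0.4), then shallowly to (1, 0)."""
    return 1 - 1.2 * p if p <= 0.5 else 0.8 * (1 - p)

def expected_surprise_entropy(dist, surprise):
    """Entropy defined as expected surprise: sum_x surprise(p(x)) * p(x)."""
    return sum(surprise(p) * p for p in dist)

# A certain outcome carries zero surprise, so its entropy is zero
print(expected_surprise_entropy([1.0], central_kink))        # 0.0
print(expected_surprise_entropy([0.5, 0.5], central_kink))   # 0.4
print(expected_surprise_entropy([0.5, 0.3, 0.2], central_kink))
```

Whether such a function satisfies properties like concavity or additivity depends entirely on the shape of the surprise curve, which is what makes these "bad" entropies useful probes.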
[Figure: 'Bad' atomic surprise functions and the resulting elementwise contributions to entropy (line plots as functions of the probability of an element), together with the entropy landscape over the probability simplex for each function: central kink, early kink, late kink, three random steps, five random steps.]
Simulating many strategies in many environments
• Learners based on the 5 × 5 entropies in Sharma-Mittal space with α, β ∈ {1/10, 1/2, 1, 2, 10}, plus our 5 weird and wonderful entropy functions.
• Test cases: all 25 of the 3-variable causal graphs, through 8 sequentially chosen interventions. Simulations repeated 10 times each in 3 × 3 environments:
  1. Spontaneous activation rates of [.2, .4, .6]
  2. Causal powers of [.4, .6, .8]
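The inner loop of one such simulated learner can be sketched as follows. This is a deliberately tiny stand-in (two hypothetical graphs rather than the full 25, entropy minimisation as the example strategy), not the poster's simulation code.

```python
import math
import random

random.seed(1)

HYPS = ["A->B", "B->A"]
INTERVENTIONS = ["do_A", "do_B"]
BASE, POWER = 0.2, 0.8  # assumed rate/power, matching one environment above

def p_on(g, i):
    """p(the non-intervened node turns on | graph g, intervention i)."""
    caused = (g == "A->B" and i == "do_A") or (g == "B->A" and i == "do_B")
    return BASE + (1 - BASE) * POWER if caused else BASE

def lik(o, g, i):
    return p_on(g, i) if o else 1 - p_on(g, i)

def H(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def update(belief, i, o):
    post = {g: belief[g] * lik(o, g, i) for g in HYPS}
    z = sum(post.values())
    return {g: p / z for g, p in post.items()}

def expected_H(belief, i):
    """Expected posterior entropy, weighting outcomes by p(o | i)."""
    total = 0.0
    for o in (True, False):
        z = sum(belief[g] * lik(o, g, i) for g in HYPS)
        total += z * H(update(belief, i, o))
    return total

belief = {g: 0.5 for g in HYPS}
truth = "A->B"
for _ in range(8):  # 8 sequentially chosen interventions
    i = min(INTERVENTIONS, key=lambda iv: expected_H(belief, iv))
    o = random.random() < p_on(truth, i)  # sample outcome from the true graph
    belief = update(belief, i, o)
print(max(belief, key=belief.get))  # most probable graph after 8 steps
```

Swapping in a different entropy function in `expected_H` (Sharma-Mittal, or one of the kinked functions) yields the other learners compared below.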
[Figure: Probability correct over 8 interventions. Left panel (generalised entropy simulations): α = 1/10, β = 1/10 (Tsallis low); α = 1/10, β = 1 (Rényi low); α = 10, β = 10 (Tsallis high); α = 10, β = 1 (Rényi high); α = 1, β = 1 (Shannon); α = ∞, β = 2 (probability gain). Right panel ('bad' entropy simulations): central kink, early kink, late kink, 3 step, 5 step.]
Results
• Shannon entropy achieves higher accuracy than probability gain
• Many entropy functions achieve accuracy similar to Shannon entropy
• But the interventions chosen vary widely:
[Figure: Intervention choice type proportions (N = 18000), broken down into observation, one-on, one-on one-off, two-on, one-off, and two-off, for: human data, probability gain, Shannon, Tsallis low, Rényi low, Tsallis high, Rényi high, central kink, early kink, late kink, three step, five step.]
Questions:
• In which instances do different strategies strongly contradict each other?
• Which properties of entropy functions are important for particular situations?
• When are particular entropy functions cognitively plausible?
References
Bramley, N. R., Lagnado, D. A., and Speekenbrink, M. (2014). Forgetful conservative scholars: How people learn causal structure through interventions. Journal of Experimental Psychology: Learning, Memory and Cognition, doi:10.1037/xlm0000061.
Crupi, V. and Tentori, K. (2014). State of the field: Measuring information and confirmation. Studies in History and Philosophy of Science Part A, 47:81-90.
Csiszar, I. (2008). Axiomatic characterizations of information measures. Entropy, 10(3):261-273.
Nelson, J. D., McKenzie, C. R., Cottrell, G. W., and Sejnowski, T. J. (2010). Experience matters: Information acquisition optimizes probability gain. Psychological Science, 21(7):960-969.
Nielsen, F. and Nock, R. (2011). A closed-form expression for the Sharma-Mittal entropy of exponential families. arXiv preprint arXiv:1112.4221.
Pearl, J. (2000). Causality. Cambridge University Press, New York.
Renyi, A. (1961). On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 547-561.
Spirtes, P., Glymour, C., and Scheines, R. (1993). Causation, Prediction, and Search. Springer Lecture Notes in Statistics. (2nd edition, 2000, MIT Press, Cambridge, MA.)
Steyvers, M. (2003). Inferring causal networks from observations and interventions. Cognitive Science, 27(3):453-489.
Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1-2):479-487.