
Page 1: Information theory, MDL and human cognition

Information theory, MDL and human cognition

Nick Chater, Department of Psychology, University College London

[email protected]

Page 2: Information theory, MDL and human cognition

Overview

Bayes and MDL: An overview
Universality and induction
Some puzzles about model fitting
Cognitive science applications

Page 3: Information theory, MDL and human cognition

Bayes and MDL: A simplified story

Shannon’s coding theorem: for a distribution Pr(A), the optimal code assigns −log2 Pr(A) bits to event A

MDL model selection: choose the model M that yields the shortest code for D, i.e., minimize −log2 Pr(D, M)
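As a minimal illustration of the coding theorem (with a made-up four-event distribution), assigning −log2 Pr(A) bits to event A minimizes the expected code length; coding with any other distribution q costs extra bits (Gibbs' inequality):

```python
import math

# A toy event distribution (made-up numbers, purely illustrative).
pr = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Shannon's optimal code lengths: -log2 Pr(event).
optimal = {e: -math.log2(p) for e, p in pr.items()}

# Expected code length under the optimal assignment equals the entropy...
entropy = sum(p * optimal[e] for e, p in pr.items())

# ...and coding with a mismatched distribution q does worse:
# sum_e Pr(e) * (-log2 q(e)) >= entropy (Gibbs' inequality).
q = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
mismatched = sum(p * -math.log2(q[e]) for e, p in pr.items())

print(optimal)              # {'A': 1.0, 'B': 2.0, 'C': 3.0, 'D': 3.0}
print(entropy, mismatched)  # 1.75 bits vs 2.0 bits per event
```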

Page 4: Information theory, MDL and human cognition

A simple equivalence

Minimize: −log2 Pr(D, M)

⇔ Maximize: Pr(D, M) = Pr(M|D) Pr(D)

⇔ Maximize: Pr(M|D), since Pr(D) does not depend on M

Just what Bayes recommends (if choosing a single model)

The equivalence generalizes from parametric M(θ) to ‘full’ Bayes, and in other ways

See Chater (1996) for an application to the simplicity and likelihood principles in perceptual organization
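A tiny sketch of the equivalence, using a made-up joint distribution over three candidate models for a fixed dataset D: the model with the shortest code −log2 Pr(D, M) is exactly the model with the highest posterior Pr(M|D).

```python
import math

# Toy joint distribution Pr(D, M) over three candidate models for a fixed
# dataset D (numbers are invented for illustration).
joint = {"M1": 0.02, "M2": 0.10, "M3": 0.04}   # Pr(D, M)
pr_d = sum(joint.values())                      # Pr(D)

# MDL-style choice: minimise the code length -log2 Pr(D, M).
mdl_choice = min(joint, key=lambda m: -math.log2(joint[m]))

# Bayes-style choice: maximise the posterior Pr(M | D) = Pr(D, M) / Pr(D).
map_choice = max(joint, key=lambda m: joint[m] / pr_d)

assert mdl_choice == map_choice == "M2"   # the same model either way
```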

Page 5: Information theory, MDL and human cognition

Codes or priors? Which comes first? 1. The philosophical issue

Bayesian viewpoint: probabilities as basic

A calculus for degrees of belief (probability theory)

Decision theory (probabilities meet action)

The brain as a probabilistic calculation machine (whether belief propagation, dynamic programming, …)

Page 6: Information theory, MDL and human cognition

Simplicity/MDL viewpoint: Codes as basic

Rissanen: data is all there is; distributions are a fiction

Code structure is primary; code interpretation is secondary

Probabilities defined over events; but “events” are cognitive constructs

Leeuwenberg & Boselie (1988)

Page 7: Information theory, MDL and human cognition

Codes or priors? Which comes first? 2. The practical issue

MDL viewpoint: take codes as basic… …when we know most about representation, e.g., grammars

Bayesian viewpoint: take probabilities as basic… …when we know most about probability, e.g., image statistics

Page 8: Information theory, MDL and human cognition

Bayesian viewpoint (e.g., Geisler et al., 2001)

Good continuation—most lines continue in the same direction in real images

Page 9: Information theory, MDL and human cognition

Simplicity/MDL viewpoint (e.g., Goldsmith, 2001)

In, e.g., linguistics, representations are given by theory

And we can roughly assess the complexity of grammars (by length)

Not so clear how directly to set a prior over all grammars

(though can define a generative process in simple cases…)

S → NP VP
VP → V NP
VP → V NP PP
NP → Det Noun
NP → NP PP

“Binding constraints”

Gzip as a handy approximation!?!
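A rough sketch of the "gzip as a handy approximation" idea: write two toy grammar fragments out as text (the rules below are invented for illustration) and use compressed size as a crude stand-in for description length.

```python
import zlib

# Two toy grammar fragments written out as text (illustrative only).
small_grammar = b"S -> NP VP\nVP -> V NP\nNP -> Det Noun\n"
big_grammar = small_grammar + b"VP -> V NP PP\nNP -> NP PP\nPP -> P NP\n"

def approx_code_length(text: bytes) -> int:
    """Compressed size in bits, as a rough stand-in for description length."""
    return 8 * len(zlib.compress(text, 9))

print(approx_code_length(small_grammar), approx_code_length(big_grammar))
# The larger, more permissive grammar costs more bits to state.
```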

Page 10: Information theory, MDL and human cognition

Simplicity/MDL and Bayes are closely related

Let's now explore the simplicity perspective

Page 11: Information theory, MDL and human cognition

Overview

Bayes and MDL: An overview
Universality and induction
Some puzzles about model fitting
Cognitive science applications

Page 12: Information theory, MDL and human cognition

The most neutral possible prior…

Suppose we want a prior so neutral that it never rules out a model

Possible, if we limit ourselves to computable models

Mixture of all (computable) priors p_i, with weights λ_i that decline fairly fast:

m(x) = Σ_i λ_i p_i(x)

Then this m multiplicatively dominates all (computable) priors

though such neutral priors will mean slow learning

Such m(x) are “universal” priors
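A small sketch of why the mixture dominates, using a family of Bernoulli predictors as a stand-in for "all computable priors" (which obviously cannot be enumerated): since m(x) = Σ_i λ_i p_i(x) ≥ λ_i p_i(x), coding with m costs at most −log2 λ_i extra bits relative to any single component.

```python
import math

# Component "priors": Bernoulli(theta) predictors over binary strings.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
weights = [2 ** -(i + 1) for i in range(len(thetas))]   # fast-declining weights

def p_theta(x: str, theta: float) -> float:
    ones = x.count("1")
    return theta ** ones * (1 - theta) ** (len(x) - ones)

def m(x: str) -> float:
    """Mixture 'universal-style' prior over the toy component family."""
    return sum(w * p_theta(x, t) for w, t in zip(weights, thetas))

x = "1110111011"
for w, t in zip(weights, thetas):
    # Multiplicative dominance: m(x) >= w * p_t(x), so the mixture's code
    # length -log2 m(x) exceeds -log2 p_t(x) by at most -log2 w bits.
    assert m(x) >= w * p_theta(x, t)
    print(t, -math.log2(p_theta(x, t)), -math.log2(m(x)))
```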

Page 13: Information theory, MDL and human cognition

The most neutral possible coding language

Universal programming languages (Java, Matlab, UTMs, etc.)

K(x) = length of the shortest program in Java, Matlab, a UTM, … that generates x (K is uncomputable)

Invariance theorem: for any languages L1, L2, there is a constant c such that, for all x, |K_L1(x) − K_L2(x)| ≤ c

This mathematically justifies talk of K(x), rather than K_Java(x), K_Matlab(x), …

Page 14: Information theory, MDL and human cognition

So does this mean that choice of language doesn’t matter?

Not quite! c can be large

And, for any L1 and any constant c0, there are L2 and x such that |K_L1(x) − K_L2(x)| ≥ c0

The problem of the one-instruction code for the entire data set…

But Kolmogorov complexity can be made concrete…

Page 15: Information theory, MDL and human cognition

Compact Universal Turing machines

210 bits, λ-calculus

272 bits, combinators

Due to John Tromp, 2007

Not much room to hide, here!

Page 16: Information theory, MDL and human cognition

Neutral priors and Kolmogorov complexity

A key result:

K(x) = −log2 m(x) + O(1)

where m is a universal prior

Analogous to Shannon's source coding theorem

And for any computable q: K(x) ≤ −log2 q(x) + O(1), for typical x drawn from q(x)

Any data x that is likely under any sensible probability distribution has low K(x)

Page 17: Information theory, MDL and human cognition

Prediction by simplicity

Find shortest ‘program/explanation’ for current ‘corpus’ (binary string)

Predict using that program

Strictly, use a ‘weighted sum’ of explanations, weighted by brevity
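A toy sketch of brevity-weighted prediction, nothing like the real (uncomputable) Solomonoff predictor: here the "programs" are simply repeating binary patterns up to length 4, each weighted by 2 to the minus its length.

```python
from itertools import product

def predict_next(corpus: str) -> float:
    """Brevity-weighted probability that the next bit is '1'.

    The 'programs' are repeating binary patterns of length 1-4: a toy
    stand-in for shortest programs in a universal language.
    """
    total, vote_one = 0.0, 0.0
    for length in range(1, 5):
        for bits in product("01", repeat=length):
            pattern = "".join(bits)
            generated = pattern * (len(corpus) // length + 2)
            if generated.startswith(corpus):       # pattern explains the corpus
                weight = 2.0 ** -len(pattern)       # shorter pattern => more weight
                total += weight
                if generated[len(corpus)] == "1":
                    vote_one += weight
    return vote_one / total if total else 0.5

print(predict_next("01010101"))   # 0.0: every short explanation says the next bit is '0'
```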

Page 18: Information theory, MDL and human cognition

Prediction is possible (Solomonoff, 1978): summed error has a finite bound

s_j is the summed squared error between the prediction and the true probability on item j

So prediction converges [faster than 1/(n log n)], for corpus size n

Computability assumptions only (no stationarity needed)

Σ_{j=1}^∞ s_j ≤ (ln 2 / 2) · K(μ), where μ is the true (computable) distribution generating the corpus

Page 19: Information theory, MDL and human cognition

Summary so far…

Simplicity/MDL: close and deep connections with Bayes

Defines a universal prior (i.e., based on simplicity)

Can be made “concrete”
General prediction results
A convenient “dual” framework to Bayes, when codes are easier than probabilities

Li, M. & Vitányi, P. (1997). An Introduction to Kolmogorov Complexity and Its Applications (2nd ed.). Berlin: Springer.

Page 20: Information theory, MDL and human cognition

Overview

Bayes and MDL: An overview
Universality and induction
Some puzzles about model fitting
Cognitive science applications

Page 21: Information theory, MDL and human cognition

A problem of model selection? Or: why simplicity won’t go away

Where do priors come from?
Well, priors can be given by hyper-priors
And hyper-priors by hyper-hyper-priors
But it can't go on forever!

And we need priors over models we've only just thought of

And, in some contexts, over models we haven’t yet thought of (!)

Code length in our representation language is a fixed basis
NB: building probabilistic models = augmenting our language with new coding schemes

Page 22: Information theory, MDL and human cognition

The hidden role of simplicity…

Bayesian model selection prefers

y(x) = a_2 x^2 + a_1 x + a_0

not y(x) = a_125 x^125 + a_124 x^124 + … + a_0

But who says how many parameters a function has got??

Page 23: Information theory, MDL and human cognition

A trick…

Convert parameters to constants:

y(x) = a_125 x^125 + a_124 x^124 + … + a_0 (126 parameters)

y(x) = .003 x^125 + .02 x^124 + … + 3x − 24.3 (0 parameters)

And hence the latter is favoured by Bayesian (and all other) model selection criteria

All the virtues of theft over honest toil
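A minimal sketch of why a purely parameter-counting criterion is blind to the trick: a BIC-style penalty (used here only as a stand-in for "model selection criteria"; the data size and the parameter counts are taken from the slide) charges the honest 126-parameter polynomial heavily and the identical "constants" version nothing.

```python
import math

def bic_penalty(n_free_params: int, n_data: int) -> float:
    """Complexity term of a BIC-style criterion: counts only *free* parameters."""
    return 0.5 * n_free_params * math.log(n_data)

n_data = 50                                    # an arbitrary, illustrative sample size
# The same curve, described two ways:
poly_with_free_params = {"free_params": 126}   # y = a125*x**125 + ... + a0
poly_with_constants   = {"free_params": 0}     # y = .003*x**125 + ... - 24.3

print(bic_penalty(poly_with_free_params["free_params"], n_data))  # ~246 nats
print(bic_penalty(poly_with_constants["free_params"], n_data))    # 0 nats
# Identical fits, but the "constants" version pays no parameter penalty:
# the cheat the slide calls "theft over honest toil".
```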

Page 24: Information theory, MDL and human cognition

Zoubin’s problem for ML

ML Gaussian is a delta function on one point

Page 25: Information theory, MDL and human cognition

ML: the Gaussian is a delta function on one point

An impressive fit!

Page 26: Information theory, MDL and human cognition

A related problem for Bayes?

The mixture of delta functions model (!)

Page 27: Information theory, MDL and human cognition

A related problem for Bayes?

The mixture of delta functions model (!)

An even more impressive fit!
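A small sketch of the degeneracy behind these slides (the data and the mixture settings are made up): in a two-component Gaussian mixture, centring one component on a single data point and shrinking its variance drives the log-likelihood to infinity, i.e., towards a delta function.

```python
import math

def gaussian_logpdf(x: float, mu: float, sigma: float) -> float:
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

data = [1.3, 2.1, 2.4, 3.0, 5.2]          # made-up observations

def mixture_loglik(sigma_small: float) -> float:
    """Two-component mixture: one broad Gaussian, one centred exactly on data[0]."""
    broad_mu, broad_sigma, w = 2.8, 1.5, 0.5
    ll = 0.0
    for x in data:
        p = (w * math.exp(gaussian_logpdf(x, broad_mu, broad_sigma))
             + (1 - w) * math.exp(gaussian_logpdf(x, data[0], sigma_small)))
        ll += math.log(p)
    return ll

for s in (1.0, 0.1, 0.01, 0.001):
    print(s, mixture_loglik(s))
# As sigma_small shrinks, the log-likelihood grows without bound:
# an "impressive fit" that is really a pathology of maximum likelihood.
```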

Page 28: Information theory, MDL and human cognition

Should the “cheating” model get a huge boost from this data?

No!
Sense of moral outrage
The model had to be fitted post hoc
It would be different if I'd thought of it before the data arrived (cf. empirical Bayes)

Yes!

But order of data acquisition has no role in Bayes

Confirmation is just the same, whenever I thought of the model
The model gets a spectacular boost; but it is even more spectacularly unlikely in the first place…

Page 29: Information theory, MDL and human cognition

So we need to take care with priors!

y = x: high prior; compact to state

y = .003 x^125 + .02 x^124 + … + 3x − 24: low prior; not compact to state

With a different representation language, we could have the opposite bias

But we start from where we are: our actual representations

We can discover that things are simpler than we thought (i.e., simplicity is not quite so subjective…)
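A crude way to see "compact to state" in a concrete representation language (the long polynomial's coefficients below are placeholders): measure the compressed length of each hypothesis written out as text.

```python
import zlib

simple = b"y = x"
# A fully spelt-out degree-125 polynomial (coefficients are placeholders).
cheat = ("y = " + " + ".join(f"{0.003 + 0.001*i:.3f}*x**{125 - i}"
                             for i in range(125)) + " - 24.3").encode()

def bits(description: bytes) -> int:
    """Crude description length: compressed size in bits."""
    return 8 * len(zlib.compress(description, 9))

print(bits(simple), bits(cheat))
# In this representation language "y = x" is far cheaper to state, so it gets
# the higher prior; a different language could reverse the bias.
```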

Page 30: Information theory, MDL and human cognition

Overview

Bayes and MDL: An overview
Universality and induction
Some puzzles about model fitting
Cognitive science applications

Page 31: Information theory, MDL and human cognition

There are quite a few…

Domain | Principle | References
Perceptual organization | Find the grouping that minimizes code cost | Koffka, 1935; Leeuwenberg, 1971; Attneave & Frost, 1969
Early vision | Efficient coding & transmission | Blakemore, 1990; Barlow, 1974; Srinivasan & Laughlin
Causal reasoning | Find the minimal belief network | Wedelind
Similarity | Similarity between representations measured by the code length between them | Chater & Vitányi, 2003; Hahn, Chater & Richardson, 2003
Categorization | Categorize items to find the shortest code (high-level perceptual organization) | Pothos & Chater, 2002; Feldman, 2000
Memory storage | Shorter codes are easier to store | Chater, 1999
Memory retrieval | Explain interference by cue-trace complexity; a rational foundation for distinctiveness models (SIMPLE) | Brown, Neath & Chater, 2005
Language acquisition | Find the grammar that best explains the child's input | Chomsky, 1955; J. D. Fodor & Crain; Chater, 2004; Chater & Vitányi, 2005

Page 32: Information theory, MDL and human cognition

Here:

Perceptual organization
Language acquisition
Similarity and generalization

Page 33: Information theory, MDL and human cognition

Long tradition of simplicity in perception (Mach, Koffka, Leeuwenberg); e.g., Gestalt laws

[Figure: six '+' elements moving together. Ungrouped coding: 6 × 2 vectors (a position x and a velocity v for each element). Grouped coding: 6 + 1 vectors (six positions plus one shared velocity).]
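The arithmetic behind that caption, under the purely illustrative assumption that every vector costs the same number of bits:

```python
# Rough cost comparison for the display above; b (bits per vector) is arbitrary.
n_elements, b = 6, 16
ungrouped = n_elements * 2 * b             # 6 positions + 6 velocities = 12 vectors
grouped = (n_elements + 1) * b             # 6 positions + 1 shared velocity = 7 vectors
print(ungrouped, grouped)                  # 192 vs 112 bits: the grouped code is shorter
```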

Page 34: Information theory, MDL and human cognition

And language acquisition: where it helps resolve an apparent learnability paradox

Undergeneral grammars predict that good sentences are not allowed

just wait until one turns up

Overgeneral grammars predict that bad sentences are actually ok

Need negative evidence: say a bad sentence, and get corrected

Page 35: Information theory, MDL and human cognition

The logical problem of language acquisition (e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)

Without negative evidence, the learner can never eliminate overgeneral grammars

“Mere” non-occurrence of sentences is not enough…
…because almost all acceptable sentences also never occur

Backed up by formal results (Gold, 1967; though cf. Feldman; Horning et al.)

Argument for innateness?

Page 36: Information theory, MDL and human cognition

An “ideal” learning set-up (cf. ideal observers)

Linguistic environment: positive evidence only; computability
Measures of learning performance: statistical
Learning method: simplicity

Page 37: Information theory, MDL and human cognition

Overgeneralization Theorem (Chater & Vitányi) 

Suppose the learner has probability ε_j of erroneously guessing an ungrammatical jth word

Intuitive explanation: overgeneralization underloads the probabilities of grammatical sentences; small probabilities imply longer code lengths

Σ_{j=1}^∞ ε_j ≤ K(μ) · ln 2, where μ is the true source generating the corpus
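A small simulation of the intuition (the "grammar" here is just a four-sentence toy distribution, and the overgeneral competitor halves each grammatical sentence's probability to make room for ungrammatical ones): the overgeneral model yields a strictly longer code for the corpus.

```python
import math, random

random.seed(0)

# True source: four grammatical sentence types with these probabilities.
true_p = {"s1": 0.4, "s2": 0.3, "s3": 0.2, "s4": 0.1}

# Overgeneral "grammar": also licenses four ungrammatical types, so it must
# spread (here, halve) the probability it gives to the real sentences.
overgeneral_p = {s: p / 2 for s, p in true_p.items()}
overgeneral_p.update({"bad1": 0.125, "bad2": 0.125, "bad3": 0.125, "bad4": 0.125})

corpus = random.choices(list(true_p), weights=list(true_p.values()), k=1000)

def code_length(model: dict) -> float:
    """Bits needed to encode the corpus under a model's distribution."""
    return sum(-math.log2(model[s]) for s in corpus)

print(code_length(true_p), code_length(overgeneral_p))
# The overgeneral grammar pays exactly one extra bit per sentence here:
# absent sentences act as implicit negative evidence against it.
```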

Page 38: Information theory, MDL and human cognition

Absence as implicit negative evidence

Overgeneral grammars predict missing sentences
And their absence is a clue that the grammar is wrong

The method can be “scaled down” to consider the learnability of specific linguistic constructions

Page 39: Information theory, MDL and human cognition

Similarity and categorization

Cognitive dissimilarity: the representational “distortion” required to get from x to y: D_U(x, y) = K(y|x)

Not symmetrical: K(y|x) > K(x|y) when K(y) > K(x)

Deletion is easy…
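A computable, very rough proxy for this asymmetric distance, in the spirit of compression-based distances (the "descriptions" below are made-up text): approximate K(y|x) by the extra compressed length needed for y once x is available.

```python
import zlib

def c(b: bytes) -> int:
    """Compressed length in bytes: a crude, computable stand-in for K."""
    return len(zlib.compress(b, 9))

def d(x: bytes, y: bytes) -> int:
    """Rough proxy for K(y|x): extra cost of coding y once x is available."""
    return c(x + y) - c(x)

horse = b"large four-legged mammal, mane, tail, gallops, domesticated " * 4
pegasus = horse + b"and it has wings and figures in Greek myth " * 4

print(d(pegasus, horse), d(horse, pegasus))
# Going from the richer description to the simpler one is cheaper than the
# reverse: the asymmetry the slides attribute to K(y|x).
```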

Page 40: Information theory, MDL and human cognition

Shepard’s (1987) Universal Law

Generalization (strictly, confusability) is an exponential function of psychological “distance”:

G(a, b) = A e^(−B · D_U(S_a, S_b))

Page 41: Information theory, MDL and human cognition

A derivation

Shepard’s generalization measure, for “typical” items:

G(a, b) = [ Pr(R_a|S_b) Pr(R_b|S_a) / ( Pr(R_a|S_a) Pr(R_b|S_b) ) ]^(1/2)

log2 G(a, b) = 1/2 [ log2 Pr(R_a|S_b) + log2 Pr(R_b|S_a) − log2 Pr(R_a|S_a) − log2 Pr(R_b|S_b) ]

≈ −1/2 [ K(R_a|S_b) + K(R_b|S_a) ] + O(1)

Page 42: Information theory, MDL and human cognition

Assuming items roughly the same complexity

The universal law

log2 G(a, b) ≈ −1/2 [ K(R_a|S_b) + K(R_b|S_a) ] + O(1)

= −D_U(S_a, S_b) + O(1)

Hence G(a, b) = A e^(−B · D_U(S_a, S_b))

Page 43: Information theory, MDL and human cognition

The asymmetry of similarity

What thing is this like?

Page 44: Information theory, MDL and human cognition

And what is this like?

Page 45: Information theory, MDL and human cognition

A heuristic measure of amount of information: Shannon’s guessing game…

1. Pony?  2. Cow?  3. Dog?  …  345. Pegasus ✓

345 guesses!

Page 46: Information theory, MDL and human cognition

Asymmetry of code lengths implies asymmetry of similarity

Horse: guess #345 gets Pegasus; −log2 Pr(#345) is very large (a long code)

Pegasus: guess #2 gets Horse; −log2 Pr(#2) is very small (a short code)

So Pegasus is more like Horse, than Horse is like Pegasus

Many other examples of asymmetry, and many measures (search times, memory confusions…), which seem to fit this pattern

Page 47: Information theory, MDL and human cognition

Treisman & Souther (1985)

A simple array

Page 48: Information theory, MDL and human cognition

A complex array

Page 49: Information theory, MDL and human cognition

Summary

MDL/Kolmogorov complexity: a close relation with Bayes

Basis for a “universal” prior
A variety of applications to cognitive science