TRANSCRIPT
Explorations in language learnability using probabilistic grammars and child-directed speech
Amy Perfors & Josh Tenenbaum, MIT
Terry Regier, U Chicago
Thanks also: the MIT Computational Cognitive Science group, Adam Albright, Jeff Elman, Danny Fox, Ted Gibson, Sharon Goldwater, Mark Johnson, Jay McClelland, Raj Singh, Ken Wexler, Fei Xu, NSF
Everyday inductive leaps
How can people learn so much about the world from such limited data?
– Categorizing objects and predicting their properties
– Causal reasoning
– Using language and learning about linguistic meaning
– Understanding others’ actions, plans, thoughts, goals
– Inferring social structures, conventions, rules, morals
The goal: A general-purpose computational framework for understanding how people make these inductive leaps, and how they can be successful.
• The big questions:
– How can abstract knowledge guide generalization from sparsely observed data?
– What is the form and content of abstract knowledge, across different domains?
– How could abstract knowledge itself be acquired?
• A computational toolkit for addressing these questions:
– Bayesian inference in probabilistic generative models.
– Probabilistic models defined over structured representations: graphs, grammars, schemas, predicate logic, lambda calculus, functional programs.
– Hierarchical probabilistic models, with inference at multiple levels of abstraction and multiple timescales.
The approach: reverse engineering induction
[Figure: a hierarchical generative model for language. “UG”? → Grammar → Phrase structure → Utterance → Speech signal, linked by P(grammar | UG), P(phrase structure | grammar), P(utterance | phrase structure), and P(speech | utterance); the phrase-structure level is illustrated with an example parse built from rules such as S → NP VP and NP → Det Adj Noun RelClause. (c.f. Chater & Manning, TiCS 2006)]
(Han & Zhu, 2006; c.f., Zhu, Yuanhao & Yuille NIPS 06 )
Vision as probabilistic parsing
Scene graph
Surface configuration
Image
Learning about categories, labels, and hidden properties
(Tenenbaum, Griffiths, Kemp, Niyogi, et al.)
[Figure: Form → Structure → Data; here the structure is a tree with species at the leaf nodes (mouse, squirrel, chimp, gorilla) and superordinate categories (rodent, primate, animal) at internal nodes.]
Learning causal theories
[Figure: Abstract Principles → Causal Structure → Event Data]
Behaviors can cause Diseases. Diseases can cause Symptoms.
Magnets attract Metal. Every Magnet has a North Pole and a South Pole. Opposite magnetic poles attract; like magnetic poles repel.
[Figure: magnets with N and S poles and objects with + and − charges, illustrating the causal structures and event data generated by these abstract principles.]
Goal-directed action (production and comprehension)
(Wolpert, Doya and Kawato, 2003)
[Figure (the language hierarchy again): UG? → Grammar → Phrase structure → Utterance → Speech signal, with P(grammar | UG), P(phrase structure | grammar), P(utterance | phrase structure), P(speech | utterance). (c.f. Chater and Manning, 2006)]
The “Poverty of the Stimulus” argument
• The generic form:
– Children acquiring language infer the correct forms of complex syntactic constructions for which they have little or no direct evidence.
– They avoid simple but incorrect generalizations that would be consistent with their data, preferring much subtler rules that just happen to be correct.
– How do they do this? They must have some inductive bias – some abstract knowledge about how language works – leading them to prefer the correct hypotheses even in the absence of direct supporting data. That abstract knowledge is UG.
A “Poverty of the Stimulus” argument
Data
Simple declarative: The girl is happy. They are eating.
Simple interrogative: Is the girl happy? Are they eating?
Hypotheses
H1. Linear: move the first auxiliary in the sentence to the beginning.
H2. Hierarchical: move the auxiliary in the main clause to the beginning.
Generalization
Complex declarative: The girl who is sleeping is happy.
Complex interrogative: Is the girl who is sleeping happy? [via H2] *Is the girl who sleeping is happy? [via H1]
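To make the two hypotheses concrete, here is a minimal Python sketch (mine, not from the talk) that applies each rule to the complex declarative; the tokenization and the relative-clause span are hand-coded assumptions standing in for a real parse, and only H2 yields the grammatical interrogative.

```python
# Minimal sketch (not from the slides): applying the two candidate rules to the
# complex declarative. The token list and the span of the embedded relative
# clause are hand-coded stand-ins for a real parse.
declarative = ["the", "girl", "who", "is", "sleeping", "is", "happy"]
AUX = {"is", "are"}
rel_clause_span = range(2, 5)   # "who is sleeping" (assumed from a parse)

def h1_linear(tokens):
    """H1: move the first auxiliary in the sentence to the beginning."""
    i = next(k for k, w in enumerate(tokens) if w in AUX)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

def h2_hierarchical(tokens, embedded):
    """H2: move the main-clause auxiliary (first aux outside the embedded clause)."""
    i = next(k for k, w in enumerate(tokens) if w in AUX and k not in embedded)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

print(" ".join(h1_linear(declarative)))
# -> is the girl who sleeping is happy   (*ungrammatical)
print(" ".join(h2_hierarchical(declarative, rel_clause_span)))
# -> is the girl who is sleeping happy   (grammatical)
```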
Induction of specific grammatical rules must be guided by some abstract constraints to prefer certain hypotheses over others, e.g., syntactic rules are defined over hierarchical phrase structures rather than the linear order of words.
=> Inductive constraint: hierarchical phrase structure (e.g., aux-fronting in complex interrogatives).
Our argument
• The Question: What form do constraints take and how do they arise? (When) must they be innately specified as part of the initial state of the language faculty?
• The Claim: It is possible that, given the data of child-directed speech and certain innate domain-general capacities, an unbiased ideal learner can recognize the hierarchical phrase structure of language; perhaps this inductive constraint need not be innately specified in the language faculty.
• Assumed domain-general capacities:
– Can represent grammars of various types: hierarchical, linear, …
– Can evaluate the Bayesian probability of a grammar given a corpus.
How?
• By inferring that a hierarchical phrase-structure grammar offers the best tradeoff between simplicity and fit to natural language data.
• Evaluating candidate grammars based on simplicity is an old idea…
– E.g., Chomsky, MMH, 1951: “As a first approximation to the notion of simplicity, we will here consider shortness of grammar as a measure of simplicity, and will use such notations as will permit similar statements to be coalesced…. Given the fixed notation, the criteria of simplicity governing the ordering of statements are as follows: that the shorter grammar is the simpler, and that among equally short grammars, the simplest is that in which the average length of derivation of sentences is least.”
– LSLT: Applies this idea to a multi-level generative system.
• A long history of related formal analyses.
– Gold, Horning, Angluin, Berwick, Muggleton, Chater & Vitanyi, …
• In contrast to our work, previous work…
… often used simplicity metrics that were either arbitrary or not computable. Bayes has several advantages:
• Gives a rational, objective way to trade off simplicity and data fit.
• Prescribes ideal inferences from any amount of data, not just the infinite limit.
• Naturally handles ambiguity, noise, missing data.
… typically considered highly simplified languages or an idealized corpus: infinite data, with all grammatical sentence types observed eventually and empirical frequencies given by the true grammar.
• The child’s corpus is very different! A small finite sample of sentence types from a very complex language, with a distribution that might depend on many other factors: semantics, pragmatics, performance, etc.
… focused on theorems. Our work is mostly based on empirical exploration.
Ideal learnability analyses
The landscape of learnability analyses: Can X be learned from data?
– ideal learner, ideal data
– realistic learner, ideal data
– ideal learner, realistic data (our focus here)
– realistic learner, realistic data
The Bayesian model
T: type of grammar (context-free, regular, flat, 1-state), with an unbiased (uniform) prior over types
G: specific grammar
D: data
“prior” P(G | T): simplicity
“likelihood” P(D | G, T): fit to data
Bayesian learning: trading fit vs. simplicity
[Figure: data D shown as points, candidate hypotheses G shown as regions covering them: T = 1 region, T = 2 regions, T = 13 regions.]
Fit: poor, good, best; likelihood p(D | G, T): low, high, highest
Simplicity: best, good, poor; prior p(G | T): highest, high, low
c.f. Subset principle
Balance between fit and simplicity should be sensitive to the amount of data observed…
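A toy illustration (with invented numbers, not from the talk) of why the balance shifts with data: the simplicity penalty in the prior is paid once, while the per-observation likelihood advantage of a tighter-fitting hypothesis accumulates with every data point.

```python
# Toy sketch with invented numbers (not the model's actual values): each hypothesis
# has a one-time complexity cost (-log prior) and a per-observation cost
# (-log likelihood per data point). Tighter hypotheses cost more up front
# but less per observation, so they win once enough data has been seen.
hypotheses = {
    "1 region":   {"neg_log_prior": 2.0,  "neg_log_lik_per_point": 3.0},
    "2 regions":  {"neg_log_prior": 5.0,  "neg_log_lik_per_point": 2.5},
    "13 regions": {"neg_log_prior": 30.0, "neg_log_lik_per_point": 2.0},
}

def neg_log_posterior(h, n_points):
    """-log posterior up to a constant: prior cost plus n times the per-point cost."""
    return h["neg_log_prior"] + n_points * h["neg_log_lik_per_point"]

for n in (3, 30, 300):
    best = min(hypotheses, key=lambda name: neg_log_posterior(hypotheses[name], n))
    print(f"n = {n:3d} data points -> best hypothesis: {best}")
```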
The prior p(G | T): measuring the simplicity of a grammar
• A probabilistic grammar for grammars (c.f., Horning):
n = # of nonterminals; P_k = # of productions expanding nonterminal k; Θ_k = probabilities for the expansions of nonterminal k; N_i = # of symbols in production i; V = vocabulary size
• Grammars with more rules and more non-terminals will have lower prior probability.
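The prior formula itself is not recoverable from this transcript; the sketch below (an assumption in the same spirit, not the paper's actual prior) charges a grammar for each nonterminal, each production, and each symbol in each production drawn from a vocabulary of size V, so grammars with more rules and more non-terminals receive lower prior probability.

```python
import math

# Hedged sketch only: a description-length-style -log prior over grammars.
# A grammar is a dict mapping each nonterminal to its list of productions,
# where each production is a list of symbols (terminals or nonterminals).
def neg_log_prior(grammar, vocab_size, geom_p=0.5):
    score = 0.0
    for nonterminal, productions in grammar.items():
        score += -math.log(geom_p)             # cost of positing this nonterminal
        for production in productions:
            score += -math.log(geom_p)         # cost of positing this production
            for _symbol in production:
                score += math.log(vocab_size)  # each symbol: -log(1/V)
    return score

tiny_cfg = {
    "S":  [["NP", "VP"]],
    "NP": [["det", "n"], ["pro"]],
    "VP": [["aux", "vi"], ["v", "NP"]],
}
print(neg_log_prior(tiny_cfg, vocab_size=30))  # more rules / symbols => higher -log prior
```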
The likelihood p(D | G, T): measuring fit of a grammar
• Probability of the corpus being generated from the grammar:
p(D | G, T) = ∏_i p(sentence_i | G, T) = ∏_i ∑_j p(parse_j of sentence_i | G, T)
• Grammars that assign long derivations to sentences will be less probable.
Ex: pro aux det n. Probability of parse: 0.5 * 0.25 * 1.0 * 0.25 * 0.5 ≈ 0.016
• Grammars that generate sentences not observed in the corpus will be less probable, because they waste probability mass. (“indirect negative evidence”)
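A minimal sketch of the fit computation (assuming the parses of each sentence have already been enumerated by a parser, which is not shown): a parse's probability is the product of its production probabilities, as in the 0.5 * 0.25 * 1.0 * 0.25 * 0.5 ≈ 0.016 example above, and the corpus log-likelihood sums the log of each sentence's total parse probability, weighted by its token count.

```python
import math

# Minimal sketch; parse enumeration is assumed to come from a parser.
def parse_probability(rule_probs):
    """Probability of one parse: the product of the production probabilities it uses."""
    p = 1.0
    for rule_prob in rule_probs:
        p *= rule_prob
    return p

def corpus_log_likelihood(parsed_corpus):
    """parsed_corpus: list of (token_count, parses) pairs, one per sentence type,
    where parses is a list of production-probability lists (one list per parse)."""
    total = 0.0
    for count, parses in parsed_corpus:
        p_sentence = sum(parse_probability(rp) for rp in parses)   # sum over parses
        total += count * math.log(p_sentence)                      # product over sentences
    return total

# The slide's example: one parse of "pro aux det n".
print(parse_probability([0.5, 0.25, 1.0, 0.25, 0.5]))   # 0.015625, i.e. ~0.016
```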
Different grammar types
Flat: the rules are simply a list of each sentence (each sentence type is its own rule).
1-state: anything is accepted (a unigram model over categories).
Regular (linear): rules of the form NT → t NT and NT → t.
Context-free (hierarchical): rules of the form NT → NT NT, NT → t NT, NT → NT, and NT → t.
The specific grammars compared:
• FLAT: a list of each sentence; 2336 rules, 0 non-terminals. (Most complex, exact fit.)
• 1-STATE: any sentence accepted (unigram prob. model); 25 rules, 0 non-terminals. (Simplest, poorest fit.)
• Linear (regular) grammars, derived from the CFG:
– REG-B: broadest; 117 rules, 10 non-terminals.
– REG-M: mid-level; 169 rules, 13 non-terminals.
– REG-N: narrowest; 389 rules, 85 non-terminals.
• Hierarchical grammars:
– CFG-S: designed to be as linguistically plausible (and as compact) as possible; 69 rules, 14 non-terminals. (Simpler, looser fit to data.)
– CFG-L: derived from CFG-S, with additional productions that put less probability mass on recursive productions (and hence overgenerate less); 120 rules, 14 non-terminals. (More complex, tighter fit to data.)
+ Local search refinements, automatic grammar construction based on machine learning methods (Goldwater & Griffiths, 2007)
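To make the contrast in rule shapes concrete, here is a small illustrative sketch of how each grammar type can be written down as data; the particular rules are invented for illustration and are not the grammars used in the analysis.

```python
# Illustrative rule shapes only; these are not the actual grammars from the analysis.

# FLAT: one rule per observed sentence type (S expands directly to a category string).
flat = {"S": [["n", "aux", "vi"], ["aux", "n", "vi"], ["pro", "aux", "det", "n"]]}

# 1-STATE: any string of categories is accepted (a unigram model over categories).
one_state = {"S": [["n", "S"], ["aux", "S"], ["vi", "S"], ["det", "S"], ["pro", "S"], []]}

# REGULAR (linear): every production is NT -> terminal NT or NT -> terminal.
regular = {"S": [["n", "A"]], "A": [["aux", "B"]], "B": [["vi"]]}

# CONTEXT-FREE (hierarchical): productions may expand into other nonterminals,
# e.g. NT -> NT NT, NT -> terminal NT, NT -> NT, NT -> terminal.
cfg = {"S": [["NP", "VP"]], "NP": [["n"], ["det", "n"]], "VP": [["aux", "vi"]]}
```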
Data
• Child-directed speech (CHILDES database, Adam, Brown corpus, age range 2;3 to 5;2)
• Each word replaced by its syntactic category, e.g.
det adj n v prop prep adj n (the baby bear discovers Goldilocks in his bed)
wh aux det n part (what is the monkey eating?)
pro aux det n comp v det n (this is the man who wrote the book)
aux pro det adj n (is that a green car?)
n comp aux adj aux vi (eagles that are alive do fly)
• Final data comprise 2336 individual sentence types (corresponding to 21671 sentence tokens).
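A hedged sketch of the preprocessing step: each word is replaced by its syntactic category via a lookup table (the tiny dictionary below is a hypothetical stand-in for the real tag set), and the corpus is then collapsed into sentence types with token counts.

```python
from collections import Counter

# Hypothetical mini-lexicon standing in for the real word-to-category mapping.
CATEGORY = {"is": "aux", "that": "pro", "a": "det", "green": "adj", "car": "n",
            "what": "wh", "the": "det", "monkey": "n", "eating": "part"}

def to_categories(utterance):
    """Replace each word with its syntactic category."""
    return " ".join(CATEGORY[word] for word in utterance.lower().split())

utterances = ["is that a green car", "what is the monkey eating", "is that a green car"]
type_counts = Counter(to_categories(u) for u in utterances)

print(type_counts)
# Counter({'aux pro det adj n': 2, 'wh aux det n part': 1})
print(len(type_counts), "sentence types,", sum(type_counts.values()), "sentence tokens")
```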
Results: Full corpus
(Note: these are -log probabilities, so lower = better!)
[Figure: bar charts of the prior p(G | T), likelihood p(D | G, T), and posterior p(G, T | D) for each grammar (CFG-S, CFG-L; REG-B, REG-M, REG-N), with grammars arranged from simpler to tighter-fitting.]
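Since the figure reports -log probabilities, comparing grammars amounts to adding the -log prior and -log likelihood for each candidate and preferring the smallest total; the sketch below uses qualitative placeholder scores (not the values from the figure) just to show the bookkeeping.

```python
# Placeholder scores only, chosen to mirror the qualitative trade-off described
# above (FLAT: most complex / exact fit; 1-STATE: simplest / poorest fit);
# they are NOT the values from the figure.
neg_log_scores = {
    # grammar: (-log prior, -log likelihood)
    "FLAT":    (60000.0,  9000.0),
    "1-STATE": (  300.0, 30000.0),
    "REG-M":   ( 4000.0, 16000.0),
    "CFG-S":   ( 2500.0, 14000.0),
}

# -log posterior (up to a constant) = -log prior + -log likelihood; lower = better.
neg_log_posterior = {g: prior + lik for g, (prior, lik) in neg_log_scores.items()}
for grammar, score in sorted(neg_log_posterior.items(), key=lambda kv: kv[1]):
    print(f"{grammar:8s} {score:10.1f}")
```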
Results: Generalization
How well does each grammar predict unseen sentence forms? E.g., complex aux-fronted interrogatives:
[Table: for each sentence type, whether it appears in the corpus and the probability assigned to it by FLAT, REG-N, REG-M, REG-B, 1-STATE, CFG-S, and CFG-L.]
– Simple declarative: Eagles do fly. (n aux vi)
– Simple interrogative: Do eagles fly? (aux n vi)
– Complex declarative: Eagles that are alive do fly. (n comp aux adj aux vi)
– Complex interrogative: Do eagles that are alive fly? (aux n comp aux adj vi)
– Complex interrogative (ungrammatical): *Are eagles that alive do fly? (aux n comp adj aux vi)
Results: First file (90 mins)
(Note: these are -log probabilities, so lower = better!)
[Figure: the same prior p(G | T), likelihood p(D | G, T), and posterior p(G, T | D) comparison (CFG-S, CFG-L; REG-B, REG-M, REG-N), computed from only the first file (90 minutes) of the corpus.]
Conclusions: poverty of the stimulus
• Contemporary probabilistic modeling provides new tools for formalizing and analyzing stimulus poverty arguments.
• Suggests an alternative to standard linguistic nativism
– Hierarchical phrase structure is a crucial constraint on syntactic acquisition, but perhaps need not be specified innately in the language faculty. It can be inferred from child-directed speech by an ideal learner equipped with innate domain-general capacities to represent grammars of various types and to perform Bayesian inference.
• Many open questions
– How close have we come to finding the best grammars of each type?
– How would these results extend to richer representations of syntax, or to jointly learning multiple levels of linguistic structure?
– What are the prospects for computationally or psychologically realistic learning algorithms that might approximate this ideal learner?
– What is the actual source of hierarchical structure in syntax?
Conclusions: more general
• Bayesian inference over hierarchies of structured representations provides a way to study core questions of human language in a domain-general framework:
– What is the content and form of abstract knowledge?
– How can abstract knowledge guide generalization from sparse data?
– How can abstract knowledge be acquired? What must be built in?
• A way to move beyond traditional dichotomies.
– How can structured knowledge be acquired by statistical learning?
– How can domain-general learning mechanisms acquire domain-specific inductive constraints?
• A different way to think about cognitive development.
– Powerful abstractions can be inferred “top down”, from surprisingly little data, together with learning more concrete knowledge.
– Very different from the traditional empiricist or nativist views of abstraction. Worth pursuing more generally…
Abstract knowledge in cognitive development
– Word learning: whole object bias, taxonomic principle (Markman); shape bias (Smith)
– Causal reasoning: causal schemata (Kelley)
– Folk physics: objects are unified, persistent (Spelke)
– Number: counting principles (Gelman)
– Folk biology: principles of taxonomic rank (Atran)
– Folk psychology: principle of rationality (Gergely)
– Ontology: M-constraint on predicability (Keil)
– Syntax: UG (Chomsky)
– Phonology: faithfulness, markedness constraints (Prince, Smolensky)