TRANSCRIPT
Explorations in language learnability using probabilistic grammars and child-directed speech
Amy Perfors & Josh Tenenbaum, MIT
Terry Regier, U Chicago
Thanks also: the MIT Computational Cognitive Science group, Adam Albright, Jeff Elman, Danny Fox, Ted Gibson, Sharon Goldwater, Mark Johnson, Jay McClelland, Raj Singh, Ken Wexler, Fei Xu, NSF
Everyday inductive leaps
How can people learn so much about the world from such limited data?
– Categorizing objects and predicting their properties
– Causal reasoning
– Using language and learning about linguistic meaning
– Understanding others’ actions, plans, thoughts, goals
– Inferring social structures, conventions, rules, morals
The goal: A general-purpose computational framework for understanding how people make these inductive leaps, and how they can be successful.
• The big questions:
– How can abstract knowledge guide generalization from sparsely observed data?
– What is the form and content of abstract knowledge, across different domains?
– How could abstract knowledge itself be acquired?
• A computational toolkit for addressing these questions:
– Bayesian inference in probabilistic generative models.
– Probabilistic models defined over structured representations: graphs, grammars, schemas, predicate logic, lambda calculus, functional programs.
– Hierarchical probabilistic models, with inference at multiple levels of abstraction and multiple timescales.
The approach: reverse engineering induction
[Figure: a hierarchical generative model for language. “UG”? → Grammar → Phrase structure → Utterance → Speech signal, linked by P(grammar | UG), P(phrase structure | grammar), P(utterance | phrase structure), and P(speech | utterance); the phrase-structure level is illustrated with an example parse built from rules such as S → NP VP and NP → Det Adj Noun RelClause. (c.f. Chater & Manning, TiCS 2006)]
(Han & Zhu, 2006; c.f., Zhu, Yuanhao & Yuille NIPS 06 )
Vision as probabilistic parsing
Scene graph
Surface configuration
Image
Learning about categories, labels, and hidden properties
(Tenenbaum, Griffiths, Kemp, Niyogi, et al.)
[Figure: Form → Structure → Data; here the structure is a tree with species at the leaf nodes (mouse, squirrel, chimp, gorilla) and superordinate categories (rodent, primate, animal) at internal nodes.]
Learning causal theories
[Figure: Abstract Principles → Causal Structure → Event Data]
Behaviors can cause Diseases. Diseases can cause Symptoms.
Magnets attract Metal. Every Magnet has a North Pole and a South Pole. Opposite magnetic poles attract; like magnetic poles repel.
[Figure: magnets with N and S poles and objects with + and − charges, illustrating the causal structures and event data generated by these abstract principles.]
Goal-directed action (production and comprehension)
(Wolpert, Doya and Kawato, 2003)
[Figure (the language hierarchy again): UG? → Grammar → Phrase structure → Utterance → Speech signal, with P(grammar | UG), P(phrase structure | grammar), P(utterance | phrase structure), P(speech | utterance). (c.f. Chater and Manning, 2006)]
The “Poverty of the Stimulus” argument
• The generic form:
– Children acquiring language infer the correct forms of complex syntactic constructions for which they have little or no direct evidence.
– They avoid simple but incorrect generalizations that would be consistent with their data, preferring much subtler rules that just happen to be correct.
– How do they do this? They must have some inductive bias – some abstract knowledge about how language works – leading them to prefer the correct hypotheses even in the absence of direct supporting data. That abstract knowledge is UG.
A “Poverty of the Stimulus” argument
Data
Simple declarative: The girl is happy. They are eating.
Simple interrogative: Is the girl happy? Are they eating?
Hypotheses
H1. Linear: move the first auxiliary in the sentence to the beginning.
H2. Hierarchical: move the auxiliary in the main clause to the beginning.
Generalization
Complex declarative: The girl who is sleeping is happy.
Complex interrogative: Is the girl who is sleeping happy? [via H2] *Is the girl who sleeping is happy? [via H1]
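To make the two hypotheses concrete, here is a minimal Python sketch (mine, not from the talk) that applies each rule to the complex declarative; the tokenization and the relative-clause span are hand-coded assumptions standing in for a real parse, and only H2 yields the grammatical interrogative.

```python
# Minimal sketch (not from the slides): applying the two candidate rules to the
# complex declarative. The token list and the span of the embedded relative
# clause are hand-coded stand-ins for a real parse.
declarative = ["the", "girl", "who", "is", "sleeping", "is", "happy"]
AUX = {"is", "are"}
rel_clause_span = range(2, 5)   # "who is sleeping" (assumed from a parse)

def h1_linear(tokens):
    """H1: move the first auxiliary in the sentence to the beginning."""
    i = next(k for k, w in enumerate(tokens) if w in AUX)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

def h2_hierarchical(tokens, embedded):
    """H2: move the main-clause auxiliary (first aux outside the embedded clause)."""
    i = next(k for k, w in enumerate(tokens) if w in AUX and k not in embedded)
    return [tokens[i]] + tokens[:i] + tokens[i + 1:]

print(" ".join(h1_linear(declarative)))
# -> is the girl who sleeping is happy   (*ungrammatical)
print(" ".join(h2_hierarchical(declarative, rel_clause_span)))
# -> is the girl who is sleeping happy   (grammatical)
```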
Induction of specific grammatical rules must be guided by some abstract constraints to prefer certain hypotheses over others, e.g., syntactic rules are defined over hierarchical phrase structures rather than the linear order of words.
=> Inductive constraint: hierarchical phrase structure (e.g., aux-fronting in complex interrogatives).
Our argument
• The Question: What form do constraints take and how do they arise? (When) must they be innately specified as part of the initial state of the language faculty?
• The Claim: It is possible that, given the data of child-directed speech and certain innate domain-general capacities, an unbiased ideal learner can recognize the hierarchical phrase structure of language; perhaps this inductive constraint need not be innately specified in the language faculty.
• Assumed domain-general capacities:
– Can represent grammars of various types: hierarchical, linear, …
– Can evaluate the Bayesian probability of a grammar given a corpus.
How?
• By inferring that a hierarchical phrase-structure grammar offers the best tradeoff between simplicity and fit to natural language data.
• Evaluating candidate grammars based on simplicity is an old idea…
– E.g., Chomsky, MMH, 1951: “As a first approximation to the notion of simplicity, we will here consider shortness of grammar as a measure of simplicity, and will use such notations as will permit similar statements to be coalesced…. Given the fixed notation, the criteria of simplicity governing the ordering of statements are as follows: that the shorter grammar is the simpler, and that among equally short grammars, the simplest is that in which the average length of derivation of sentences is least.”
– LSLT: Applies this idea to a multi-level generative system.
• A long history of related formal analyses.
– Gold, Horning, Angluin, Berwick, Muggleton, Chater & Vitanyi, …
• In contrast to our work, previous work…
… often used simplicity metrics that were either arbitrary or not computable. Bayes has several advantages:
• Gives a rational, objective way to trade off simplicity and data fit.
• Prescribes ideal inferences from any amount of data, not just the infinite limit.
• Naturally handles ambiguity, noise, missing data.
… typically considered highly simplified languages or an idealized corpus: infinite data, with all grammatical sentence types observed eventually and empirical frequencies given by the true grammar.
• The child’s corpus is very different! A small finite sample of sentence types from a very complex language, with a distribution that might depend on many other factors: semantics, pragmatics, performance, etc.
… focused on theorems. Our work is mostly based on empirical exploration.
Ideal learnability analyses
The landscape of learnability analyses: Can X be learned from data?
– ideal learner, ideal data
– realistic learner, ideal data
– ideal learner, realistic data (our focus here)
– realistic learner, realistic data
The Bayesian model
T: type of grammar (context-free, regular, flat, 1-state), with an unbiased (uniform) prior over types
G: specific grammar
D: data
“prior” P(G | T): simplicity
“likelihood” P(D | G, T): fit to data
Bayesian learning: trading fit vs. simplicity
[Figure: data D shown as points, candidate hypotheses G shown as regions covering them: T = 1 region, T = 2 regions, T = 13 regions.]
Fit: poor, good, best; likelihood p(D | G, T): low, high, highest
Simplicity: best, good, poor; prior p(G | T): highest, high, low
c.f. Subset principle
Balance between fit and simplicity should be sensitive to the amount of data observed…
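A toy illustration (with invented numbers, not from the talk) of why the balance shifts with data: the simplicity penalty in the prior is paid once, while the per-observation likelihood advantage of a tighter-fitting hypothesis accumulates with every data point.

```python
# Toy sketch with invented numbers (not the model's actual values): each hypothesis
# has a one-time complexity cost (-log prior) and a per-observation cost
# (-log likelihood per data point). Tighter hypotheses cost more up front
# but less per observation, so they win once enough data has been seen.
hypotheses = {
    "1 region":   {"neg_log_prior": 2.0,  "neg_log_lik_per_point": 3.0},
    "2 regions":  {"neg_log_prior": 5.0,  "neg_log_lik_per_point": 2.5},
    "13 regions": {"neg_log_prior": 30.0, "neg_log_lik_per_point": 2.0},
}

def neg_log_posterior(h, n_points):
    """-log posterior up to a constant: prior cost plus n times the per-point cost."""
    return h["neg_log_prior"] + n_points * h["neg_log_lik_per_point"]

for n in (3, 30, 300):
    best = min(hypotheses, key=lambda name: neg_log_posterior(hypotheses[name], n))
    print(f"n = {n:3d} data points -> best hypothesis: {best}")
```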
The prior p(G | T): measuring the simplicity of a grammar
• A probabilistic grammar for grammars (c.f., Horning):
n = # of nonterminals; P_k = # of productions expanding nonterminal k; Θ_k = probabilities for the expansions of nonterminal k; N_i = # of symbols in production i; V = vocabulary size
• Grammars with more rules and more non-terminals will have lower prior probability.
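The prior formula itself is not recoverable from this transcript; the sketch below (an assumption in the same spirit, not the paper's actual prior) charges a grammar for each nonterminal, each production, and each symbol in each production drawn from a vocabulary of size V, so grammars with more rules and more non-terminals receive lower prior probability.

```python
import math

# Hedged sketch only: a description-length-style -log prior over grammars.
# A grammar is a dict mapping each nonterminal to its list of productions,
# where each production is a list of symbols (terminals or nonterminals).
def neg_log_prior(grammar, vocab_size, geom_p=0.5):
    score = 0.0
    for nonterminal, productions in grammar.items():
        score += -math.log(geom_p)             # cost of positing this nonterminal
        for production in productions:
            score += -math.log(geom_p)         # cost of positing this production
            for _symbol in production:
                score += math.log(vocab_size)  # each symbol: -log(1/V)
    return score

tiny_cfg = {
    "S":  [["NP", "VP"]],
    "NP": [["det", "n"], ["pro"]],
    "VP": [["aux", "vi"], ["v", "NP"]],
}
print(neg_log_prior(tiny_cfg, vocab_size=30))  # more rules / symbols => higher -log prior
```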
The likelihood p(D | G, T): measuring fit of a grammar
• Probability of the corpus being generated from the grammar:
p(D | G, T) = ∏_i p(sentence_i | G, T) = ∏_i ∑_j p(parse_j of sentence_i | G, T)
• Grammars that assign long derivations to sentences will be less probable.
Ex: pro aux det n. Probability of parse: 0.5 * 0.25 * 1.0 * 0.25 * 0.5 ≈ 0.016
• Grammars that generate sentences not observed in the corpus will be less probable, because they waste probability mass. (“indirect negative evidence”)
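A minimal sketch of the fit computation (assuming the parses of each sentence have already been enumerated by a parser, which is not shown): a parse's probability is the product of its production probabilities, as in the 0.5 * 0.25 * 1.0 * 0.25 * 0.5 ≈ 0.016 example above, and the corpus log-likelihood sums the log of each sentence's total parse probability, weighted by its token count.

```python
import math

# Minimal sketch; parse enumeration is assumed to come from a parser.
def parse_probability(rule_probs):
    """Probability of one parse: the product of the production probabilities it uses."""
    p = 1.0
    for rule_prob in rule_probs:
        p *= rule_prob
    return p

def corpus_log_likelihood(parsed_corpus):
    """parsed_corpus: list of (token_count, parses) pairs, one per sentence type,
    where parses is a list of production-probability lists (one list per parse)."""
    total = 0.0
    for count, parses in parsed_corpus:
        p_sentence = sum(parse_probability(rp) for rp in parses)   # sum over parses
        total += count * math.log(p_sentence)                      # product over sentences
    return total

# The slide's example: one parse of "pro aux det n".
print(parse_probability([0.5, 0.25, 1.0, 0.25, 0.5]))   # 0.015625, i.e. ~0.016
```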
Different grammar types
Flat: the rules are simply a list of each sentence (each sentence type is its own rule).
1-state: anything is accepted (a unigram model over categories).
Regular (linear): rules of the form NT → t NT and NT → t.
Context-free (hierarchical): rules of the form NT → NT NT, NT → t NT, NT → NT, and NT → t.
The specific grammars compared:
• FLAT: a list of each sentence; 2336 rules, 0 non-terminals. (Most complex, exact fit.)
• 1-STATE: any sentence accepted (unigram prob. model); 25 rules, 0 non-terminals. (Simplest, poorest fit.)
• Linear (regular) grammars, derived from the CFG:
– REG-B: broadest; 117 rules, 10 non-terminals.
– REG-M: mid-level; 169 rules, 13 non-terminals.
– REG-N: narrowest; 389 rules, 85 non-terminals.
• Hierarchical grammars:
– CFG-S: designed to be as linguistically plausible (and as compact) as possible; 69 rules, 14 non-terminals. (Simpler, looser fit to data.)
– CFG-L: derived from CFG-S, with additional productions that put less probability mass on recursive productions (and hence overgenerate less); 120 rules, 14 non-terminals. (More complex, tighter fit to data.)
+ Local search refinements, automatic grammar construction based on machine learning methods (Goldwater & Griffiths, 2007)
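To make the contrast in rule shapes concrete, here is a small illustrative sketch of how each grammar type can be written down as data; the particular rules are invented for illustration and are not the grammars used in the analysis.

```python
# Illustrative rule shapes only; these are not the actual grammars from the analysis.

# FLAT: one rule per observed sentence type (S expands directly to a category string).
flat = {"S": [["n", "aux", "vi"], ["aux", "n", "vi"], ["pro", "aux", "det", "n"]]}

# 1-STATE: any string of categories is accepted (a unigram model over categories).
one_state = {"S": [["n", "S"], ["aux", "S"], ["vi", "S"], ["det", "S"], ["pro", "S"], []]}

# REGULAR (linear): every production is NT -> terminal NT or NT -> terminal.
regular = {"S": [["n", "A"]], "A": [["aux", "B"]], "B": [["vi"]]}

# CONTEXT-FREE (hierarchical): productions may expand into other nonterminals,
# e.g. NT -> NT NT, NT -> terminal NT, NT -> NT, NT -> terminal.
cfg = {"S": [["NP", "VP"]], "NP": [["n"], ["det", "n"]], "VP": [["aux", "vi"]]}
```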
Data
• Child-directed speech (CHILDES database, Adam, Brown corpus, age range 2;3 to 5;2)
• Each word replaced by its syntactic category, e.g.
det adj n v prop prep adj n (the baby bear discovers Goldilocks in his bed)
wh aux det n part (what is the monkey eating?)
pro aux det n comp v det n (this is the man who wrote the book)
aux pro det adj n (is that a green car?)
n comp aux adj aux vi (eagles that are alive do fly)
• Final data comprise 2336 individual sentence types (corresponding to 21671 sentence tokens).
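A hedged sketch of the preprocessing step: each word is replaced by its syntactic category via a lookup table (the tiny dictionary below is a hypothetical stand-in for the real tag set), and the corpus is then collapsed into sentence types with token counts.

```python
from collections import Counter

# Hypothetical mini-lexicon standing in for the real word-to-category mapping.
CATEGORY = {"is": "aux", "that": "pro", "a": "det", "green": "adj", "car": "n",
            "what": "wh", "the": "det", "monkey": "n", "eating": "part"}

def to_categories(utterance):
    """Replace each word with its syntactic category."""
    return " ".join(CATEGORY[word] for word in utterance.lower().split())

utterances = ["is that a green car", "what is the monkey eating", "is that a green car"]
type_counts = Counter(to_categories(u) for u in utterances)

print(type_counts)
# Counter({'aux pro det adj n': 2, 'wh aux det n part': 1})
print(len(type_counts), "sentence types,", sum(type_counts.values()), "sentence tokens")
```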
Results: Full corpus
(Note: these are -log probabilities, so lower = better!)
[Figure: bar charts of the prior p(G | T), likelihood p(D | G, T), and posterior p(G, T | D) for each grammar (CFG-S, CFG-L; REG-B, REG-M, REG-N), with grammars arranged from simpler to tighter-fitting.]
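Since the figure reports -log probabilities, comparing grammars amounts to adding the -log prior and -log likelihood for each candidate and preferring the smallest total; the sketch below uses qualitative placeholder scores (not the values from the figure) just to show the bookkeeping.

```python
# Placeholder scores only, chosen to mirror the qualitative trade-off described
# above (FLAT: most complex / exact fit; 1-STATE: simplest / poorest fit);
# they are NOT the values from the figure.
neg_log_scores = {
    # grammar: (-log prior, -log likelihood)
    "FLAT":    (60000.0,  9000.0),
    "1-STATE": (  300.0, 30000.0),
    "REG-M":   ( 4000.0, 16000.0),
    "CFG-S":   ( 2500.0, 14000.0),
}

# -log posterior (up to a constant) = -log prior + -log likelihood; lower = better.
neg_log_posterior = {g: prior + lik for g, (prior, lik) in neg_log_scores.items()}
for grammar, score in sorted(neg_log_posterior.items(), key=lambda kv: kv[1]):
    print(f"{grammar:8s} {score:10.1f}")
```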
Results: Generalization
How well does each grammar predict unseen sentence forms? E.g., complex aux-fronted interrogatives:
[Table: for each sentence type, whether it appears in the corpus and the probability assigned to it by FLAT, REG-N, REG-M, REG-B, 1-STATE, CFG-S, and CFG-L.]
– Simple declarative: Eagles do fly. (n aux vi)
– Simple interrogative: Do eagles fly? (aux n vi)
– Complex declarative: Eagles that are alive do fly. (n comp aux adj aux vi)
– Complex interrogative: Do eagles that are alive fly? (aux n comp aux adj vi)
– Complex interrogative (ungrammatical): *Are eagles that alive do fly? (aux n comp adj aux vi)
Results: First file (90 mins)
(Note: these are -log probabilities, so lower = better!)
[Figure: the same prior p(G | T), likelihood p(D | G, T), and posterior p(G, T | D) comparison (CFG-S, CFG-L; REG-B, REG-M, REG-N), computed from only the first file (90 minutes) of the corpus.]
Conclusions: poverty of the stimulus
• Contemporary probabilistic modeling provides new tools for formalizing and analyzing stimulus poverty arguments.
• Suggests an alternative to standard linguistic nativism
– Hierarchical phrase structure is a crucial constraint on syntactic acquisition, but perhaps need not be specified innately in the language faculty. It can be inferred from child-directed speech by an ideal learner equipped with innate domain-general capacities to represent grammars of various types and to perform Bayesian inference.
• Many open questions
– How close have we come to finding the best grammars of each type?
– How would these results extend to richer representations of syntax, or to jointly learning multiple levels of linguistic structure?
– What are the prospects for computationally or psychologically realistic learning algorithms that might approximate this ideal learner?
– What is the actual source of hierarchical structure in syntax?
Conclusions: more general
• Bayesian inference over hierarchies of structured representations provides a way to study core questions of human language in a domain-general framework:
– What is the content and form of abstract knowledge?
– How can abstract knowledge guide generalization from sparse data?
– How can abstract knowledge be acquired? What must be built in?
• A way to move beyond traditional dichotomies.
– How can structured knowledge be acquired by statistical learning?
– How can domain-general learning mechanisms acquire domain-specific inductive constraints?
• A different way to think about cognitive development.
– Powerful abstractions can be inferred “top down”, from surprisingly little data, together with learning more concrete knowledge.
– Very different from the traditional empiricist or nativist views of abstraction. Worth pursuing more generally…
Abstract knowledge in cognitive development
– Word learning: whole object bias, taxonomic principle (Markman); shape bias (Smith)
– Causal reasoning: causal schemata (Kelley)
– Folk physics: objects are unified, persistent (Spelke)
– Number: counting principles (Gelman)
– Folk biology: principles of taxonomic rank (Atran)
– Folk psychology: principle of rationality (Gergely)
– Ontology: M-constraint on predicability (Keil)
– Syntax: UG (Chomsky)
– Phonology: faithfulness, markedness constraints (Prince, Smolensky)