
Page 1:

Synergies in Language Acquisition

Mark Johnson

joint work with Benjamin Börschinger, Katherine Demuth, Michael Frank, Mehdi Parviz, Sharon Goldwater, Tom Griffiths and Bevan Jones

Macquarie University, Sydney, Australia

1/68

Page 2:

Outline

Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion

2/68

Page 3:

The Janus-faced character of modern computational linguistics

Engineering applications (Natural Language Processing):
▶ information extraction from large text collections
▶ machine translation
▶ speech recognition (?)

Scientific side (Computational Linguistics):
▶ computation is the manipulation of meaning-bearing symbols in ways that respect their meaning
▶ studies language comprehension, production and acquisition as computational processes

Spectacular (and lucrative) advances in NLP; can they help us understand language?
▶ steam engines invented a century before thermodynamics and statistical mechanics

3/68

Page 4:

Computers as useful tools for neurolinguistics

Neural responses recorded using MEG while listening to/reading science podcasts
▶ responses aligned with word onsets
Construct linear regression models predicting neural response as a function of linguistic predictors for each spatio-temporal location
Log word frequency is highly predictive (explains 1.46% of variance)
Response is primarily 240–400 msec post-stimulus
Many brain regions involved, including Wernicke's area

4/68

Page 5:

Computational models of human syntactic processing

Roark's parser is an incremental predictive syntactic parser
▶ uses statistics derived from the Penn treebank
Constructs connected syntactic parse trees on a word-by-word basis
The number of derivational steps to incorporate a word into the parse is highly predictive (explains 1.91% of variance)
Response 320–400 msec post-stimulus
Activations in Broca's area and Wernicke's area

5/68

Page 6:

How can computational models help us understand language acquisition?

Most computational linguistics research focuses on parsing or learning algorithms
A computational model (Marr 1982) of acquisition specifies:
▶ the input (information available to learner)
▶ the output (generalisations learner can make)
▶ a model that relates input to output

This talk compares:
▶ staged learning, which learns one kind of thing at a time, and
▶ joint learning, which learns several kinds of things simultaneously,
and demonstrates synergies in acquisition that only joint learners exploit
We do this by comparing models that differ solely in the kinds of generalisations they can form

6/68

Page 7:

Bayesian learning as an "ideal observer" theory of learning

P(Grammar | Data) ∝ P(Data | Grammar) × P(Grammar)
    Posterior           Likelihood          Prior

Likelihood measures how well grammar describes data
Prior expresses knowledge of grammar before data is seen
▶ can be very specific (e.g., Universal Grammar)
▶ can be very general (e.g., prefer shorter grammars)
Prior can also express markedness preferences ("soft universals")
Posterior is a product of both likelihood and prior
▶ a grammar must do well on both to have high posterior probability
Posterior is a distribution over grammars
▶ captures learner's uncertainty about which grammar is correct
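As a worked illustration of this equation (not from the talk), the sketch below renormalises likelihood × prior over two hypothetical candidate grammars with made-up numbers:

    # Hedged sketch: posterior ∝ likelihood × prior for two hypothetical
    # grammars G1 and G2; all numbers are illustrative, not from the talk.
    likelihood = {"G1": 1e-12, "G2": 4e-13}   # P(Data | Grammar)
    prior      = {"G1": 0.2,   "G2": 0.8}     # P(Grammar)

    unnormalised = {g: likelihood[g] * prior[g] for g in likelihood}
    total = sum(unnormalised.values())
    posterior = {g: p / total for g, p in unnormalised.items()}
    print(posterior)   # roughly {'G1': 0.38, 'G2': 0.62}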

7/68

Page 8:

The acquisition of the lexicon as non-parametric inference

What has to be learned in order to learn a word?
▶ pronunciation (sequence of phonemes)
▶ syntactic properties
▶ semantic properties (what kinds of things it can refer to)

There are unboundedly many different possible pronunciations (and possible meanings?)
Parametric inference: learn values of a finite number of parameters
Non-parametric inference:
▶ possibly infinite number of parameters
▶ learn which parameters are relevant as well as their values
Adaptor grammars use a grammar to generate parameters for learning (e.g., possible lexical items)
▶ builds on non-parametric hierarchical Bayesian inference

8/68

Page 9:

Outline

Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion

9/68

Page 10:

Probabilistic context-free grammars

Probabilistic context-free grammars (PCFGs) define probability distributions over trees
Each nonterminal node expands by
▶ choosing a rule expanding that nonterminal, and
▶ recursively expanding any nonterminal children it contains
Probability of tree is product of probabilities of rules used to construct it

Probability θr   Rule r
1                S → NP VP
0.7              NP → Sam
0.3              NP → Sandy
1                VP → V NP
0.8              V → likes
0.2              V → hates

Example tree: (S (NP Sam) (VP (V likes) (NP Sandy)))

P(Tree) = 1 × 0.7 × 1 × 0.8 × 0.3

10/68
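A small sketch of the calculation above, using the slide's rule table and example tree (the tuple encoding of the tree is my own, purely for illustration):

    # Sketch: P(tree) = product of the probabilities of the rules used to
    # build it, for the grammar and tree on this slide.
    rule_prob = {
        ("S",  ("NP", "VP")): 1.0,
        ("NP", ("Sam",)):     0.7,
        ("NP", ("Sandy",)):   0.3,
        ("VP", ("V", "NP")):  1.0,
        ("V",  ("likes",)):   0.8,
        ("V",  ("hates",)):   0.2,
    }

    # (S (NP Sam) (VP (V likes) (NP Sandy))) encoded as (label, children) pairs
    tree = ("S", [("NP", [("Sam", [])]),
                  ("VP", [("V", [("likes", [])]),
                          ("NP", [("Sandy", [])])])])

    def tree_prob(node):
        label, children = node
        if not children:                       # terminal symbol
            return 1.0
        p = rule_prob[(label, tuple(c[0] for c in children))]
        for child in children:
            p *= tree_prob(child)
        return p

    print(tree_prob(tree))   # 1 x 0.7 x 1 x 0.8 x 0.3 ≈ 0.168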

Page 16:

A CFG for stem-suffix morphology

Word → Stem Suffix        Chars → Char
Stem → Chars              Chars → Char Chars
Suffix → Chars            Char → a | b | c | . . .

Example parse of "talking#": (Word (Stem (Chars t a l k)) (Suffix (Chars i n g #)))

Grammar's trees can represent any segmentation of words into stems and suffixes
⇒ Can represent true segmentation
But grammar's units of generalization (PCFG rules) are "too small" to learn morphemes

11/68

Page 17:

A "CFG" with one rule per possible morpheme

Word → Stem Suffix
Stem → all possible stems
Suffix → all possible suffixes

Example parses: (Word (Stem t a l k) (Suffix i n g #))   (Word (Stem j u m p) (Suffix #))

A rule for each morpheme
⇒ "PCFG" can represent probability of each morpheme
Unbounded number of possible rules, so this is not a PCFG
▶ not a practical problem, as only a finite set of rules could possibly be used in any given finite data set

12/68

Page 18:

Maximum likelihood estimate for θ is trivial

Maximum likelihood selects θ that minimizes KL-divergence between model and training data W distributions
Saturated model in which each word is generated by its own rule replicates training data distribution W exactly
⇒ Saturated model is maximum likelihood estimate
Maximum likelihood estimate does not find any suffixes

Example saturated parse: (Word (Stem t a l k i n g) (Suffix #))

13/68

Page 19:

Forcing generalization via sparse priors

Idea: use Bayesian prior that prefers fewer rules
Set of rules is fixed in standard PCFG estimation, but can "turn rule off" by setting θA→β ≈ 0
Dirichlet prior with αA→β ≈ 0 prefers θA→β ≈ 0

[Figure: Beta(α, α) densities P(θ1 | α) plotted against the rule probability θ1 for α = (1,1), (0.5,0.5), (0.25,0.25), (0.1,0.1)]
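A minimal sketch of the effect plotted above, assuming the two-rule case so the symmetric Dirichlet reduces to a Beta(α, α) density (uses scipy):

    # Sketch: with two rules for a nonterminal, a symmetric Dirichlet prior is a
    # Beta(alpha, alpha) density on theta1.  As alpha shrinks, more prior mass
    # sits near theta1 ≈ 0 or theta1 ≈ 1, i.e. it prefers to switch a rule off.
    from scipy.stats import beta

    for a in (1.0, 0.5, 0.25, 0.1):
        # prior probability that theta1 lies within 0.05 of either 0 or 1
        edge_mass = beta.cdf(0.05, a, a) + (1.0 - beta.cdf(0.95, a, a))
        print("alpha = %.2f   P(theta1 near 0 or 1) = %.2f" % (a, edge_mass))
    # Rises from 0.10 at alpha = 1 to roughly 0.76 at alpha = 0.1,
    # mirroring the sharpening curves in the figure.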

14/68

Page 20:

Morphological segmentation experiment

Trained on orthographic verbs from the U. Penn. Wall Street Journal treebank
Uniform Dirichlet prior prefers sparse solutions as α → 0
Gibbs sampler samples from posterior distribution of parses
▶ reanalyses each word based on parses of the other words

15/68

Page 21:

Posterior samples from WSJ verb tokens

α = 0.1        α = 10^-5       α = 10^-10      α = 10^-15
expect         expect          expect          expect
expects        expects         expects         expects
expected       expected        expected        expected
expecting      expect ing      expect ing      expect ing
include        include         include         include
includes       includes        includ es       includ es
included       included        includ ed       includ ed
including      including       including       including
add            add             add             add
adds           adds            adds            add s
added          added           add ed          added
adding         adding          add ing         add ing
continue       continue        continue        continue
continues      continues       continue s      continue s
continued      continued       continu ed      continu ed
continuing     continuing      continu ing     continu ing
report         report          report          report
reports        report s        report s        report s
reported       reported        reported        reported
reporting      report ing      report ing      report ing
transport      transport       transport       transport
transports     transport s     transport s     transport s
transported    transport ed    transport ed    transport ed
transporting   transport ing   transport ing   transport ing
downsize       downsiz e       downsiz e       downsiz e
downsized      downsiz ed      downsiz ed      downsiz ed
downsizing     downsiz ing     downsiz ing     downsiz ing
dwarf          dwarf           dwarf           dwarf
dwarfs         dwarf s         dwarf s         dwarf s
dwarfed        dwarf ed        dwarf ed        dwarf ed
outlast        outlast         outlast         outlas t
outlasted      outlast ed      outlast ed      outlas ted

16/68

Page 22:

Relative frequencies of inflected verb forms

17/68

Page 23:

Types and tokens

A word type is a distinct word shape
A word token is an occurrence of a word

Data = "the cat chased the other cat"
Tokens = "the", "cat", "chased", "the", "other", "cat"
Types = "the", "cat", "chased", "other"

Estimating θ from word types rather than word tokens eliminates (most) frequency variation
▶ 4 common verb suffixes, so when estimating from verb types θSuffix→i n g # ≈ 0.25
Several psycholinguists believe that humans learn morphology from word types
Adaptor grammar mimics Goldwater et al "Interpolating between Types and Tokens" morphology-learning model
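A two-line sketch of the type/token distinction for the slide's example, and of how type-based counting flattens frequency differences:

    # Sketch: tokens vs. types for the example above.
    tokens = "the cat chased the other cat".split()
    types = set(tokens)
    print(len(tokens), tokens)          # 6 tokens
    print(len(types), sorted(types))    # 4 types: cat, chased, other, the

    # token-based vs. type-based relative frequency of "the"
    print(tokens.count("the") / len(tokens))             # 2/6 ≈ 0.33
    print((1 if "the" in types else 0) / len(types))     # 1/4 = 0.25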

18/68

Page 24:

Posterior samples from WSJ verb types

α = 0.1         α = 10^-5       α = 10^-10      α = 10^-15
expect          expect          expect          exp ect
expects         expect s        expect s        exp ects
expected        expect ed       expect ed       exp ected
expect ing      expect ing      expect ing      exp ecting
include         includ e        includ e        includ e
include s       includ es       includ es       includ es
included        includ ed       includ ed       includ ed
including       includ ing      includ ing      includ ing
add             add             add             add
adds            add s           add s           add s
add ed          add ed          add ed          add ed
adding          add ing         add ing         add ing
continue        continu e       continu e       continu e
continue s      continu es      continu es      continu es
continu ed      continu ed      continu ed      continu ed
continuing      continu ing     continu ing     continu ing
report          report          repo rt         rep ort
reports         report s        repo rts        rep orts
reported        report ed       repo rted       rep orted
report ing      report ing      repo rting      rep orting
transport       transport       transport       transport
transport s     transport s     transport s     transport s
transport ed    transport ed    transport ed    transport ed
transporting    transport ing   transport ing   transport ing
downsize        downsiz e       downsi ze       downsi ze
downsiz ed      downsiz ed      downsi zed      downsi zed
downsiz ing     downsiz ing     downsi zing     downsi zing
dwarf           dwarf           dwarf           dwarf
dwarf s         dwarf s         dwarf s         dwarf s
dwarf ed        dwarf ed        dwarf ed        dwarf ed
outlast         outlast         outlas t        outla st
outlasted       outlas ted      outla sted

19/68

Page 25:

Desiderata for an extension of PCFGs

PCFG rules are "too small" to be effective units of generalization
⇒ generalize over groups of rules
⇒ units of generalization should be chosen based on data
Type-based inference mitigates over-dispersion
⇒ Hierarchical Bayesian model where:
▶ context-free rules generate types
▶ another process replicates the types to generate tokens

Adaptor grammars:
▶ learn probability of entire subtrees (how a nonterminal expands to terminals)
▶ use grammatical hierarchy to define a Bayesian hierarchy, from which type-based inference naturally emerges

20/68

Page 26:

Properties of adaptor grammars

Probability of reusing an adapted subtree TA ∝ number of times TA was previously generated
▶ adapted subtrees are not independent
  – an adapted subtree can be more probable than the rules used to construct it
▶ but they are exchangeable ⇒ efficient sampling algorithms
▶ "rich get richer" ⇒ Zipf power-law distributions
Each adapted nonterminal is associated with a Chinese Restaurant Process or Pitman-Yor Process
▶ CFG rules define base distribution of CRP or PYP
CRP/PYP parameters (e.g., αA) can themselves be estimated (e.g., slice sampling), so AG models have no tunable parameters
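A Pólya-urn style sketch of the "reuse with probability ∝ count, generate afresh with probability ∝ α" behaviour described above; the base distribution here is a toy stand-in, not the grammar's:

    import random

    # Sketch of a Chinese-Restaurant-Process-style draw: reuse an existing item
    # with probability proportional to its count, otherwise draw a fresh item
    # from the base distribution with probability proportional to alpha.
    def crp_draw(counts, alpha, base_sample):
        r = random.uniform(0, sum(counts.values()) + alpha)
        for item, n in counts.items():
            r -= n
            if r < 0:
                counts[item] += 1              # reuse: "rich get richer"
                return item
        item = base_sample()                   # fresh item from the base
        counts[item] = counts.get(item, 0) + 1
        return item

    counts = {}
    for _ in range(20):                        # toy base over three "words"
        crp_draw(counts, alpha=1.0,
                 base_sample=lambda: random.choice(["ðə", "bʊk", "si"]))
    print(counts)   # typically a skewed, Zipf-like count distribution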

21/68

Page 27:

Bayesian hierarchy inverts grammatical hierarchy

Grammatically, a Word is composed of a Stem and a Suffix, which are composed of Chars
To generate a new Word from an Adaptor Grammar:
▶ reuse an old Word, or
▶ generate a fresh one from the base distribution, i.e., generate a Stem and a Suffix

Example Word subtree: (Word (Stem (Chars t a l k)) (Suffix (Chars i n g #)))

Lower in the tree ⇒ higher in Bayesian hierarchy

22/68

Page 28:

Learning an adaptor grammar

Sampling algorithms are a natural implementation because they sample the parameters (tree fragments) as well as their values
▶ sufficient statistics are the number of times each rule and tree fragment is used in a derivation
Gibbs sampler for learning an AG from strings alone:
1. select a training sentence (e.g., at random)
2. (subtract sentence's rule and fragment counts if parsed before)
3. sample a parse for the sentence given the parses of the other sentences
   – PCFG proposal distribution and MH correction
4. add the new parse's rule and tree fragment counts to the global totals
▶ repeat steps 1.–4. forever
The computationally-intensive step is parsing the sentences of the training data
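A runnable but deliberately simplified skeleton of steps 1–4 above. The real step 3 (a PCFG-approximation proposal with a Metropolis-Hastings correction) is replaced here by a toy proposal that just splits the string at a random point, so only the count bookkeeping of the loop is being illustrated:

    import random
    from collections import Counter

    def toy_propose(sentence):
        # stand-in for step 3: a random binary split instead of a real parse
        i = random.randint(1, len(sentence) - 1)
        return (sentence[:i], sentence[i:])

    def gibbs_sweeps(sentences, n_sweeps=1000):
        counts, parses = Counter(), {}
        for _ in range(n_sweeps):
            s = random.choice(sentences)          # 1. pick a training sentence
            if s in parses:
                counts.subtract(parses[s])        # 2. remove its old counts
            parses[s] = toy_propose(s)            # 3. sample a new analysis
            counts.update(parses[s])              # 4. add the new counts
        return counts, parses

    counts, parses = gibbs_sweeps(["youwanttoseethebook", "thedoggie"])
    print(parses)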

23/68

Page 29:

Outline

Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion

24/68

Page 30:

Unsupervised word segmentation

Input: phoneme sequences with sentence boundaries (Brent)
Task: identify word boundaries, and hence words

j △ u ▲ w △ ɑ △ n △ t ▲ t △ u ▲ s △ i ▲ ð △ ə ▲ b △ ʊ △ k
"you want to see the book"

Ignoring phonology and morphology, this involves learning the pronunciations of the lexical items in the language
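Segmentation quality throughout the talk is reported as token f-score. The sketch below shows one standard way to compute it (a predicted word token counts as correct only if its span matches a gold word token); this is a generic formulation, not necessarily the exact scorer used in the experiments:

    # Sketch of token precision/recall/f-score over word spans.
    def word_spans(words):
        spans, start = set(), 0
        for w in words:
            spans.add((start, start + len(w)))
            start += len(w)
        return spans

    def token_f_score(predicted, gold):
        p, g = word_spans(predicted), word_spans(gold)
        correct = len(p & g)
        precision, recall = correct / len(p), correct / len(g)
        return 2 * precision * recall / (precision + recall)

    gold = ["ju", "wɑnt", "tu", "si", "ðə", "bʊk"]
    predicted = ["ju", "wɑnttu", "si", "ðə", "bʊk"]   # one undersegmentation
    print(round(token_f_score(predicted, gold), 2))   # 0.73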

25/68

Page 31:

PCFG model of word segmentation

Words → Word
Words → Word Words
Word → Phons
Phons → Phon
Phons → Phon Phons
Phon → a | b | . . .

Example tree for "ðə bʊk": (Words (Word (Phons ð ə)) (Words (Word (Phons b ʊ k))))

CFG trees can describe segmentation, but PCFGs can't distinguish good segmentations from bad ones
▶ PCFG rules are too small a unit of generalisation
▶ need to learn e.g., probability that bʊk is a Word

26/68

Page 32:

Towards non-parametric grammars

Words → Word
Words → Word Words
Word → all possible phoneme sequences

Learn probability of Word → b ʊ k
But infinitely many possible Word expansions
⇒ this grammar is not a PCFG

Example tree: (Words (Word ð ə) (Words (Word b ʊ k)))

Given fixed training data, only finitely many useful rules
⇒ use data to choose Word rules as well as their probabilities
An adaptor grammar can do precisely this!

27/68

Page 33:

Unigram adaptor grammar (Brent)

Words → Word
Words → Word Words
Word → Phons
Phons → Phon
Phons → Phon Phons

Example tree: (Words (Word (Phons ð ə)) (Words (Word (Phons b ʊ k))))

Word nonterminal is adapted
⇒ To generate a Word:
▶ select a previously generated Word subtree with prob. ∝ number of times it has been generated
▶ expand using the Word → Phons rule with prob. ∝ αWord and recursively expand Phons

28/68

Page 34:

Unigram model of word segmentation

Unigram "bag of words" model (Brent):
▶ generate a dictionary, i.e., a set of words, where each word is a random sequence of phonemes
  – Bayesian prior prefers smaller dictionaries
▶ generate each utterance by choosing each word at random from the dictionary

Brent's unigram model as an adaptor grammar:
Words → Word+
Word → Phoneme+

Example segmentation: (Words (Word j u) (Word w ɑ n t) (Word t u) (Word s i) (Word ð ə) (Word b ʊ k))

Accuracy of word segmentation learnt: 56% token f-score (same as Brent model)
But we can construct many more word segmentation models using AGs
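A generative sketch of this unigram model: each Word is either a reuse of a previously generated word (probability ∝ its count) or a fresh phoneme string from the base distribution. The phoneme inventory, stopping probability, utterance length and α are all illustrative choices, not the values used in the experiments:

    import random

    PHONEMES = list("abdefgiklmnoprstuvwz")   # toy inventory

    def base_word(stop_prob=0.4):
        # base distribution: a random phoneme string with geometric length
        w = random.choice(PHONEMES)
        while random.random() > stop_prob:
            w += random.choice(PHONEMES)
        return w

    def generate_utterance(lexicon, alpha=1.0, n_words=4):
        words = []
        for _ in range(n_words):
            if not lexicon or random.uniform(0, sum(lexicon.values()) + alpha) < alpha:
                w = base_word()                       # new lexical item
            else:                                     # reuse with prob. ∝ count
                w = random.choices(list(lexicon), weights=lexicon.values())[0]
            lexicon[w] = lexicon.get(w, 0) + 1
            words.append(w)
        return " ".join(words)

    lexicon = {}
    for _ in range(5):
        print(generate_utterance(lexicon))
    # Later utterances increasingly reuse the same high-frequency items,
    # giving a Zipf-like lexicon.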

29/68

Page 35:

Adaptor grammar learnt from Brent corpus

Initial grammar:
1 Words → Word Words        1 Words → Word
1 Word → Phons
1 Phons → Phon Phons        1 Phons → Phon
1 Phon → D                  1 Phon → G
1 Phon → A                  1 Phon → E

A grammar learnt from the Brent corpus:
16625 Words → Word Words    9791 Words → Word
1575 Word → Phons
4962 Phons → Phon Phons     1575 Phons → Phon
134 Phon → D                41 Phon → G
180 Phon → A                152 Phon → E
460 Word → (Phons (Phon y) (Phons (Phon u)))
446 Word → (Phons (Phon w) (Phons (Phon A) (Phons (Phon t))))
374 Word → (Phons (Phon D) (Phons (Phon 6)))
372 Word → (Phons (Phon &) (Phons (Phon n) (Phons (Phon d))))

30/68

Page 36:

Undersegmentation errors with Unigram model

Words → Word+        Word → Phon+

Unigram word segmentation model assumes each word is generated independently
But there are strong inter-word dependencies (collocations)
Unigram model can only capture such dependencies by analyzing collocations as words (Goldwater 2006)

Example analyses:
(Words (Word t ɛi k) (Word ð ə d ɑ g i) (Word ɑu t))
(Words (Word j u w ɑ n t t u) (Word s i ð ə) (Word b ʊ k))

31/68

Page 37:

Accuracy of unigram model

[Figure: token f-score (0.0–0.8) as a function of the number of training sentences (1–10,000)]

32/68

Page 38:

Boundary precision of unigram model

[Figure: boundary precision (0.0–1.0) as a function of the number of training sentences (1–10,000)]

Boundary recall of unigram model

[Figure: boundary recall (0.0–0.8) as a function of the number of training sentences (1–10,000)]

33/68

Page 39:

Collocations ⇒ Words

Sentence → Colloc+
Colloc → Word+
Word → Phon+

Example tree: (Sentence (Colloc (Word j u) (Word w ɑ n t t u)) (Colloc (Word s i)) (Colloc (Word ð ə) (Word b ʊ k)))

A Colloc(ation) consists of one or more words
Both Words and Collocs are adapted (learnt)
Significantly improves word segmentation accuracy over unigram model (76% f-score; ≈ Goldwater's bigram model)

34/68

Page 40:

Outline

Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion

35/68

Page 41:

Two hypotheses about language acquisition

1. Pre-programmed staged acquisition of linguistic components
▶ Conventional view of lexical acquisition, e.g., Kuhl (2004)
  – child first learns the phoneme inventory, which it then uses to learn
  – phonotactic cues for word segmentation, which are used to learn
  – phonological forms of words in the lexicon, . . .
2. Interactive acquisition of all linguistic components together
▶ corresponds to joint inference for all components of language
▶ stages in language acquisition might be due to:
  – child's input may contain more information about some components
  – some components of language may be learnable with less data

36/68

Page 42:

Synergies: an advantage of interactive learning

An interactive learner can take advantage of synergies in acquisition
▶ partial knowledge of component A provides information about component B
▶ partial knowledge of component B provides information about component A
A staged learner can only take advantage of one of these dependencies
An interactive or joint learner can benefit from a positive feedback cycle between A and B
Are there synergies in learning how to segment words and identifying the referents of words?

37/68

Page 43:

Jointly learning words and syllables

Sentence → Colloc+              Colloc → Word+
Word → Syllable{1:3}            Syllable → (Onset) Rhyme
Onset → Consonant+              Rhyme → Nucleus (Coda)
Nucleus → Vowel+                Coda → Consonant+

Example tree: (Sentence (Colloc (Word (Onset l) (Nucleus ʊ) (Coda k)) (Word (Nucleus æ) (Coda t))) (Colloc (Word (Onset ð) (Nucleus ɪ) (Coda s))))

Rudimentary syllable model (an improved model might do better)
With 2 Collocation levels, f-score = 84%
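A rough sketch of the Syllable → (Onset) Nucleus (Coda) decomposition for a single-syllable word. The vowel inventory and the greedy onset/coda split are assumptions for illustration only; the model itself learns onset, nucleus and coda statistics rather than applying a fixed rule:

    VOWELS = set("aeiouæɑɛɪʊə")   # toy vowel inventory

    def split_syllable(phones):
        # consonants before the first vowel → Onset, the vowel run → Nucleus,
        # anything after → Coda
        i = 0
        while i < len(phones) and phones[i] not in VOWELS:
            i += 1
        j = i
        while j < len(phones) and phones[j] in VOWELS:
            j += 1
        return phones[:i], phones[i:j], phones[j:]

    print(split_syllable(list("lʊk")))   # (['l'], ['ʊ'], ['k'])
    print(split_syllable(list("ðɪs")))   # (['ð'], ['ɪ'], ['s'])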

38/68

Page 44:

Distinguishing internal onsets/codas helps

Sentence → Colloc+                        Colloc → Word+
Word → SyllableIF                         Word → SyllableI SyllableF
Word → SyllableI Syllable SyllableF       SyllableIF → (OnsetI) RhymeF
OnsetI → Consonant+                       RhymeF → Nucleus (CodaF)
Nucleus → Vowel+                          CodaF → Consonant+

Example tree: (Sentence (Colloc (Word (OnsetI h) (Nucleus æ) (CodaF v))) (Colloc (Word (Nucleus ə)) (Word (OnsetI d r) (Nucleus ɪ) (CodaF ŋ k))))

With 2 Collocation levels, not distinguishing initial/final clusters, f-score = 84%
With 3 Collocation levels, distinguishing initial/final clusters, f-score = 87%

39/68

Page 45:

Collocations2 ⇒ Words ⇒ Syllables

Example tree:
(Sentence
  (Colloc2 (Colloc (Word (OnsetI g) (Nucleus ɪ) (CodaF v)) (Word (OnsetI h) (Nucleus ɪ) (CodaF m)))
           (Colloc (Word (Nucleus ə)) (Word (OnsetI k) (Nucleus ɪ) (CodaF s))))
  (Colloc2 (Colloc (Word (Nucleus o)) (Word (OnsetI k) (Nucleus e)))))

40/68

Page 46:

Accuracy of Collocation + Syllable model

[Figure: token f-score (0.0–1.0) as a function of the number of training sentences (1–10,000)]

41/68

Page 47:

Accuracy of Collocation + Syllable model by word frequency

[Figure: accuracy (f-score, 0.0–1.0) as a function of the number of training sentences (1–10,000), broken down by word frequency band: 10-20, 20-50, 50-100, 100-200, 200-500, 500-1000, 1000-2000]

42/68

Page 48:

Interaction between syllable phonotactics and segmentation

Word segmentation accuracy depends on the kinds of generalisations the model can learn

words as units (unigram)                                     56%
+ associations between words (collocations)                  76%
+ syllable structure                                         84%
+ interaction between segmentation and syllable structure    87%

Synergies in learning words and syllable structure
▶ joint inference permits the learner to explain away potentially misleading generalizations

43/68

Page 49:

Exploiting stress in word segmentation

Stress is the "accentuation of syllables within words"
▶ 2-syllable words with initial stress: GIant, PICture, HEAting
▶ 2-syllable words with final stress: toDAY, aHEAD, aLLOW
English has a strong preference for initial stress (Cutler 1987)
▶ 50% of tokens / 85% of types have initial stress
▶ but: 50% of tokens / 5% of types are unstressed
Strong evidence that English-speaking children use stress for word segmentation
Data preparation: stress marked on vowel nucleus

j △ u ▲ w △ ɑ* △ n △ t ▲ t △ u ▲ s △ i* ▲ ð △ ə ▲ b △ ʊ* △ k
"you want to see the book"

▶ c.f. Johnson and Demuth (2010) tone annotation in Chinese
▶ function words are unstressed (contra Yang and others)

44/68

Page 50:

Learning stress patterns with AGs

Grammar can represent all possible stress patterns (up to 4 syllables)
Stress pattern probabilities learned jointly with phonotactics and segmentation

Example trees:
(Word (Stressed (SSyll (Onset p) (SRhyme (SNucleus ɪ*) (Coda k)))) (Unstressed (USyll (Onset t ʃ) (UNucleus ə) (Coda r))))
(Word (Unstressed (USyll (Onset t) (URhyme (UNucleus ə)))) (Stressed (SSyll (Onset d) (SRhyme (SNucleus e*)))))

45/68

Page 51:

Stress and phonotactics in word segmentation

Models differ only in kinds of generalisations they can form
▶ phonotactic models learn generalisations about word edges
▶ stress models learn probability of strong/weak sequences

Model                                  Accuracy
collocations + syllable structure      0.81
+ phonotactic cues                     0.85
+ stress                               0.86
+ both                                 0.88

Token f-score on the Alex portion of the Providence corpus
Both phonotactics and stress are useful cues for word segmentation
Performance improves when both are used ⇒ complementary cues for word segmentation

46/68

Page 52:

Stress and phonotactics over time

[Figure: token f-score (0.65–0.85) as a function of the number of input utterances (100–10,000) for colloc3-nophon, colloc3-phon, colloc3-nophon-stress and colloc3-phon-stress]

Joint stress+phonotactic model is best with small data
Models with either eventually catch up

47/68

Page 53:

More on learning stress

Probability of initial stress and unstressed word rules rapidly converges on their type frequencies in the data
Consistently underestimates probability of stress-second patterns (true type frequency = 0.07, estimated type frequency = 0.04)
▶ stress-second is also problematic for English children
Probability of word rules with more than one stress approaches zero as data grows
⇒ Unique stress constraint (Yang 2004) can be acquired

48/68

Page 54:

Outline

Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion

49/68

Page 55:

Prior work: mapping words to topics

Input to learner:
▶ word sequence: Is that the pig?
▶ objects in nonlinguistic context: dog, pig

Learning objectives:
▶ identify utterance topic: pig
▶ identify word-topic mapping: pig ↦ pig

50/68

Page 56:

Frank et al (2009) "topic models" as PCFGs

Prefix sentences with a possible topic marker, e.g., pig|dog
PCFG rules choose a topic from the topic marker and propagate it through the sentence
Each word is either generated from the sentence topic or the null topic ∅

Example tree: (Sentence (Topicpig (Topicpig (Topicpig (Topicpig (Topicpig pig|dog) (Word∅ is)) (Word∅ that)) (Word∅ the)) (Wordpig pig)))

Grammar can require at most one topical word per sentence
Bayesian inference for PCFG rules and trees corresponds to Bayesian inference for word and sentence topics using the topic model (Johnson 2010)
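A sketch of how input of this form could be assembled from an utterance plus the objects in its non-linguistic context; the exact file format of the original experiments is an assumption beyond the single example shown on the next slide:

    # Sketch: prefix the unsegmented phoneme string with the objects present
    # in the non-linguistic context, separated by "|".
    def make_input(context_objects, phonemes):
        return "|".join(context_objects) + " " + " ".join(phonemes)

    print(make_input(["pig", "dog"], list("ɪzðætðəpɪg")))
    # pig|dog ɪ z ð æ t ð ə p ɪ g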

51/68

Page 57:

AGs for joint segmentation and topic-mapping

Combine topic-model PCFG with word segmentation AGs
Input consists of unsegmented phonemic forms prefixed with possible topics:

pig|dog ɪ z ð æ t ð ə p ɪ g

E.g., combination of Frank "topic model" and unigram segmentation model
▶ equivalent to Jones et al (2010)
Easy to define other combinations of topic models and segmentation models

Example tree: (Sentence (Topicpig (Topicpig (Topicpig (Topicpig (Topicpig pig|dog) (Word∅ ɪ z)) (Word∅ ð æ t)) (Word∅ ð ə)) (Wordpig p ɪ g)))

52/68

Page 58:

Collocation topic model AG

Example tree: (Sentence (Topicpig (Topicpig (Topicpig pig|dog) (Colloc∅ (Word∅ ɪ z) (Word∅ ð æ t))) (Collocpig (Word∅ ð ə) (Wordpig p ɪ g))))

Collocations are either "topical" or not
Easy to modify this grammar so
▶ at most one topical word per sentence, or
▶ at most one topical word per topical collocation

53/68

Page 59:

Does non-linguistic context help segmentation?

Model                                   word segmentation
segmentation   topics                   token f-score
unigram        not used                 0.533
unigram        any number               0.537
unigram        one per sentence         0.547
collocation    not used                 0.695
collocation    any number               0.726
collocation    one per sentence         0.719
collocation    one per collocation      0.750

Not much improvement with unigram model
▶ consistent with results from Jones et al (2010)
Larger improvement with collocation model
▶ most gain with one topical word per topical collocation (this constraint cannot be imposed on unigram model)

54/68

Page 60:

Does better segmentation help topic identification?

Task: identify object (if any) this sentence is about

Model                                   sentence topic
segmentation   topics                   accuracy   f-score
unigram        not used                 0.709      0
unigram        any number               0.702      0.355
unigram        one per sentence         0.503      0.495
collocation    not used                 0.709      0
collocation    any number               0.728      0.280
collocation    one per sentence         0.440      0.493
collocation    one per collocation      0.839      0.747

The collocation grammar with one topical word per topical collocation is the only model clearly better than baseline

55/68

Page 61:

Does better segmentation help learning word-topic mappings?

Task: identify head nouns of NPs referring to topical objects (e.g. pɪg ↦ pig in input pig|dog ɪ z ð æ t ð ə p ɪ g)

Model                                   topical word
segmentation   topics                   f-score
unigram        not used                 0
unigram        any number               0.149
unigram        one per sentence         0.147
collocation    not used                 0
collocation    any number               0.220
collocation    one per sentence         0.321
collocation    one per collocation      0.636

The collocation grammar with one topical word per topical collocation is best at identifying head nouns of topical NPs

56/68

Page 62:

Accuracy as a function of grammar and topicality

[Figure: accuracy (f-score, 0.0–1.0) as a function of the number of training sentences (1–10,000) for colloc Not_topical, colloc Topical, Tcolloc1 Not_topical and Tcolloc1 Topical]

57/68

Page 63:

Summary of jointly learning word segmentation and word-to-topic mappings

Word to object mapping is learnt more accurately when words are segmented more accurately
▶ improving segmentation accuracy improves topic detection and acquisition of topical words
Word segmentation accuracy improves when exploiting non-linguistic context information
▶ incorporating word-topic mapping improves segmentation accuracy (at least with collocation grammars)
⇒ There are synergies a learner can exploit when learning word segmentation and word-object mappings

58/68

Page 64:

Outline

Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion

59/68

Page 65:

Social cues and word-topic mapping

Social interactions are important for early language acquisition
Can computational models exploit social cues?
▶ we show this by building models that can exploit social cues, and show they learn better word-topic mappings on data with social cues than when social cues are removed
▶ no evidence that social cues improve word-segmentation accuracy
Our models learn the relative importance of different social cues
▶ estimate the probability of each cue occurring with "topical objects" and the probability of each cue occurring with "non-topical objects"
▶ they do this in an unsupervised way, i.e., they are not told which objects are topical
▶ ablation tests show that eye-gaze is the most important social cue for learning word-topic mappings

60/68

Page 66:

Function words in word segmentation

Some psychologists believe children exploit function words in early language acquisition
▶ function words often are high frequency and phonologically simple ⇒ easy to learn?
▶ function words typically appear in phrase-peripheral positions ⇒ provide "anchors" for word and phrase segmentation
Modify word segmentation grammar to optionally generate sequences of mono-syllabic "function words" at collocation edges
▶ improves word segmentation f-score from 0.87 to 0.92
Model can learn directionality of function word attachment
▶ Bayes factor hypothesis test overwhelmingly prefers left to right function word attachment in English

61/68

Page 67:

Jointly learning word segmentation and phonological alternation

Word segmentation models so far don't account for phonological alternations (e.g., final devoicing, /t/-deletion, etc.)
▶ fundamental operation in CFG is string concatenation
▶ no principled reason why adaptor grammars can't be combined with phonological operations
Börschinger, Johnson and Demuth (2013) generalises Goldwater's bigram word segmentation model to allow word-final /t/-deletion
▶ applied to Buckeye corpus of adult speech
Current work: incorporate a MaxEnt "Harmony theory" model of phonological alternation

62/68

Page 68:

On-line learning using particle filters

The Adaptor Grammar software uses a batch algorithm that repeatedly parses the data
▶ in principle, all the algorithm requires is a source of random samples from the training data
⇒ Adaptor Grammars can be learned on-line
Particle filters are a standard technique for on-line Bayesian inference
▶ a particle filter updates multiple analyses in parallel
Börschinger and Johnson (2011, 2012) explore on-line particle filter algorithms for Goldwater's bigram model
▶ a particle filter needs tens of thousands of particles to approach the Metropolis-within-Gibbs algorithm used for Adaptor Grammars here
▶ adding a rejuvenation step reduces the number of particles needed dramatically
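A generic weight-and-resample particle filter skeleton, applied here to a toy problem (online inference for a coin's bias) rather than to word segmentation; in the segmentation setting each particle would instead carry a segmentation hypothesis and be re-weighted by the probability of the next utterance. This only illustrates the general technique, not the Börschinger and Johnson implementation:

    import random

    def particle_filter(observations, n_particles=1000):
        # each particle is a candidate value of the coin's bias
        particles = [random.random() for _ in range(n_particles)]
        for obs in observations:                # obs: 1 = heads, 0 = tails
            weights = [p if obs else (1 - p) for p in particles]
            total = sum(weights)
            weights = [w / total for w in weights]
            # resample particles in proportion to their weights
            particles = random.choices(particles, weights=weights, k=n_particles)
        return sum(particles) / n_particles     # posterior mean estimate

    flips = [1, 1, 0, 1, 1, 1, 0, 1]
    print(round(particle_filter(flips), 2))     # roughly 0.7 for this sequence
    # With a static parameter the particle set collapses onto a few values;
    # a rejuvenation (MCMC) move after resampling counteracts this, which is
    # one reason it sharply reduces the number of particles needed.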

63/68

Page 69:

Outline

Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion

64/68

Page 70:

Conclusions and future work

Joint learning often uses information in the input more effectively than staged learning
▶ Learning syllable structure and word segmentation
▶ Learning word-topic associations and word segmentation
Do children exploit such synergies in language acquisition?
Adaptor grammars are a flexible framework for stating non-parametric hierarchical Bayesian models
▶ the accuracies obtained here are the best reported in the literature
Future work: make the models more realistic
▶ extend expressive power of AGs (e.g., incorporating MaxEnt/Harmony-theory components)
▶ richer data (e.g., more non-linguistic context)
▶ more realistic data (e.g., phonological variation)

65/68

Page 71:

How specific should our computational models be?

Marr's (1982) three levels of computational models:
▶ computational level (inputs, outputs and the relation between them)
▶ algorithmic level (steps involved in mapping from input to output)
▶ implementation level (physical processes involved)
Algorithmic-level models are extremely popular, but I think we should focus on computational-level models first
▶ we know almost nothing about how hierarchical structures are represented and manipulated in the brain
⇒ we know almost nothing about which data structures and operations are neurologically plausible
▶ current models only explain a tiny fraction of language processing or acquisition
▶ typically computational models can be extended, while algorithms need to be completely changed
⇒ today's computational models have a greater chance of being relevant than today's algorithms

66/68

Page 72:

Why a child's learning algorithm may be nothing like our algorithms

Enormous differences in "hardware" ⇒ different feasible algorithms
As scientists we need generic learning algorithms, but a child only needs a specific learning algorithm
▶ as scientists we want to study the effects of different modelling assumptions on learning
⇒ we need generic algorithms that work for a range of different models, so we can compare them
▶ a child only needs an algorithm that works for whatever model they have
⇒ the child's algorithm might be specialised to their model, and need not work at all for other kinds of models
The field of machine learning has developed many generic learning algorithms: Expectation-maximisation, variational Bayes, Markov chain Monte Carlo, Gibbs samplers, particle filters, . . .

67/68

Page 73:

The future of Bayesian models of language acquisition

P(Grammar | Data) ∝ P(Data | Grammar) × P(Grammar)
    Posterior           Likelihood          Prior

So far our grammars and priors don't encode much linguistic knowledge, but in principle they can!
▶ how do we represent this knowledge?
▶ how can we learn efficiently using this knowledge?
Should permit us to empirically investigate effects of specific universals on the course of language acquisition
My guess: the interaction between innate knowledge and learning will be richer and more interesting than either the rationalists or empiricists currently imagine!

68/68