TRANSCRIPT
Synergies in Language Acquisition
Mark Johnson
joint work with Benjamin Borschinger, Katherine Demuth, Michael Frank, Mehdi Parviz, Sharon Goldwater, Tom Griffiths and Bevan Jones
Macquarie University, Sydney, Australia
1/68
Outline
Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion
2/68
The Janus-faced character of modern computational linguistics
Engineering applications (Natural Language Processing):
▶ information extraction from large text collections
▶ machine translation
▶ speech recognition (?)
Scientific side (Computational Linguistics):
▶ computation is the manipulation of meaning-bearing symbols in ways that respect their meaning
▶ studies language comprehension, production and acquisition as computational processes
Spectacular (and lucrative) advances in NLP; can they help us understand language?
▶ steam engines invented a century before thermodynamics and statistical mechanics
3/68
Computers as useful tools for neurolinguistics
Neural responses recorded using MEG while listening to/reading science podcasts
▶ responses aligned with word onsets
Construct linear regression models predicting neural response as a function of linguistic predictors for each spatio-temporal location
Log word frequency is highly predictive (explains 1.46% of variance)
Response is primarily 240–400 msec post-stimulus
Many brain regions involved, including Wernicke’s area
4/68
Computational models of human syntactic processing
Roark’s parser is an incremental predictive syntactic parser
▶ uses statistics derived from the Penn treebank
Constructs connected syntactic parse trees on a word-by-word basis
The number of derivational steps to incorporate a word into the parse is highly predictive (explains 1.91% of variance)
Response 320–400 msec post-stimulus
Activations in Broca’s area and Wernicke’s area
5/68
How can computational models help us understand language acquisition?
Most computational linguistics research focuses on parsing or learning algorithms
A computational model (Marr 1982) of acquisition specifies:
▶ the input (information available to learner)
▶ the output (generalisations learner can make)
▶ a model that relates input to output
This talk compares:
▶ staged learning, which learns one kind of thing at a time, and
▶ joint learning, which learns several kinds of things simultaneously,
and demonstrates synergies in acquisition that only joint learners exploit
We do this by comparing models that differ solely in the kinds of generalisations they can form
6/68
Bayesian learning as an “ideal observer” theory of learning
P(Grammar | Data) ∝ P(Data | Grammar) × P(Grammar)
   (Posterior)        (Likelihood)        (Prior)
Likelihood measures how well grammar describes data
Prior expresses knowledge of grammar before data is seen
▶ can be very specific (e.g., Universal Grammar)
▶ can be very general (e.g., prefer shorter grammars)
Prior can also express markedness preferences (“soft universals”)
Posterior is a product of both likelihood and prior
▶ a grammar must do well on both to have high posterior probability
Posterior is a distribution over grammars
▶ captures learner’s uncertainty about which grammar is correct
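A toy numerical illustration of this trade-off (the two grammars G1 and G2 and all numbers are invented for exposition):

    # Posterior ∝ Likelihood × Prior, for two hypothetical grammars G1 and G2
    likelihood = {"G1": 0.008, "G2": 0.002}   # P(Data | Grammar): G1 fits the data better
    prior      = {"G1": 0.1,   "G2": 0.9}     # P(Grammar): G2 is preferred a priori (e.g. simpler)

    unnormalised = {g: likelihood[g] * prior[g] for g in likelihood}
    total = sum(unnormalised.values())
    posterior = {g: p / total for g, p in unnormalised.items()}
    print(posterior)   # ≈ {'G1': 0.31, 'G2': 0.69}: neither factor alone decides

A grammar that fits the data well but has a low prior probability can still lose to a simpler grammar, and vice versa.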
7/68
The acquisition of the lexicon as non-parametric inference
What has to be learned in order to learn a word?
▶ pronunciation (sequence of phonemes)
▶ syntactic properties
▶ semantic properties (what kinds of things it can refer to)
There are unboundedly many different possible pronunciations (and possible meanings?)
Parametric inference: learn values of a finite number of parameters
Non-parametric inference:
▶ possibly infinite number of parameters
▶ learn which parameters are relevant as well as their values
Adaptor grammars use a grammar to generate parameters for learning (e.g., possible lexical items)
▶ builds on non-parametric hierarchical Bayesian inference
8/68
Outline
Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion
9/68
Probabilistic context-free grammars
Probabilistic context-free grammars (PCFGs) define probability distributions over trees
Each nonterminal node expands by
▶ choosing a rule expanding that nonterminal, and
▶ recursively expanding any nonterminal children it contains
Probability of tree is product of probabilities of rules used to construct it

Probability θr   Rule r
1.0              S → NP VP
0.7              NP → Sam
0.3              NP → Sandy
1.0              VP → V NP
0.8              V → likes
0.2              V → hates

Example tree: (S (NP Sam) (VP (V likes) (NP Sandy)))
P(Tree) = 1 × 0.7 × 1 × 0.8 × 0.3 = 0.168
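A minimal sketch of this computation (the tree encoding and helper function are illustrative, not part of the original slides):

    # Rule probabilities from the table above
    theta = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("Sam",)): 0.7,
        ("NP", ("Sandy",)): 0.3,
        ("VP", ("V", "NP")): 1.0,
        ("V", ("likes",)): 0.8,
        ("V", ("hates",)): 0.2,
    }

    def tree_prob(tree):
        """Probability of a tree = product of the probabilities of the rules used."""
        if isinstance(tree, str):            # terminal symbol (a word)
            return 1.0
        label, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        p = theta[(label, rhs)]
        for child in children:
            p *= tree_prob(child)
        return p

    tree = ("S", [("NP", ["Sam"]), ("VP", [("V", ["likes"]), ("NP", ["Sandy"])])])
    print(tree_prob(tree))   # 1 × 0.7 × 1 × 0.8 × 0.3 = 0.168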
A CFG for stem-suffix morphology
Word → Stem Suffix
Stem → Chars
Suffix → Chars
Chars → Char
Chars → Char Chars
Char → a | b | c | . . .

Example tree (Chars chains abbreviated): (Word (Stem t a l k) (Suffix i n g #))
Grammar’s trees can represent any segmentation of words into stems and suffixes
⇒ Can represent true segmentation
But grammar’s units of generalization (PCFG rules) are “too small” to learn morphemes
11/68
A “CFG” with one rule per possible morpheme
Word → Stem Suffix
Stem → all possible stems
Suffix → all possible suffixes

(Word (Stem t a l k) (Suffix i n g #))
(Word (Stem j u m p) (Suffix #))
A rule for each morpheme
⇒ “PCFG” can represent probability of each morpheme
Unbounded number of possible rules, so this is not a PCFG
▶ not a practical problem, as only a finite set of rules could possibly be used in any given finite data set
12/68
Maximum likelihood estimate for θ is trivial
Maximum likelihood selects θ that minimizes KL-divergence between model and training data W distributions
Saturated model in which each word is generated by its own rule replicates training data distribution W exactly
⇒ Saturated model is maximum likelihood estimate
Maximum likelihood estimate does not find any suffixes

(Word (Stem t a l k i n g) (Suffix #))
13/68
Forcing generalization via sparse priors
Idea: use Bayesian prior that prefers fewer rules
Set of rules is fixed in standard PCFG estimation, but can “turn rule off” by setting θA→β ≈ 0
Dirichlet prior with αA→β ≈ 0 prefers θA→β ≈ 0
[Figure: Beta density P(θ1 | α) over rule probability θ1, for α = (1,1), (0.5,0.5), (0.25,0.25), (0.1,0.1); smaller α puts more mass near θ1 = 0 and θ1 = 1]
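An illustrative sketch of what the figure shows, using scipy: for a nonterminal with two rules, the symmetric Dirichlet(α, α) prior over (θ1, 1 − θ1) is a Beta(α, α) distribution over θ1, and as α shrinks the prior mass moves towards the extremes, i.e. towards switching one rule off:

    from scipy.stats import beta

    for a in (1.0, 0.5, 0.25, 0.1):
        dist = beta(a, a)
        # prior probability that θ1 is close to 0 or close to 1
        mass_at_extremes = dist.cdf(0.05) + (1.0 - dist.cdf(0.95))
        print(f"alpha = {a:4}: P(θ1 < 0.05 or θ1 > 0.95) = {mass_at_extremes:.2f}")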
14/68
Morphological segmentation experiment
Trained on orthographic verbs from the U. Penn. Wall Street Journal treebank
Uniform Dirichlet prior prefers sparse solutions as α → 0
Gibbs sampler samples from posterior distribution of parses
▶ reanalyses each word based on parses of the other words
15/68
Posterior samples from WSJ verb tokensα = 0.1 α = 10−5 α = 10−10 α = 10−15
expect expect expect expectexpects expects expects expects
expected expected expected expectedexpecting expect ing expect ing expect ing
include include include includeincludes includes includ es includ esincluded included includ ed includ ed
including including including includingadd add add add
adds adds adds add sadded added add ed added
adding adding add ing add ingcontinue continue continue continue
continues continues continue s continue scontinued continued continu ed continu ed
continuing continuing continu ing continu ingreport report report report
reports report s report s report sreported reported reported reported
reporting report ing report ing report ingtransport transport transport transport
transports transport s transport s transport stransported transport ed transport ed transport ed
transporting transport ing transport ing transport ingdownsize downsiz e downsiz e downsiz e
downsized downsiz ed downsiz ed downsiz eddownsizing downsiz ing downsiz ing downsiz ing
dwarf dwarf dwarf dwarfdwarfs dwarf s dwarf s dwarf s
dwarfed dwarf ed dwarf ed dwarf edoutlast outlast outlast outlas t
outlasted outlast ed outlast ed outlas ted
16/68
Relative frequencies of inflected verb forms
17/68
Types and tokens
A word type is a distinct word shape
A word token is an occurrence of a word
Data = “the cat chased the other cat”
Tokens = “the”, “cat”, “chased”, “the”, “other”, “cat”
Types = “the”, “cat”, “chased”, “other”
Estimating θ from word types rather than word tokens eliminates (most) frequency variation
▶ 4 common verb suffixes, so when estimating from verb types θSuffix→i n g # ≈ 0.25
Several psycholinguists believe that humans learn morphology from word types
Adaptor grammar mimics Goldwater et al “Interpolating between Types and Tokens” morphology-learning model
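A toy illustration of the difference, with invented counts rather than corpus data:

    from collections import Counter

    # Suffix of each verb token in a tiny invented corpus: "ing" dominates by token frequency
    token_suffixes = ["ing"] * 50 + ["ed"] * 30 + ["s"] * 15 + [""] * 5
    # Each of the 4 suffixes occurs on verb types, so the type-level list is flat
    type_suffixes = ["ing", "ed", "s", ""]

    def relative_frequencies(items):
        counts = Counter(items)
        total = sum(counts.values())
        return {item: count / total for item, count in counts.items()}

    print(relative_frequencies(token_suffixes))  # skewed towards high-frequency "ing"
    print(relative_frequencies(type_suffixes))   # ≈ 0.25 each, as on the slide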
18/68
Posterior samples from WSJ verb typesα = 0.1 α = 10−5 α = 10−10 α = 10−15
expect expect expect exp ectexpects expect s expect s exp ects
expected expect ed expect ed exp ectedexpect ing expect ing expect ing exp ectinginclude includ e includ e includ einclude s includ es includ es includ es
included includ ed includ ed includ edincluding includ ing includ ing includ ing
add add add addadds add s add s add sadd ed add ed add ed add ed
adding add ing add ing add ingcontinue continu e continu e continu econtinue s continu es continu es continu escontinu ed continu ed continu ed continu ed
continuing continu ing continu ing continu ingreport report repo rt rep ort
reports report s repo rts rep ortsreported report ed repo rted rep orted
report ing report ing repo rting rep ortingtransport transport transport transporttransport s transport s transport s transport stransport ed transport ed transport ed transport ed
transporting transport ing transport ing transport ingdownsize downsiz e downsi ze downsi zedownsiz ed downsiz ed downsi zed downsi zeddownsiz ing downsiz ing downsi zing downsi zing
dwarf dwarf dwarf dwarfdwarf s dwarf s dwarf s dwarf sdwarf ed dwarf ed dwarf ed dwarf ed
outlast outlast outlas t outla stoutlasted outlas ted outla sted
19/68
Desiderata for an extension of PCFGs
PCFG rules are “too small” to be effective units of generalization
⇒ generalize over groups of rules
⇒ units of generalization should be chosen based on data
Type-based inference mitigates over-dispersion
⇒ Hierarchical Bayesian model where:
▶ context-free rules generate types
▶ another process replicates the types to generate tokens
Adaptor grammars:
▶ learn probability of entire subtrees (how a nonterminal expands to terminals)
▶ use grammatical hierarchy to define a Bayesian hierarchy, from which type-based inference naturally emerges
20/68
Properties of adaptor grammars
Probability of reusing an adapted subtree T_A ∝ number of times T_A was previously generated
▶ adapted subtrees are not independent
– an adapted subtree can be more probable than the rules used to construct it
▶ but they are exchangeable ⇒ efficient sampling algorithms
▶ “rich get richer” ⇒ Zipf power-law distributions
Each adapted nonterminal is associated with a Chinese Restaurant Process or Pitman-Yor Process
▶ CFG rules define base distribution of CRP or PYP
CRP/PYP parameters (e.g., α_A) can themselves be estimated (e.g., by slice sampling), so AG models have no tunable parameters
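A minimal sketch of this reuse rule for a single adapted nonterminal, written as a plain Chinese Restaurant Process; the sample_from_base argument is a placeholder for expanding the nonterminal with its CFG rules:

    import random

    def adapted_draw(cache, alpha, sample_from_base):
        """Reuse a cached subtree with probability proportional to its count,
        or generate a fresh subtree from the base distribution with probability ∝ alpha."""
        n = sum(cache.values())
        if random.random() < alpha / (n + alpha):
            tree = sample_from_base()             # expand using the CFG rules
        else:
            r = random.uniform(0, n)              # pick a cached subtree ∝ its count
            for tree, count in cache.items():
                r -= count
                if r <= 0:
                    break
        cache[tree] = cache.get(tree, 0) + 1      # "rich get richer": update the count
        return tree

With a Pitman-Yor Process the reuse probabilities are discounted, but the same cache-and-reuse structure applies.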
21/68
Bayesian hierarchy inverts grammatical hierarchy
Grammatically, a Word is composed of a Stem and a Suffix, which are composed of Chars
To generate a new Word from an Adaptor Grammar:
▶ reuse an old Word, or
▶ generate a fresh one from the base distribution, i.e., generate a Stem and a Suffix

(Word (Stem t a l k) (Suffix i n g #))
Lower in the tree ⇒ higher in Bayesian hierarchy
22/68
Learning an adaptor grammar
Sampling algorithms are a natural implementation because they sample the parameters (tree fragments) as well as their values
▶ sufficient statistics are the number of times each rule and tree fragment is used in a derivation
Gibbs sampler for learning an AG from strings alone (see the sketch below):
1. select a training sentence (e.g., at random)
2. (subtract sentence’s rule and fragment counts if parsed before)
3. sample a parse for the sentence given parses of other sentences
– PCFG proposal distribution and MH correction
4. add new parse’s rule and tree fragment counts to global totals
▶ repeat steps 1.–4. forever
Computationally-intensive step is parsing sentences of training data
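A schematic sketch of that loop (the callables passed in are placeholders for the real parsing and count bookkeeping, which do the actual work):

    import random

    def gibbs_sweeps(sentences, sample_parse, add_counts, remove_counts, n_iterations=100000):
        """Metropolis-within-Gibbs loop over the training sentences."""
        parses = [None] * len(sentences)          # current parse of each sentence
        for _ in range(n_iterations):
            i = random.randrange(len(sentences))  # 1. select a training sentence at random
            if parses[i] is not None:
                remove_counts(parses[i])          # 2. subtract its old rule/fragment counts
            # 3. sample a parse given the parses of the other sentences
            #    (PCFG proposal distribution plus Metropolis-Hastings correction)
            parses[i] = sample_parse(sentences[i])
            add_counts(parses[i])                 # 4. add the new parse's counts to the totals
        return parses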
23/68
Outline
Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion
24/68
Unsupervised word segmentation
Input: phoneme sequences with sentence boundaries (Brent)
Task: identify word boundaries, and hence words
j △ u ▲ w △ ɑ △ n △ t ▲ t △ u ▲ s △ i ▲ ð △ ə ▲ b △ ʊ △ k
“you want to see the book”
Ignoring phonology and morphology, this involves learning the pronunciations of the lexical items in the language
25/68
PCFG model of word segmentation
Words → Word
Words → Word Words
Word → Phons
Phons → Phon
Phons → Phon Phons
Phon → a | b | . . .
CFG trees can describe segmentation, but PCFGs can’t distinguish good segmentations from bad ones
▶ PCFG rules are too small a unit of generalisation
▶ need to learn e.g., probability that bʊk is a Word
Example tree (Phons chains abbreviated): (Words (Word ð ə) (Words (Word b ʊ k)))
26/68
Towards non-parametric grammars
Words → Word
Words → Word Words
Word → all possible phoneme sequences
Learn probability Word → b ʊ k
But infinitely many possible Word expansions
⇒ this grammar is not a PCFG

(Words (Word ð ə) (Words (Word b ʊ k)))

Given fixed training data, only finitely many useful rules
⇒ use data to choose Word rules as well as their probabilities
An adaptor grammar can do precisely this!
27/68
Unigram adaptor grammar (Brent)
Words → Word
Words → Word Words
Word → Phons
Phons → Phon
Phons → Phon Phons

(Words (Word ð ə) (Words (Word b ʊ k)))

Word nonterminal is adapted
⇒ To generate a Word:
▶ select a previously generated Word subtree with prob. ∝ number of times it has been generated
▶ expand using Word → Phons rule with prob. ∝ α_Word and recursively expand Phons
28/68
Unigram model of word segmentation
Unigram “bag of words” model (Brent):
▶ generate a dictionary, i.e., a set of words, where each word is a random sequence of phonemes
– Bayesian prior prefers smaller dictionaries
▶ generate each utterance by choosing each word at random from dictionary
Brent’s unigram model as an adaptor grammar:
Words → Word+
Word → Phoneme+

(Words (Word j u) (Word w ɑ n t) (Word t u) (Word s i) (Word ð ə) (Word b ʊ k))
Accuracy of word segmentation learnt: 56% token f-score (same as Brent model)
But we can construct many more word segmentation models using AGs
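A small sketch of how token f-score is computed for word segmentation, assuming each phoneme is a single character; this is an illustration of the metric, not the evaluation code used in these experiments:

    def token_fscore(gold_utterances, predicted_utterances):
        """A predicted word token is correct only if both its boundaries match a gold word."""
        def spans(words):
            out, start = set(), 0
            for w in words:
                out.add((start, start + len(w)))
                start += len(w)
            return out

        correct = n_predicted = n_gold = 0
        for gold, pred in zip(gold_utterances, predicted_utterances):
            g, p = spans(gold), spans(pred)
            correct += len(g & p)
            n_gold += len(g)
            n_predicted += len(p)
        precision, recall = correct / n_predicted, correct / n_gold
        return 2 * precision * recall / (precision + recall)

    gold = [["ju", "want", "tu", "si", "ðə", "bʊk"]]
    pred = [["ju", "wanttu", "si", "ðəbʊk"]]       # an undersegmented analysis
    print(token_fscore(gold, pred))                # 0.4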
29/68
Adaptor grammar learnt from Brent corpus
Initial grammar
1 Words → Word Words
1 Words → Word
1 Word → Phons
1 Phons → Phon Phons
1 Phons → Phon
1 Phon → D
1 Phon → G
1 Phon → A
1 Phon → E
A grammar learnt from Brent corpus
16625 Words → Word Words
9791 Words → Word
1575 Word → Phons
4962 Phons → Phon Phons
1575 Phons → Phon
134 Phon → D
41 Phon → G
180 Phon → A
152 Phon → E
460 Word → (Phons (Phon y) (Phons (Phon u)))
446 Word → (Phons (Phon w) (Phons (Phon A) (Phons (Phon t))))
374 Word → (Phons (Phon D) (Phons (Phon 6)))
372 Word → (Phons (Phon &) (Phons (Phon n) (Phons (Phon d))))
30/68
Undersegmentation errors with Unigram model
Words → Word+
Word → Phon+
Unigram word segmentation model assumes each word is generated independently
But there are strong inter-word dependencies (collocations)
Unigram model can only capture such dependencies by analyzing collocations as words (Goldwater 2006)

(Words (Word t ɛi k) (Word ð ə d ɑ g i) (Word ɑu t))
(Words (Word j u w ɑ n t t u) (Word s i ð ə) (Word b ʊ k))
31/68
Accuracy of unigram model
[Figure: token f-score (0.0–0.8) vs. number of training sentences (1–10000)]
32/68
Boundary precision of unigram model
[Figure: boundary precision (0.0–1.0) vs. number of training sentences (1–10000)]
Boundary recall of unigram model
[Figure: boundary recall (0.0–0.8) vs. number of training sentences (1–10000)]
33/68
Collocations ⇒ Words
Sentence → Colloc+
Colloc → Word+
Word → Phon+

(Sentence (Colloc (Word j u) (Word w ɑ n t t u)) (Colloc (Word s i)) (Colloc (Word ð ə) (Word b ʊ k)))

A Colloc(ation) consists of one or more words
Both Words and Collocs are adapted (learnt)
Significantly improves word segmentation accuracy over unigram model (76% f-score; ≈ Goldwater’s bigram model)
34/68
Outline
Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion
35/68
Two hypotheses about language acquisition
1. Pre-programmed staged acquisition of linguistic components
▶ Conventional view of lexical acquisition, e.g., Kuhl (2004)
– child first learns the phoneme inventory, which it then uses to learn
– phonotactic cues for word segmentation, which are used to learn
– phonological forms of words in the lexicon, . . .
2. Interactive acquisition of all linguistic components together
▶ corresponds to joint inference for all components of language
▶ stages in language acquisition might be due to:
– child’s input may contain more information about some components
– some components of language may be learnable with less data
36/68
Synergies: an advantage of interactive learning
An interactive learner can take advantage of synergies in acquisition
▶ partial knowledge of component A provides information about component B
▶ partial knowledge of component B provides information about component A
A staged learner can only take advantage of one of these dependencies
An interactive or joint learner can benefit from a positive feedback cycle between A and B
Are there synergies in learning how to segment words and identifying the referents of words?
37/68
Jointly learning words and syllables
Sentence → Colloc+
Colloc → Word+
Word → Syllable{1:3}
Syllable → (Onset) Rhyme
Onset → Consonant+
Rhyme → Nucleus (Coda)
Nucleus → Vowel+
Coda → Consonant+

(Sentence (Colloc (Word (Onset l) (Nucleus ʊ) (Coda k)) (Word (Nucleus æ) (Coda t))) (Colloc (Word (Onset ð) (Nucleus ɪ) (Coda s))))

Rudimentary syllable model (an improved model might do better)
With 2 Collocation levels, f-score = 84%
38/68
Distinguishing internal onsets/codas helps
Sentence → Colloc+
Colloc → Word+
Word → SyllableIF
Word → SyllableI SyllableF
Word → SyllableI Syllable SyllableF
SyllableIF → (OnsetI) RhymeF
OnsetI → Consonant+
RhymeF → Nucleus (CodaF)
Nucleus → Vowel+
CodaF → Consonant+

(Sentence (Colloc (Word (OnsetI h) (Nucleus æ) (CodaF v))) (Colloc (Word (Nucleus ə)) (Word (OnsetI d r) (Nucleus ɪ) (CodaF ŋ k))))

With 2 Collocation levels, not distinguishing initial/final clusters, f-score = 84%
With 3 Collocation levels, distinguishing initial/final clusters, f-score = 87%
39/68
Collocations2 ⇒ Words ⇒ Syllables
(Sentence (Colloc2 (Colloc (Word (OnsetI g) (Nucleus ɪ) (CodaF v)) (Word (OnsetI h) (Nucleus ɪ) (CodaF m))) (Colloc (Word (Nucleus ə)) (Word (OnsetI k) (Nucleus ɪ) (CodaF s)))) (Colloc2 (Colloc (Word (Nucleus o)) (Word (OnsetI k) (Nucleus e)))))
40/68
Accuracy of Collocation + Syllable model
[Figure: token f-score (0.0–1.0) vs. number of training sentences (1–10000)]
41/68
Accuracy of Collocation + Syllable model by word frequency
[Figure: accuracy (f-score, 0.0–1.0) vs. number of training sentences (1–10000), one panel per word-frequency band: 10-20, 20-50, 50-100, 100-200, 200-500, 500-1000, 1000-2000]
42/68
Interaction between syllable phonotactics and segmentation
Word segmentation accuracy depends on the kinds of generalisations the model can learn

words as units (unigram)                                      56%
+ associations between words (collocations)                   76%
+ syllable structure                                          84%
+ interaction between segmentation and syllable structure     87%

Synergies in learning words and syllable structure
▶ joint inference permits the learner to explain away potentially misleading generalizations
43/68
Exploiting stress in word segmentation
Stress is the “accentuation of syllables within words”
▶ 2-syllable words with initial stress: GIant, PICture, HEAting
▶ 2-syllable words with final stress: toDAY, aHEAD, aLLOW
English has a strong preference for initial stress (Cutler 1987)
▶ 50% of tokens / 85% of types have initial stress
▶ but: 50% of tokens / 5% of types are unstressed
Strong evidence that English-speaking children use stress for word segmentation
Data preparation: stress marked on vowel nucleus
j △ u ▲ w △ ɑ* △ n △ t ▲ t △ u ▲ s △ i* ▲ ð △ ə ▲ b △ ʊ* △ k
“you want to see the book”
▶ cf. Johnson and Demuth (2010) tone annotation in Chinese
▶ function words are unstressed (contra Yang and others)
44/68
Learning stress patterns with AGs
Grammar can represent all possible stress patterns (up to 4 syllables)
Stress pattern probabilities learned jointly with phonotactics and segmentation

(Word (StressedUnstressed (SSyll (Onset p) (SRhyme (SNucleus ɪ *) (Coda k))) (USyll (Onset t ʃ) (UNucleus ə) (Coda r))))   “PICture”
(Word (UnstressedStressed (USyll (Onset t) (URhyme (UNucleus ə))) (SSyll (Onset d) (SRhyme (SNucleus e *)))))   “toDAY”
45/68
Stress and phonotactics in word segmentation
Models differ only in kinds of generalisations they can form
▶ phonotactic models learn generalisations about word edges
▶ stress models learn probability of strong/weak sequences

Model                                 Accuracy
collocations + syllable structure     0.81
+ phonotactic cues                    0.85
+ stress                              0.86
+ both                                0.88

Token f-score on the Alex portion of the Providence corpus
Both phonotactics and stress are useful cues for word segmentation
Performance improves when both are used ⇒ complementary cues for word segmentation
46/68
Stress and phonotactics over time
[Figure: token f-score (0.65–0.85) vs. number of input utterances (100–10000) for colloc3-nophon, colloc3-phon, colloc3-nophon-stress and colloc3-phon-stress]
Joint stress+phonotactic model is best with small data
Models with either cue alone eventually catch up
47/68
More on learning stress
Probability of initial stress and unstressed word rules rapidly converges on their type frequencies in the data
Consistently underestimates probability of stress-second patterns (true type frequency = 0.07, estimated type frequency = 0.04)
▶ stress-second is also problematic for English children
Probability of word rules with more than one stress approaches zero as data grows
⇒ Unique stress constraint (Yang 2004) can be acquired
48/68
Outline
Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion
49/68
Prior work: mapping words to topics
Input to learner:
▶ word sequence: Is that the pig?
▶ objects in nonlinguistic context: dog, pig
Learning objectives:
▶ identify utterance topic: pig
▶ identify word-topic mapping: pig ↦ pig
50/68
Frank et al (2009) “topic models” as PCFGs
Prefix sentences with possible topic marker, e.g., pig|dog
PCFG rules choose a topic from topic marker and propagate it through sentence
Each word is either generated from sentence topic or null topic ∅

(Sentence (Topicpig (Topicpig (Topicpig (Topicpig (Topicpig pig|dog) (Word∅ is)) (Word∅ that)) (Word∅ the)) (Wordpig pig)))

Grammar can require at most one topical word per sentence
Bayesian inference for PCFG rules and trees corresponds to Bayesian inference for word and sentence topics using topic model (Johnson 2010)
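A rough sketch of the kind of rule schema involved, read off the example tree above; this is a simplification for exposition, not the exact construction in Johnson (2010):

    def topic_pcfg_rules(topics, vocabulary):
        """Schematic rules: the sentence commits to one topic t; each word is then
        rewritten either as a topical word (Word_t) or a non-topical word (Word_None)."""
        rules = []
        for t in topics:
            rules.append(("Sentence", (f"Topic_{t}",)))
            rules.append((f"Topic_{t}", (f"Topic_{t}", "Word_None")))   # non-topical word
            rules.append((f"Topic_{t}", (f"Topic_{t}", f"Word_{t}")))   # topical word
            rules.append((f"Topic_{t}", ("TopicMarker",)))  # consumes the prefix, e.g. pig|dog
            rules.extend((f"Word_{t}", (w,)) for w in vocabulary)
        rules.extend(("Word_None", (w,)) for w in vocabulary)
        return rules

    for lhs, rhs in topic_pcfg_rules(["pig", "dog"], ["is", "that", "the", "pig"]):
        print(lhs, "→", " ".join(rhs))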
51/68
AGs for joint segmentation and topic-mapping
Combine topic-model PCFG with word segmentation AGs
Input consists of unsegmented phonemic forms prefixed with possible topics:
pig|dog ɪ z ð æ t ð ə p ɪ g
E.g., combination of Frank “topic model” and unigram segmentation model
▶ equivalent to Jones et al (2010)
Easy to define other combinations of topic models and segmentation models

(Sentence (Topicpig (Topicpig (Topicpig (Topicpig (Topicpig pig|dog) (Word∅ ɪ z)) (Word∅ ð æ t)) (Word∅ ð ə)) (Wordpig p ɪ g)))
52/68
Collocation topic model AG
(Sentence (Topicpig (Topicpig (Topicpig pig|dog) (Colloc∅ (Word∅ ɪ z) (Word∅ ð æ t))) (Collocpig (Word∅ ð ə) (Wordpig p ɪ g))))

Collocations are either “topical” or not
Easy to modify this grammar so
▶ at most one topical word per sentence, or
▶ at most one topical word per topical collocation
53/68
Does non-linguistic context help segmentation?

segmentation   topics                word segmentation token f-score
unigram        not used              0.533
unigram        any number            0.537
unigram        one per sentence      0.547
collocation    not used              0.695
collocation    any number            0.726
collocation    one per sentence      0.719
collocation    one per collocation   0.750

Not much improvement with unigram model
▶ consistent with results from Jones et al (2010)
Larger improvement with collocation model
▶ most gain with one topical word per topical collocation (this constraint cannot be imposed on unigram model)
54/68
Does better segmentation help topic identification?
Task: identify object (if any) this sentence is about

segmentation   topics                sentence-topic accuracy   f-score
unigram        not used              0.709                     0
unigram        any number            0.702                     0.355
unigram        one per sentence      0.503                     0.495
collocation    not used              0.709                     0
collocation    any number            0.728                     0.280
collocation    one per sentence      0.440                     0.493
collocation    one per collocation   0.839                     0.747

The collocation grammar with one topical word per topical collocation is the only model clearly better than baseline
55/68
Does better segmentation help learning word-topic mappings?
Task: identify head nouns of NPs referring to topical objects (e.g. pɪg ↦ pig in input pig | dog ɪ z ð æ t ð ə p ɪ g)

segmentation   topics                topical word f-score
unigram        not used              0
unigram        any number            0.149
unigram        one per sentence      0.147
collocation    not used              0
collocation    any number            0.220
collocation    one per sentence      0.321
collocation    one per collocation   0.636

The collocation grammar with one topical word per topical collocation is best at identifying head nouns of topical NPs
56/68
Accuracy as a function of grammar and topicality
[Figure: accuracy (f-score, 0.0–1.0) vs. number of training sentences (1–10000), by grammar and topicality: colloc Not_topical, colloc Topical, Tcolloc1 Not_topical, Tcolloc1 Topical]
57/68
Summary of jointly learning word segmentation and word-to-topic mappings
Word to object mapping is learnt more accurately when words are segmented more accurately
▶ improving segmentation accuracy improves topic detection and acquisition of topical words
Word segmentation accuracy improves when exploiting non-linguistic context information
▶ incorporating word-topic mapping improves segmentation accuracy (at least with collocation grammars)
⇒ There are synergies a learner can exploit when learning word segmentation and word-object mappings
58/68
Outline
Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion
59/68
Social cues and word-topic mapping
Social interactions are important for early language acquisition
Can computational models exploit social cues?
▶ we show this by building models that can exploit social cues, and show they learn better word-topic mappings on data with social cues than when social cues are removed
▶ no evidence that social cues improve word-segmentation accuracy
Our models learn relative importance of different social cues
▶ estimate probability of each cue occurring with “topical objects” and probability of each cue occurring with “non-topical objects”
▶ they do this in an unsupervised way, i.e., they are not told which objects are topical
▶ ablation tests show that eye-gaze is the most important social cue for learning word-topic mappings
60/68
Function words in word segmentation
Some psychologists believe children exploit function words in early language acquisition
▶ function words often are high frequency and phonologically simple ⇒ easy to learn?
▶ function words typically appear in phrase-peripheral positions ⇒ provide “anchors” for word and phrase segmentation
Modify word segmentation grammar to optionally generate sequences of mono-syllabic “function words” at collocation edges
▶ improves word segmentation f-score from 0.87 to 0.92
Model can learn directionality of function word attachment
▶ Bayes factor hypothesis test overwhelmingly prefers left to right function word attachment in English
61/68
Jointly learning word segmentation and phonological alternation
Word segmentation models so far don’t account for phonological alternations (e.g., final devoicing, /t/-deletion, etc.)
▶ fundamental operation in CFG is string concatenation
▶ no principled reason why adaptor grammars can’t be combined with phonological operations
Borschinger, Johnson and Demuth (2013) generalises Goldwater’s bigram word segmentation model to allow word-final /t/-deletion
▶ applied to Buckeye corpus of adult speech
Current work: incorporate a MaxEnt “Harmony theory” model of phonological alternation
62/68
On-line learning using particle filters
The Adaptor Grammar software uses a batch algorithm that repeatedly parses the data
▶ in principle, all the algorithm requires is a source of random samples from the training data
⇒ Adaptor Grammars can be learned on-line
Particle filters are a standard technique for on-line Bayesian inference
▶ a particle filter updates multiple analyses in parallel
Borschinger and Johnson (2011, 2012) explore on-line particle filter algorithms for Goldwater’s bigram model
▶ a particle filter needs tens of thousands of particles to approach the Metropolis-within-Gibbs algorithm used for Adaptor Grammars here
▶ adding a rejuvenation step reduces the number of particles needed dramatically
63/68
Outline
Introduction
Limitations of PCFGs
Modelling lexical acquisition with adaptor grammars
Synergies in learning syllables and words
Topic models and identifying the referents of words
Extensions and applications of these models
Conclusion
64/68
Conclusions and future work
Joint learning often uses information in the input more effectively than staged learning
▶ Learning syllable structure and word segmentation
▶ Learning word-topic associations and word segmentation
Do children exploit such synergies in language acquisition?
Adaptor grammars are a flexible framework for stating non-parametric hierarchical Bayesian models
▶ the accuracies obtained here are the best reported in the literature
Future work: make the models more realistic
▶ extend expressive power of AGs (e.g., incorporating MaxEnt/Harmony-theory components)
▶ richer data (e.g., more non-linguistic context)
▶ more realistic data (e.g., phonological variation)
65/68
How specific should our computational models be?
Marr’s (1982) three levels of computational models:
▶ computational level (inputs, outputs and relation between them)
▶ algorithmic level (steps involved in mapping from input to output)
▶ implementation level (physical processes involved)
Algorithmic-level models are extremely popular, but I think we should focus on computational-level models first
▶ we know almost nothing about how hierarchical structures are represented and manipulated in the brain
⇒ we know almost nothing about which data structures and operations are neurologically plausible
▶ current models only explain a tiny fraction of language processing or acquisition
▶ typically computational models can be extended, while algorithms need to be completely changed
⇒ today’s computational models have a greater chance of being relevant than today’s algorithms
66/68
Why a child’s learning algorithm may be nothing like our algorithms
Enormous differences in “hardware” ⇒ different feasible algorithms
As scientists we need generic learning algorithms, but a child only needs a specific learning algorithm
▶ as scientists we want to study the effects of different modelling assumptions on learning
⇒ we need generic algorithms that work for a range of different models, so we can compare them
▶ a child only needs an algorithm that works for whatever model they have
⇒ the child’s algorithm might be specialised to their model, and need not work at all for other kinds of models
The field of machine learning has developed many generic learning algorithms: Expectation-maximisation, variational Bayes, Markov chain Monte Carlo, Gibbs samplers, particle filters, . . .
67/68
The future of Bayesian models of language acquisition
P(Grammar | Data) ∝ P(Data | Grammar) × P(Grammar)
   (Posterior)        (Likelihood)        (Prior)
So far our grammars and priors don’t encode much linguistic knowledge, but in principle they can!
▶ how do we represent this knowledge?
▶ how can we learn efficiently using this knowledge?
Should permit us to empirically investigate effects of specific universals on the course of language acquisition
My guess: the interaction between innate knowledge and learning will be richer and more interesting than either the rationalists or empiricists currently imagine!
68/68