
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6429–6441, Florence, Italy, July 28 – August 2, 2019. © 2019 Association for Computational Linguistics


Learning to Discover, Ground and Use Words with Segmental Neural Language Models

Kazuya Kawakami♠♣   Chris Dyer♣   Phil Blunsom♠♣
♠Department of Computer Science, University of Oxford, Oxford, UK
♣DeepMind, London, UK
{kawakamik,cdyer,pblunsom}@google.com

Abstract

We propose a segmental neural language model that combines the generalization power of neural networks with the ability to discover word-like units that are latent in unsegmented character sequences. In contrast to previous segmentation models that treat word segmentation as an isolated task, our model unifies word discovery, learning how words fit together to form sentences, and, by conditioning the model on visual context, how words' meanings ground in representations of non-linguistic modalities. Experiments show that the unconditional model learns predictive distributions better than character LSTM models, discovers words competitively with nonparametric Bayesian word segmentation models, and that modeling language conditional on visual context improves performance on both.

1 Introduction

How infants discover the words that make up their first language is a long-standing question in developmental psychology (Saffran et al., 1996). Machine learning has contributed much to this discussion by showing that predictive models of language are capable of inferring the existence of word boundaries solely based on statistical properties of the input (Elman, 1990; Brent and Cartwright, 1996; Goldwater et al., 2009). However, there are two serious limitations of current models of word learning in the context of the broader problem of language acquisition. First, language acquisition involves not only learning what words there are ("the lexicon"), but also how they fit together ("the grammar"). Unfortunately, the best language models, measured in terms of their ability to predict language (i.e., those which seem to acquire grammar best), segment quite poorly (Chung et al., 2017; Wang et al., 2017; Kádár et al., 2018), while the strongest models in terms of word segmentation (Goldwater et al., 2009; Berg-Kirkpatrick et al., 2010) do not adequately account for the long-range dependencies that are manifest in language and that are easily captured by recurrent neural networks (Mikolov et al., 2010). Second, word learning involves not only discovering what words exist and how they fit together grammatically, but also determining their non-linguistic referents, that is, their grounding. The work that has looked at modeling acquisition of grounded language from character sequences, usually in the context of linking words to a visually experienced environment, has either explicitly avoided modeling word units (Gelderloos and Chrupała, 2016) or relied on high-level representations of visual context that overly simplify the richness and ambiguity of the visual signal (Johnson et al., 2010; Räsänen and Rasilo, 2015).

In this paper we introduce a single model that discovers words, learns how they fit together (not just locally, but across a complete sentence), and grounds them in learned representations of naturalistic non-linguistic visual contexts. We argue that such a unified model is preferable to a pipeline model of language acquisition (e.g., a model where words are learned by one character-aware model, and then a full-sentence grammar is acquired by a second language model using the words predicted by the first). Our preference for the unified model may be expressed in terms of basic notions of simplicity (we require one model rather than two), and in terms of the Continuity Hypothesis of Pinker (1984), which argues that we should assume, absent strong evidence to the contrary, that children have the same cognitive systems as adults, and differences are due to them having set their parameters differently/immaturely.

In §2 we introduce a neural model of sentences that explicitly discovers and models word-like units from completely unsegmented sequences of characters. Since it is a model of complete sentences (rather than just a word discovery model), and it can incorporate multimodal conditioning context (rather than just modeling language unconditionally), it avoids the two continuity problems identified above. Our model operates by generating text as a sequence of segments, where each segment is generated either character-by-character from a sequence model or as a single draw from a lexical memory of multi-character units. The segmentation decisions and the decisions about how to generate words are not observed in the training data and are marginalized during learning using a dynamic programming algorithm (§3).

Our model depends crucially on two components. The first is, as mentioned, a lexical memory. This lexicon stores pairs of a vector (key) and a string (value); the strings in the lexicon are contiguous sequences of characters encountered in the training data, and the vectors are randomly initialized and learned during training. The second component is a regularizer (§4) that prevents the model from overfitting to the training data by overusing the lexicon to account for the training data.¹

Our evaluation (§5–§7) looks at both language modeling performance and the quality of the induced segmentations, in both unconditional (sequence-only) contexts and when conditioning on a related image. First, we look at the segmentations induced by our model. We find that these correspond closely to human intuitions about word segments, competitive with the best existing models for unsupervised word discovery. Importantly, these segments are obtained in models whose hyperparameters are tuned to optimize validation (held-out) likelihood, whereas tuning the hyperparameters of our benchmark models using held-out likelihood produces poor segmentations. Second, we confirm findings (Kawakami et al., 2017; Mielke and Eisner, 2018) that show that word segmentation information leads to better language models compared to pure character models. However, in contrast to previous work, we realize this performance improvement without having to observe the segment boundaries. Thus, our model may be applied straightforwardly to Chinese, where word boundaries are not part of the orthography.

¹ Since the lexical memory stores strings that appear in the training data, each sentence could in principle be generated as a single lexical unit; thus the model could fit the training data perfectly while generalizing poorly. The regularizer penalizes based on the expectation of the powered length of each segment, preventing this degenerate solution from being optimal.

Ablation studies demonstrate that both the lexicon and the regularizer are crucial for good performance, particularly in word segmentation: removing either or both significantly harms performance. In a final experiment, we learn to model language that describes images, and we find that conditioning on visual context improves segmentation performance in our model (compared to the performance when the model does not have access to the image). On the other hand, in a baseline model that predicts boundaries based on entropy spikes in a character-LSTM, making the image available to the model has no impact on the quality of the induced segments, demonstrating again the value of explicitly including a word lexicon in the language model.

2 Model

We now describe the segmental neural language model (SNLM); refer to Figure 1 for an illustration. The SNLM generates a character sequence x = x_1, …, x_n, where each x_i is a character in a finite character set Σ. Each sequence x is the concatenation of a sequence of segments s = s_1, …, s_{|s|}, where |s| ≤ n measures the length of the sequence in segments, and each segment s_i ∈ Σ⁺ is a sequence of characters s_{i,1}, …, s_{i,|s_i|}. Intuitively, each s_i corresponds to one word. Let π(s_1, …, s_i) represent the concatenation of the characters of the segments s_1 to s_i, discarding segmentation information; thus x = π(s). For example, if x = anapple, the underlying segmentation might be s = an apple (with s_1 = an and s_2 = apple), or s = a nap ple, or any of the 2^{|x|−1} segmentation possibilities for x.
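To make the combinatorics concrete, the following short Python sketch (illustrative only; the function name is ours, not the paper's) enumerates all 2^{|x|−1} segmentations of a character string and checks that concatenating any of them recovers x.

from itertools import product

def segmentations(x):
    """Yield every way of splitting the string x into contiguous segments.

    There is one binary choice (boundary or no boundary) after each of the
    first len(x) - 1 characters, giving 2 ** (len(x) - 1) segmentations.
    """
    for choices in product([False, True], repeat=len(x) - 1):
        segments, start = [], 0
        for i, boundary in enumerate(choices, start=1):
            if boundary:
                segments.append(x[start:i])
                start = i
        segments.append(x[start:])
        yield segments

x = "anapple"
segs = list(segmentations(x))
assert len(segs) == 2 ** (len(x) - 1)          # 64 segmentations
assert all("".join(s) == x for s in segs)      # pi(s) = x for every s
print(segs[:3])   # e.g. [['anapple'], ['anappl', 'e'], ...]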

The SNLM defines the distribution over x as the marginal distribution over all segmentations that give rise to x, i.e.,

p(x) = Σ_{s : π(s) = x} p(s).   (1)

To define the probability of p(s), we use the chain rule, rewriting this in terms of a product of the series of conditional probabilities p(s_t | s_{<t}). The process stops when a special end-sequence segment ⟨S⟩ is generated. To ensure that the summation in Eq. 1 is tractable, we assume the following:

p(s_t | s_{<t}) ≈ p(s_t | π(s_{<t})) = p(s_t | x_{<t}),   (2)

which amounts to a conditional semi-Markov assumption, i.e., non-Markovian generation happens inside each segment, but the segment generation probability does not depend on memory of the previous segmentation decisions, only upon the sequence of characters π(s_{<t}) corresponding to the prefix character sequence x_{<t}. This assumption has been employed in a number of related models to permit the use of LSTMs to represent rich history while retaining the convenience of dynamic programming inference algorithms (Wang et al., 2017; Ling et al., 2017; Graves, 2012).

Figure 1: Fragment of the segmental neural language model while evaluating the marginal likelihood of a sequence. At the indicated time the model has generated the sequence Canyou, and four possible continuations are shown.

2.1 Segment generation

We model p(s_t | x_{<t}) as a mixture of two models, one that generates the segment using a sequence model and the other that generates multi-character sequences as a single event. Both are conditional on a common representation of the history, as is the mixture proportion.

Representing history. To represent x_{<t}, we use an LSTM encoder to read the sequence of characters, where each character type σ ∈ Σ has a learned vector embedding v_σ. Thus the history representation at time t is h_t = LSTM_enc(v_{x_1}, …, v_{x_t}). This corresponds to the standard history representation for a character-level language model, although in general we assume that our modelled data is not delimited by whitespace.

Character-by-character generation. The first component model, p_char(s_t | h_t), generates s_t by sampling a sequence of characters from an LSTM language model over Σ and two extra special symbols, an end-of-word symbol ⟨W⟩ ∉ Σ and the end-of-sequence symbol ⟨S⟩ discussed above. The initial state of the LSTM is a learned transformation of h_t, the initial cell is 0, and different parameters than the history-encoding LSTM are used. During generation, each letter that is sampled (i.e., each s_{t,i}) is fed back into the LSTM in the usual way, and the probability of the character sequence decomposes according to the chain rule. The end-of-sequence symbol can never be generated in the initial position.

Lexical generation. The second component model, p_lex(s_t | h_t), samples full segments from lexical memory. Lexical memory is a key-value memory containing M entries, where each key k_i, a vector, is associated with a value v_i ∈ Σ⁺. The generation probability of s_t is defined as

h′_t = MLP(h_t)
m = softmax(K h′_t + b)
p_lex(s_t | h_t) = Σ_{i=1}^{M} m_i [v_i = s_t],

where [v_i = s_t] is 1 if the i-th value in memory is s_t and 0 otherwise, and K is a matrix obtained by stacking the k_i^⊤'s. This generation process assigns zero probability to most strings, but the alternate character model can generate all of Σ⁺.

In this work, we fix the v_i's to be subsequences of at least length 2 and up to a maximum length L that are observed at least F times in the training data. These values are tuned as hyperparameters (see Appendix C for details of the experiments).
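As a rough illustration of how such a memory could be populated (the function and thresholds below are hypothetical stand-ins; the paper tunes L and F as hyperparameters, see Appendix C), the following sketch counts all character n-grams of length 2..L in an unsegmented training corpus and keeps those seen at least F times.

from collections import Counter

def build_lexicon(corpus, max_len=10, min_freq=100):
    """Collect candidate lexicon values: substrings of length 2..max_len
    that occur at least min_freq times in the (unsegmented) training lines."""
    counts = Counter()
    for line in corpus:
        n = len(line)
        for length in range(2, max_len + 1):
            for start in range(n - length + 1):
                counts[line[start:start + length]] += 1
    return [s for s, c in counts.items() if c >= min_freq]

# Toy usage with made-up data; real corpora are unsegmented character strings.
corpus = ["doyouseetheboy", "doyoulikethedog"] * 60
lexicon = build_lexicon(corpus, max_len=5, min_freq=100)
print(sorted(lexicon)[:10])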

Mixture proportion. The mixture proportion, g_t, determines how likely the character generator is to be used at time t (the lexicon is used with probability 1 − g_t). It is defined as g_t = σ(MLP(h_t)).

Total segment probability. The total generation probability of s_t is thus

p(s_t | x_{<t}) = g_t p_char(s_t | h_t) + (1 − g_t) p_lex(s_t | h_t).
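Putting the pieces of §2.1 together, here is a minimal NumPy sketch of the segment distribution: a randomly initialized, untrained stand-in for MLP(h_t) produces the memory query and the gate g_t, the memory contributes p_lex via a softmax over keys, and a placeholder character model contributes p_char. All parameter shapes and the p_char stub are our assumptions, not the paper's implementation.

import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # toy hidden size
values = ["an", "apple", "nap"]         # lexicon values v_i
K = rng.normal(size=(len(values), D))   # memory keys, one row per entry
b = np.zeros(len(values))
W_q, W_g = rng.normal(size=(D, D)), rng.normal(size=(D,))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def p_char(segment, h):
    """Placeholder character-level model; the real model is an LSTM over
    Sigma plus <W>/<S>. Here: a fixed per-character probability."""
    return 0.05 ** len(segment) * 0.5   # characters, then end-of-word

def p_segment(segment, h):
    h_q = np.tanh(W_q @ h)                # stand-in for MLP(h_t)
    m = softmax(K @ h_q + b)              # distribution over memory entries
    p_lex = sum(m[i] for i, v in enumerate(values) if v == segment)
    g = 1.0 / (1.0 + np.exp(-(W_g @ h)))  # mixture proportion g_t
    return g * p_char(segment, h) + (1.0 - g) * p_lex

h_t = rng.normal(size=D)                  # pretend history encoding of x_<t
for s in ["an", "apple", "nap", "xyz"]:
    print(s, p_segment(s, h_t))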

3 Inference

We are interested in two inference questions: first, given a sequence x, evaluate its (log) marginal likelihood; second, given x, find the most likely decomposition into segments s*.

Marginal likelihood. To efficiently compute the marginal likelihood, we use a variant of the forward algorithm for semi-Markov models (Yu, 2010), which incrementally computes a sequence of probabilities, α_i, where α_i is the marginal likelihood of generating x_{≤i} and concluding a segment at time i. Although there are an exponential number of segmentations of x, these values can be computed using O(|x|) space and O(|x|²) time as

α_0 = 1,   α_t = Σ_{j=t−L}^{t−1} α_j p(s = x_{j:t} | x_{<j}).   (3)

By letting x_{|x|+1} = ⟨S⟩, we obtain p(x) = α_{|x|+1}.

Most probable segmentation. The most probable segmentation of a sequence x can be computed by replacing the summation with a max operator in Eq. 3 and maintaining backpointers.
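The sketch below illustrates both dynamic programs of §3, using a stand-in segment_prob in place of the neural mixture and, for brevity, ignoring the end-of-sequence segment (so it computes the marginal over segmentations of x itself); all names are ours.

import numpy as np

def segment_prob(segment, prefix):
    """Placeholder for p(s = segment | x_<t); the real model uses the SNLM
    mixture from Section 2.1. Must return a probability in (0, 1]."""
    return (0.1 ** len(segment)) * 0.5

def marginal_likelihood(x, L):
    """alpha[t] = marginal probability of generating x[:t] and ending a
    segment at position t (Eq. 3, without the end-of-sequence symbol)."""
    n = len(x)
    alpha = np.zeros(n + 1)
    alpha[0] = 1.0
    for t in range(1, n + 1):
        for j in range(max(0, t - L), t):
            alpha[t] += alpha[j] * segment_prob(x[j:t], x[:j])
    return alpha[n]

def viterbi_segmentation(x, L):
    """Same recursion with max instead of sum, plus backpointers."""
    n = len(x)
    best = np.full(n + 1, -np.inf)
    best[0], back = 0.0, [0] * (n + 1)
    for t in range(1, n + 1):
        for j in range(max(0, t - L), t):
            score = best[j] + np.log(segment_prob(x[j:t], x[:j]))
            if score > best[t]:
                best[t], back[t] = score, j
    segments, t = [], n
    while t > 0:
        segments.append(x[back[t]:t])
        t = back[t]
    return segments[::-1]

x = "anapple"
print(marginal_likelihood(x, L=5))
print(viterbi_segmentation(x, L=5))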

4 Expected length regularization

When the lexical memory contains all the substrings in the training data, the model easily overfits by copying the longest continuation from the memory. To prevent overfitting, we introduce a regularizer that penalizes based on the expectation of the exponentiated (by a hyperparameter β) length of each segment:

R(x, β) = Σ_{s : π(s) = x} p(s | x) Σ_{s_i ∈ s} |s_i|^β.

This can be understood as a regularizer based on the double exponential prior identified to be effective in previous work (Liang and Klein, 2009; Berg-Kirkpatrick et al., 2010). This expectation is a differentiable function of the model parameters. Because of the linearity of the penalty across segments, it can be computed efficiently using the above dynamic programming algorithm under the expectation semiring (Eisner, 2002). This is particularly efficient, since the expectation semiring jointly computes the expectation and the marginal likelihood in a single forward pass. For more details about computing gradients of expectations under distributions over structured objects with dynamic programs and semirings, see Li and Eisner (2009).
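A sketch of how the expected length penalty can be accumulated in the same forward pass, pairing each probability with an expected-penalty term (the (p, r) bookkeeping of the first-order expectation semiring); segment_prob is a placeholder, and β and L are illustrative values.

def segment_prob(segment, prefix):
    # Placeholder for the SNLM segment probability p(s | x_<t).
    return (0.1 ** len(segment)) * 0.5

def expected_length_penalty(x, L, beta):
    """Forward pass over pairs (p, r): p accumulates marginal probability,
    r accumulates the sum over segmentations of P(s) * sum_i |s_i|**beta.
    Returns R(x, beta) = r / p, the expectation under p(s | x)."""
    n = len(x)
    p = [0.0] * (n + 1)
    r = [0.0] * (n + 1)
    p[0] = 1.0
    for t in range(1, n + 1):
        for j in range(max(0, t - L), t):
            w = segment_prob(x[j:t], x[:j])
            penalty = (t - j) ** beta          # |s_i| ** beta for this segment
            # (p, r) (x) (w, w * penalty): probabilities multiply, penalties
            # accumulate weighted by probability; (+) is componentwise sum.
            p[t] += p[j] * w
            r[t] += r[j] * w + p[j] * w * penalty
    return r[n] / p[n]

print(expected_length_penalty("anapple", L=5, beta=2.0))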

4.1 Training Objective

The model parameters are trained by minimizing the penalized log likelihood of a training corpus D of unsegmented sentences,

ℒ = Σ_{x ∈ D} [−log p(x) + λ R(x, β)].

5 Datasets

We evaluate our model on both English and Chinese segmentation. For both languages, we used standard datasets for word segmentation and language modeling. We also use MS-COCO to evaluate how the model can leverage conditioning context information. For all datasets, we used train, validation and test splits.² Since our model assumes a closed character set, we removed validation and test samples which contain characters that do not appear in the training set. In the English corpora, whitespace characters are removed. In Chinese, they are not present to begin with. Refer to Appendix A for dataset statistics.

5.1 English

Brent Corpus. The Brent corpus is a standard corpus used in statistical modeling of child language acquisition (Brent, 1999; Venkataraman, 2001).³ The corpus contains transcriptions of utterances directed at 13- to 23-month-old children. The corpus has two variants: an orthographic one (BR-text) and a phonemic one (BR-phono), where each character corresponds to a single English phoneme. As the Brent corpus does not have a standard train and test split, and we want to tune the parameters by measuring the fit to held-out data, we used the first 80% of the utterances for training, the next 10% for validation, and the rest for test.

² The data and splits used are available at https://s3.eu-west-2.amazonaws.com/k-kawakami/seg.zip

³ https://childes.talkbank.org/derived


English Penn Treebank (PTB). We use the commonly used version of the PTB prepared by Mikolov et al. (2010). However, since we removed space symbols from the corpus, our cross entropy results cannot be compared to those usually reported on this dataset.

5.2 Chinese

Since Chinese orthography does not mark spaces between words, there have been a number of efforts to annotate word boundaries. We evaluate against two corpora that have been manually segmented according to different segmentation standards.

Beijing University Corpus (PKU). The Beijing University Corpus was one of the corpora used for the International Chinese Word Segmentation Bakeoff (Emerson, 2005).

Chinese Penn Treebank (CTB). We use the Penn Chinese Treebank Version 5.1 (Xue et al., 2005). It generally has a coarser segmentation than PKU (e.g., in CTB a full name, consisting of a given name and a family name, is a single token), and it is a larger corpus.

5.3 Image Caption Dataset

To assess whether jointly learning about the meanings of words from non-linguistic context affects segmentation performance, we use image and caption pairs from the COCO caption dataset (Lin et al., 2014). We use 10,000 examples for both training and testing, and we only use one reference per image. The images are used as conditioning context to predict the captions. Refer to Appendix B for the dataset construction process.

6 Experiments

We compare our model to benchmark Bayesian models, which are currently the best known unsupervised word discovery models, as well as to a simple deterministic segmentation criterion based on surprisal peaks (Elman, 1990), on language modeling and segmentation performance. Although the Bayesian models have been shown to be able to discover plausible word-like units, we found that a set of hyperparameters that provides the best language modeling performance with such a model does not produce good structures, as reported in previous works. This is problematic, since there are no objective criteria for finding hyperparameters in a fully unsupervised manner when the model is applied to completely unknown languages or domains. Thus, our experiments are designed to assess how well the models infer word segmentations of unsegmented inputs when they are trained and tuned to maximize the likelihood of the held-out text.

DP/HDP Benchmarks. Among the most effective existing word segmentation models are those based on hierarchical Dirichlet process (HDP) models (Goldwater et al., 2009; Teh et al., 2006) and hierarchical Pitman–Yor processes (Mochihashi et al., 2009). As a representative of these, we use a simple bigram HDP model:

θ_· ~ DP(α_0, p_0)
θ_{·|s} ~ DP(α_1, θ_·),  ∀s ∈ Σ*
s_{t+1} | s_t ~ Categorical(θ_{·|s_t}).

The base distribution, p_0, is defined over strings in Σ* ∪ {⟨S⟩} by deciding with a specified probability to end the utterance, a geometric length model, and a uniform probability over Σ at each position. Intuitively, it captures the preference for having short words in the lexicon. In addition to the HDP model, we also evaluate a simpler, single Dirichlet process (DP) version of the model, in which the s_t's are generated directly as draws from Categorical(θ_·). We use an empirical Bayesian approach to select hyperparameters based on the likelihood assigned by the inferred posterior to a held-out validation set. Refer to Appendix D for details on inference.
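One plausible reading of this base distribution is sketched below; the exact parameterization is not fully specified in the text, so treat the probabilities and their factorization as assumptions: with probability p_end the draw is the end-of-utterance symbol, and otherwise the word's length is geometric and each character is uniform over Σ.

def p0(segment, sigma_size, p_end=0.1, p_stop=0.5, END="<S>"):
    """A possible base distribution over Sigma* plus the end symbol:
    p_end mass on END; otherwise a geometric length model (stopping with
    probability p_stop after each character) and uniform characters."""
    if segment == END:
        return p_end
    k = len(segment)
    length_prob = ((1.0 - p_stop) ** (k - 1)) * p_stop   # geometric, k >= 1
    char_prob = (1.0 / sigma_size) ** k                  # uniform over Sigma
    return (1.0 - p_end) * length_prob * char_prob

print(p0("apple", sigma_size=26))
print(p0("<S>", sigma_size=26))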

Deterministic Baselines. Incremental word segmentation is inherently ambiguous (e.g., the letters the might be a single word, or they might be the beginning of the longer word theater). Nevertheless, several deterministic functions of prefixes have been proposed in the literature as strategies for discovering rudimentary word-like units, hypothesized to be useful for bootstrapping the lexical acquisition process or for improving a model's predictive accuracy. These range from surprisal criteria (Elman, 1990) to sophisticated language models that switch between models that capture intra- and inter-word dynamics based on deterministic functions of prefixes of characters (Chung et al., 2017; Shen et al., 2018).

In our experiments, we also include such deterministic segmentation results using (1) the surprisal criterion of Elman (1990) and (2) a two-level hierarchical multiscale LSTM (Chung et al., 2017), which has been shown to predict boundaries in whitespace-containing character sequences at positions corresponding to word boundaries. As with all experiments in this paper, the BR corpora for this experiment do not contain spaces.
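One simple instantiation of the surprisal criterion (our own formulation, not necessarily the paper's exact rule): given per-character surprisals from any language model, place a boundary wherever surprisal rises relative to the preceding character, i.e., at local increases in prediction difficulty.

def surprisal_boundaries(surprisals):
    """Given surprisal[i] = -log p(x_i | x_<i) for each character, propose a
    word boundary before character i whenever surprisal rises there
    (surprisal[i] > surprisal[i - 1])."""
    return [i for i in range(1, len(surprisals))
            if surprisals[i] > surprisals[i - 1]]

def apply_boundaries(text, boundaries):
    out, prev = [], 0
    for b in sorted(boundaries):
        out.append(text[prev:b])
        prev = b
    out.append(text[prev:])
    return out

# Toy surprisals (in bits) for "doyousee"; higher values at plausible onsets.
text = "doyousee"
surprisals = [4.0, 1.2, 3.8, 1.0, 0.9, 3.5, 1.1, 0.8]
print(apply_boundaries(text, surprisal_boundaries(surprisals)))   # do you see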

SNLM Model configurations and Evaluation. LSTMs had 512 hidden units, with parameters learned using the Adam update rule (Kingma and Ba, 2015). We evaluated our models with bits-per-character (bpc) and segmentation accuracy (Brent, 1999; Venkataraman, 2001; Goldwater et al., 2009). Refer to Appendices C–F for details of model configurations and evaluation metrics.

For the image caption dataset, we extend the model with a standard attention mechanism in the backbone LSTM (LSTM_enc) to incorporate image context. For every character input, the model calculates attention over image features and uses it to predict the next characters. As image representations, we use features from the last convolution layer of a pre-trained VGG19 model (Simonyan and Zisserman, 2014).
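The paper only states that a standard attention mechanism over the final VGG19 convolutional features is used, so the following NumPy sketch of per-character dot-product attention over a grid of image features is a generic illustration; the shapes, projection, and scoring function are our assumptions.

import numpy as np

rng = np.random.default_rng(0)
H, R, D = 512, 14 * 14, 512              # hidden size, #regions, feature dim
V = rng.normal(size=(R, D))              # e.g. flattened VGG19 conv features
W_img = rng.normal(size=(D, H)) * 0.01   # projects features into LSTM space

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(h_t):
    """Score every image region against the character-LSTM state h_t and
    return a convex combination of (projected) region features."""
    proj = V @ W_img                      # (R, H)
    scores = proj @ h_t                   # (R,)
    weights = softmax(scores)
    return weights @ proj                 # (H,) context vector for this step

h_t = rng.normal(size=H)                  # current character-level state
context = attend(h_t)                     # combined with the input when
print(context.shape)                      # predicting the next character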

7 Results

In this section, we first do a careful comparison of segmentation performance on the phonemic Brent corpus (BR-phono) across several different segmentation baselines, and we find that our model obtains competitive segmentation performance. Additionally, ablation experiments demonstrate that both lexical memory and the proposed expected length regularization are necessary for inferring good segmentations. We then show that, also on other corpora, we likewise obtain segmentations better than the baseline models. Finally, we also show that our model has superior performance in terms of held-out perplexity compared to a character-level LSTM language model. Thus, overall, our results show that we can obtain good segmentations on a variety of tasks, while still having very good language modeling performance.

Word Segmentation (BR-phono). Table 1 summarizes the segmentation results on the widely used BR-phono corpus, comparing our model to a variety of baselines. Unigram DP, Bigram HDP, LSTM surprisal and HMLSTM refer to the benchmark models explained in §6. The ablated versions of our model show that without the lexicon (−memory), without the expected length penalty (−length), and without either, our model fails to discover good segmentations. Furthermore, we draw attention to the difference in the performance of the HDP and DP models when using subjective settings of the hyperparameters and the empirical settings (likelihood). Finally, the deterministic baselines are interesting in two ways. First, LSTM surprisal is a remarkably good heuristic for segmenting text (although we will see below that its performance is much less good on other datasets). Second, despite careful tuning, the HMLSTM of Chung et al. (2017) fails to discover good segments, although in their paper they show that when spaces are present between words, HMLSTMs learn to switch between their internal models in response to them.

Furthermore, the priors used in the DP/HDP models were tuned to maximize the likelihood assigned to the validation set by the inferred posterior predictive distribution, in contrast to previous papers, which either set them subjectively or inferred them (Johnson and Goldwater, 2009). For example, the DP and HDP models with subjective priors obtained 53.8 and 72.3 F1 scores, respectively (Goldwater et al., 2009). However, when the hyperparameters are set to maximize held-out likelihood, the obtained scores are 56.1 and 56.9. Another result on this dataset is the feature unigram model of Berg-Kirkpatrick et al. (2010), which obtains an 88.0 F1 score with hand-crafted features and by selecting the regularization strength to optimize segmentation performance. Once the features are removed, the model achieves a 71.5 F1 score when it is tuned on segmentation performance, and only 11.5 when it is tuned on held-out likelihood.

                                   P      R     F1
LSTM surprisal (Elman, 1990)     54.5   55.5   55.0
HMLSTM (Chung et al., 2017)       8.1   13.3   10.1
Unigram DP                       63.3   50.4   56.1
Bigram HDP                       53.0   61.4   56.9
SNLM (−memory, −length)          54.3   34.9   42.5
SNLM (+memory, −length)          52.4   36.8   43.3
SNLM (−memory, +length)          57.6   43.4   49.5
SNLM (+memory, +length)          81.3   77.5   79.3

Table 1: Summary of segmentation performance on the phoneme version of the Brent Corpus (BR-phono).

Word Segmentation (other corpora). Table 2 summarizes results on the BR-text (orthographic Brent corpus) and Chinese corpora. As in the previous section, all the models were trained to maximize held-out likelihood. Here we observe a similar pattern, with the SNLM outperforming the baseline models, despite the tasks being quite different from each other and from the BR-phono task.

                          P      R     F1
BR-text
  LSTM surprisal        36.4   49.0   41.7
  Unigram DP            64.9   55.7   60.0
  Bigram HDP            52.5   63.1   57.3
  SNLM                  68.7   78.9   73.5
PTB
  LSTM surprisal        27.3   36.5   31.2
  Unigram DP            51.0   49.1   50.0
  Bigram HDP            34.8   47.3   40.1
  SNLM                  54.1   60.1   56.9
CTB
  LSTM surprisal        41.6   25.6   31.7
  Unigram DP            61.8   49.6   55.0
  Bigram HDP            67.3   67.7   67.5
  SNLM                  78.1   81.5   79.8
PKU
  LSTM surprisal        38.1   23.0   28.7
  Unigram DP            60.2   48.2   53.6
  Bigram HDP            66.8   67.1   66.9
  SNLM                  75.0   71.2   73.1

Table 2: Summary of segmentation performance on other corpora.

Word Segmentation, Qualitative Analysis. We show some representative examples of segmentations inferred by various models on the BR-text and PKU corpora in Table 3. As reported in Goldwater et al. (2009), we observe that the DP models tend to undersegment, keeping long, frequent sequences together (e.g., they failed to separate articles). HDPs do successfully prevent oversegmentation; however, we find that when trained to optimize held-out likelihood they often insert unnecessary boundaries between words, such as yo u. Our model's performance is better, but it likewise shows a tendency to oversegment. Interestingly, we can observe a tendency to put boundaries between morphemes in morphologically complex lexical items, such as dumpty 's and go ing. Since morphemes are the minimal units that carry meaning in language, this segmentation, while incorrect, is at least plausible. Turning to the Chinese examples, we see that both baseline models fail to discover basic words such as 山间 (mountain) and 人们 (human).

Finally, we observe that none of the models successfully segment dates or numbers containing multiple digits (all oversegment). Since number types tend to be rare, they are usually not in the lexicon, meaning our model (and the HDP baselines) must generate them as character sequences.

Language Modeling Performance. The above results show that the SNLM infers good word segmentations. We now turn to the question of how well it predicts held-out data. Table 4 summarizes the results of the language modeling experiments. Again, we see that the SNLM outperforms the Bayesian models and a character LSTM. Although there are numerous extensions to LSTMs to improve language modeling performance, LSTMs remain a strong baseline (Melis et al., 2018).

One might object that, because of the lexicon, the SNLM has many more parameters than the character-level LSTM baseline model. However, unlike the parameters in the LSTM recurrence, which are used at every timestep, our memory parameters are accessed very sparsely. Furthermore, we observed that an LSTM with twice the hidden units did not improve on the baseline with 512 hidden units on either the phonemic or the orthographic version of the Brent corpus, but the lexicon could. This result suggests that more hidden units are useful if the model does not have enough capacity to fit larger datasets, but that the memory structure adds other dynamics which are not captured by large recurrent networks.

Multimodal Word Segmentation. Finally, we discuss results on word discovery with non-linguistic context (image). Although there is much evidence that neural networks can reliably learn to exploit additional relevant context to improve language modeling performance (e.g., machine translation and image captioning), it is still unclear whether the conditioning context helps to discover structure in the data. We turn to this question here. Table 5 summarizes language modeling and segmentation performance of our model and a baseline character-LSTM language model on the COCO image caption dataset. We use the Elman entropy criterion to infer the segmentation points from the baseline LM, and the MAP segmentation under our model. Again, we find our model outperforms the baseline model in terms of both language modeling and word segmentation accuracy. Interestingly, we find that while conditioning on image context leads to reductions in perplexity in both models, in our model the presence of the image further improves segmentation accuracy. This suggests that our model and its learning mechanism interact with the conditional context differently than the LSTM does.


Examples

BR-text
  Reference:   are you going to make him pretty this morning
  Unigram DP:  areyou goingto makehim pretty this morning
  Bigram HDP:  areyou go ingto make him p retty this mo rn ing
  SNLM:        are you go ing to make him pretty this morning

  Reference:   would you like to do humpty dumpty's button
  Unigram DP:  wouldyoul iketo do humpty dumpty 's button
  Bigram HDP:  would youlike to do humptyd umpty 's butt on
  SNLM:        would you like to do humpty dumpty 's button

PKU
  Reference:   笑声 掌声 欢呼声 在 山间 回荡 勾 起 了 人们 对 往事 的 回忆
  Unigram DP:  笑声 掌声 欢呼 声 在 山 间 回荡 勾 起了 人们对 往事 的 回忆
  Bigram HDP:  笑 声 掌声 欢 呼声 在 山 间 回 荡 勾 起了 人 们对 往事 的 回忆
  SNLM:        笑声 掌声 欢呼声 在 山间 回荡 勾起 了 人们 对 往事 的 回忆

  Reference:   不得 在 江河 电缆 保护区 内 抛锚 拖锚 炸鱼 挖沙
  Unigram DP:  不得 在 江河电缆 保护 区内抛锚 拖锚 炸鱼挖沙
  Bigram HDP:  不得 在 江 河 电缆 保护 区内 抛 锚拖 锚 炸鱼 挖沙
  SNLM:        不得 在 江河 电缆 保护区 内 抛锚 拖锚 炸鱼 挖沙

Table 3: Examples of predicted segmentations on English and Chinese.

             BR-text  BR-phono   PTB    CTB    PKU
Unigram DP     2.33     2.93     2.25   6.16   6.88
Bigram HDP     1.96     2.55     1.80   5.40   6.42
LSTM           2.03     2.62     1.65   4.94   6.20
SNLM           1.94     2.54     1.56   4.84   5.89

Table 4: Test language modeling performance (bpc).


To understand what kind of improvements in segmentation performance the image context leads to, we annotated the tokens in the references with part-of-speech (POS) tags and compared the relative improvements in recall between SNLM (−image) and SNLM (+image) among the five POS tags which appear more than 10,000 times. We observed improvements on ADJ (+4.5), NOUN (+4.1), and VERB (+3.1). The improvements on the categories ADP (+0.5) and DET (+0.3) were more limited. The categories where we see the largest improvement in recall correspond to those that are likely a priori to correlate most reliably with observable features. Thus, this result is consistent with the hypothesis that the lexicon is successfully acquiring knowledge about how words idiosyncratically link to visual features.


                  bpc↓    P↑     R↑    F1↑
Unigram DP        2.23   44.0   40.0   41.9
Bigram HDP        1.68   30.9   40.8   35.1
LSTM (−image)     1.55   31.3   38.2   34.4
SNLM (−image)     1.52   39.8   55.3   46.3
LSTM (+image)     1.42   31.7   39.1   35.0
SNLM (+image)     1.38   46.4   62.0   53.1

Table 5: Language modeling (bpc) and segmentation accuracy on the COCO dataset. +image indicates that the model has access to image context.

Segmentation State-of-the-Art. The results reported are not the best-reported numbers on the English phoneme or Chinese segmentation tasks. As we discussed in the introduction, previous work has focused on segmentation in isolation from language modeling performance. Models that obtain better segmentations include the adaptor grammars (F1: 87.0) of Johnson and Goldwater (2009) and the feature-unigram model (88.0) of Berg-Kirkpatrick et al. (2010). While these results are better in terms of segmentation, they are weak language models (the feature unigram model is effectively a unigram word model, the adaptor grammar model is effectively a phrasal unigram model, and both are incapable of generalizing about substantially non-local dependencies). Additionally, the features and grammars used in prior work reflect certain English-specific design considerations (e.g., syllable structure in the case of adaptor grammars and phonotactic equivalence classes in the feature unigram model), which make them questionable models if the goal is to explore what models and biases enable word discovery in general. For Chinese, the best nonparametric models perform better at segmentation (Zhao and Kit, 2008; Mochihashi et al., 2009), but again they are weaker language models than neural models. The neural model of Sun and Deng (2018) is similar to our model without lexical memory or length regularization; it obtains 80.2 F1 on the PKU dataset; however, it uses gold segmentation data during training and hyperparameter selection,⁴ whereas our approach requires no gold standard segmentation data.

8 Related Work

Learning to discover and represent temporally extended structures in a sequence is a fundamental problem in many fields. For example, in language processing, unsupervised learning of multiple levels of linguistic structure, such as morphemes (Snyder and Barzilay, 2008), words (Goldwater et al., 2009; Mochihashi et al., 2009; Wang et al., 2014) and phrases (Klein and Manning, 2001), has been investigated. Recently, speech recognition has benefited from techniques that enable the discovery of subword units (Chan et al., 2017; Wang et al., 2017); however, in that work, the optimally discovered character sequences look quite unlike orthographic words. In fact, the model proposed by Wang et al. (2017) is essentially our model without a lexicon or the expected length regularization, i.e., (−memory, −length), which we have shown performs quite poorly in terms of segmentation accuracy. Finally, some prior work has also sought to discover lexical units directly from speech based on speech-internal statistical regularities (Kamper et al., 2016), as well as jointly with grounding (Chrupała et al., 2017).

9 Conclusion

Word discovery is a fundamental problem in language acquisition. While work studying the problem in isolation has provided valuable insights (showing both what data is sufficient for word discovery and with which models), this paper shows that neural models offer the flexibility and performance to productively study the various facets of the problem in a more unified model. While this work unifies several components that had previously been studied in isolation, our model assumes access to phonetic categories. The development of these categories likely interacts with the development of the lexicon and the acquisition of semantics (Feldman et al., 2013; Fourtassi and Dupoux, 2014), and thus subsequent work should seek to unify more aspects of the acquisition problem.

⁴ https://github.com/Edward-Sun/SLM/blob/d37ad735a7b1d5af430b96677c2ecf37a65f59b7/codes/run.py#L329

Acknowledgments

We thank Mark Johnson, Sharon Goldwater and Emmanuel Dupoux, as well as many colleagues at DeepMind, for their insightful comments and suggestions for improving this work and the resulting paper.

References

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proc. NAACL.

Michael R. Brent. 1999. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34(1):71–105.

Michael R. Brent and Timothy A. Cartwright. 1996. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61(1):93–125.

William Chan, Yu Zhang, Quoc Le, and Navdeep Jaitly. 2017. Latent sequence decompositions. In Proc. ICLR.

Grzegorz Chrupała, Lieke Gelderloos, and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical multiscale recurrent neural networks. In Proc. ICLR.

Jason Eisner. 2002. Parameter estimation for probabilistic finite-state transducers. In Proc. ACL.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Thomas Emerson. 2005. The second international Chinese word segmentation bakeoff. In Proc. SIGHAN Workshop.

Naomi H. Feldman, Thomas L. Griffiths, Sharon Goldwater, and James L. Morgan. 2013. A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4):751–778.

Abdellah Fourtassi and Emmanuel Dupoux. 2014. A rudimentary lexicon and semantics help bootstrap phoneme acquisition. In Proc. EMNLP.

Lieke Gelderloos and Grzegorz Chrupała. 2016. From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning. In Proc. COLING.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Mark Johnson, Katherine Demuth, Michael Frank, and Bevan K. Jones. 2010. Synergies in learning words and their referents. In Proc. NIPS.

Mark Johnson and Sharon Goldwater. 2009. Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proc. NAACL, pages 317–325.

Ákos Kádár, Marc-Alexandre Côté, Grzegorz Chrupała, and Afra Alishahi. 2018. Revisiting the hierarchical multiscale LSTM. In Proc. COLING.

Herman Kamper, Aren Jansen, and Sharon Goldwater. 2016. Unsupervised word segmentation and lexicon discovery using acoustic word embeddings. IEEE Transactions on Audio, Speech and Language Processing, 24(4):669–679.

Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2017. Learning to create and reuse words in open-vocabulary neural language modeling. In Proc. ACL.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. ICLR.

Dan Klein and Christopher D. Manning. 2001. Distributional phrase structure induction. In Workshop Proc. ACL.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proc. EMNLP.

Percy Liang and Dan Klein. 2009. Online EM for unsupervised models. In Proc. NAACL.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proc. ECCV, pages 740–755.

Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Andrew Senior, Fumin Wang, and Phil Blunsom. 2017. Latent predictor networks for code generation. In Proc. ACL.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In Proc. ICLR.

Sebastian J. Mielke and Jason Eisner. 2018. Spell once, summon anywhere: A two-level open-vocabulary language model. In Proc. NAACL.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proc. Interspeech.

Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman–Yor language modeling.

Steven Pinker. 1984. Language Learnability and Language Development. Harvard University Press.

Okko Räsänen and Heikki Rasilo. 2015. A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122(4):792–829.

Jenny R. Saffran, Richard N. Aslin, and Elissa L. Newport. 1996. Statistical learning by 8-month-old infants. Science, 274(5294):1926–1928.

Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018. Neural language modeling by jointly learning syntax and lexicon. In Proc. ICLR.

K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

Benjamin Snyder and Regina Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In Proc. ACL.

Zhiqing Sun and Zhi-Hong Deng. 2018. Unsupervised neural word segmentation for Chinese via segmental language modeling.

Yee-Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Anand Venkataraman. 2001. A statistical model for word discovery in transcribed speech. Computational Linguistics, 27(3):351–372.

Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohammad, Dengyong Zhou, and Li Deng. 2017. Sequence modeling via segmentations. In Proc. ICML.

Xiaolin Wang, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2014. Empirical study of unsupervised Chinese word segmentation methods for SMT on large-scale corpora.

Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Marta Palmer. 2005. The Penn Chinese treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Shun-Zheng Yu. 2010. Hidden semi-Markov models. Artificial Intelligence, 174(2):215–243.

Hai Zhao and Chunyu Kit. 2008. An empirical comparison of goodness measures for unsupervised Chinese word segmentation with a unified framework. In Proc. IJCNLP.


A Dataset Statistics

Table 6 summarizes dataset statistics.

B Image Caption Dataset Construction

We use 8,000, 2,000, and 10,000 images for the train, development, and test sets, in order of the integer ids specifying images in cocoapi,⁵ and use the first annotation provided for each image. We will make the pairs of image id and annotation id available from https://s3.eu-west-2.amazonaws.com/k-kawakami/seg.zip.

⁵ https://github.com/cocodataset/cocoapi

C SNLM Model Configuration

For each RNN-based model, we used 512 dimensions for the character embeddings, and the LSTMs have 512 hidden units. All the parameters, including the character projection parameters, are randomly sampled from a uniform distribution from −0.08 to 0.08. The initial hidden and memory states of the LSTMs are initialized with zero. A dropout rate of 0.5 was used for all but the recurrent connections.

To restrict the size of the memory, we stored substrings which appeared F times in the training corpora and tuned F with grid search. The maximum length of subsequences, L, was tuned on the held-out likelihood using a grid search. Table 7 summarizes the parameters for each dataset. Note that we did not tune the hyperparameters on segmentation quality, to ensure that the models are trained in a purely unsupervised manner, assuming no reference segmentations are available.

D DP/HDP Inference

By integrating out the draws from the DPs, it is possible to do inference using Gibbs sampling directly in the space of segmentation decisions. We use 1,000 iterations with annealing to find an approximation of the MAP segmentation, and then use the corresponding posterior predictive distribution to estimate the held-out likelihood assigned by the model, marginalizing the segmentations using appropriate dynamic programs. The evaluated segmentation was the most probable segmentation according to the posterior predictive distribution.

In the original Bayesian segmentation work, the hyperparameters (i.e., α_0, α_1, and the components of p_0) were selected subjectively. To make the comparison with our neural models fairer, we instead used an empirical approach and set them using the held-out likelihood of the validation set. However, since this disadvantages the DP/HDP models in terms of segmentation, we also report the original results on the BR corpora.

E Learning

The models were trained with the Adam update rule (Kingma and Ba, 2015) with a learning rate of 0.01. The learning rate is divided by 4 if there is no improvement on the development data. The maximum norm of the gradients was clipped at 1.0.

F Evaluation Metrics

Language Modeling. We evaluated our models with bits-per-character (bpc), a standard evaluation metric for character-level language models. Following the definition in Graves (2013), bits-per-character is the average value of −log₂ p(x_t | x_{<t}) over the whole test set,

bpc = −(1/|x|) log₂ p(x),

where |x| is the length of the corpus in characters. The bpc is reported on the test set.
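A small helper making the metric concrete (assuming the model exposes per-character log probabilities; the function name is ours):

import math

def bits_per_character(char_log2_probs):
    """bpc = -(1/|x|) * log2 p(x), where log2 p(x) is the sum of per-character
    log2-probabilities over the whole test corpus."""
    return -sum(char_log2_probs) / len(char_log2_probs)

# Toy example: a corpus of 4 characters, each predicted with probability 0.25.
probs = [0.25, 0.25, 0.25, 0.25]
print(bits_per_character([math.log2(p) for p in probs]))   # 2.0 bits/char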

Segmentation. We also evaluated segmentation quality in terms of precision, recall and F1 of word tokens (Brent, 1999; Venkataraman, 2001; Goldwater et al., 2009). To get credit for a word, the models must correctly identify both the left and right boundaries. For example, consider the following pair of a reference segmentation and a prediction:

Reference: do you see a boy
Prediction: doyou see a boy

Here 4 words are discovered in the prediction, whereas the reference has 5 words, and 3 words in the prediction match the reference. In this case we report scores as precision = 75.0 (3/4), recall = 60.0 (3/5), and F1, the harmonic mean of precision and recall, 66.7 (2/3). To facilitate comparison with previous work, segmentation results are reported on the union of the training, validation and test sets.
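The token-level scoring above can be made precise by comparing (start, end) character spans; this sketch (helper names are ours) reproduces the worked do-you-see-a-boy example.

def spans(words):
    """Convert a segmentation (list of words) to character spans (start, end)."""
    out, start = [], 0
    for w in words:
        out.append((start, start + len(w)))
        start += len(w)
    return out

def segmentation_prf(reference, prediction):
    """A word token counts as correct only if both of its boundaries match."""
    ref, pred = set(spans(reference)), set(spans(prediction))
    correct = len(ref & pred)
    p = correct / len(pred)
    r = correct / len(ref)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

reference = ["do", "you", "see", "a", "boy"]
prediction = ["doyou", "see", "a", "boy"]
print(segmentation_prf(reference, prediction))   # (0.75, 0.6, ~0.667)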

            Sentences              Char Types         Word Types              Characters            Avg. Word Length
            Train   Valid   Test   Train Valid Test   Train   Valid   Test    Train  Valid  Test    Train  Valid  Test
BR-text      7832     979    979     30    30    29    1237     473    475    129k    16k    16k    3.82   4.06   3.83
BR-phono     7832     978    978     51    51    50    1183     457    462    104k    13k    13k    2.86   2.97   2.83
PTB         42068    3370   3761     50    50    48   10000    6022   6049    5.1M   400k   450k    4.44   4.37   4.41
CTB         50734     349    345    160    76    76   60095    1769   1810    3.1M    18k    22k    4.84   5.07   5.14
PKU         17149    1841   1790     90    84    87   52539   13103  11665    2.6M   247k   241k    4.93   4.94   4.85
COCO         8000    2000  10000     50    42    48    4390    2260   5072    417k   104k   520k    4.00   3.99   3.99

Table 6: Summary of Dataset Statistics.

            max len (L)   min freq (F)      λ
BR-text         10             10        7.5e-4
BR-phono        10             10        9.5e-4
PTB             10            100        5.0e-5
CTB              5             25        1.0e-2
PKU              5             25        9.0e-3
COCO            10            100        2.0e-4

Table 7: Hyperparameter values used.

Page 2: Learning to Discover, Ground and Use Words with Segmental

6430

(rather than just a word discovery model) and itcan incorporate multimodal conditioning context(rather than just modeling language uncondition-ally) it avoids the two continuity problems identi-fied above Our model operates by generating textas a sequence of segments where each segmentis generated either character-by-character from asequence model or as a single draw from a lexicalmemory of multi-character units The segmenta-tion decisions and decisions about how to generatewords are not observed in the training data andmarginalized during learning using a dynamic pro-gramming algorithm (sect3)

Our model depends crucially on two componentsThe first is as mentioned a lexical memory Thislexicon stores pairs of a vector (key) and a string(value) the strings in the lexicon are contiguoussequences of characters encountered in the trainingdata and the vectors are randomly initialized andlearned during training The second componentis a regularizer (sect4) that prevents the model fromoverfitting to the training data by overusing thelexicon to account for the training data1

Our evaluation (sect5ndashsect7) looks at both languagemodeling performance and the quality of theinduced segmentations in both unconditional(sequence-only) contexts and when conditioningon a related image First we look at the seg-mentations induced by our model We find thatthese correspond closely to human intuitions aboutword segments competitive with the best exist-ing models for unsupervised word discovery Im-portantly these segments are obtained in modelswhose hyperparameters are tuned to optimize val-idation (held-out) likelihood whereas tuning thehyperparameters of our benchmark models usingheld-out likelihood produces poor segmentationsSecond we confirm findings (Kawakami et al2017 Mielke and Eisner 2018) that show thatword segmentation information leads to better lan-guage models compared to pure character modelsHowever in contrast to previous work we realizethis performance improvement without having toobserve the segment boundaries Thus our modelmay be applied straightforwardly to Chinese whereword boundaries are not part of the orthography

1Since the lexical memory stores strings that appear in thetraining data each sentence could in principle be generatedas a single lexical unit thus the model could fit the trainingdata perfectly while generalizing poorly The regularizer pe-nalizes based on the expectation of the powered length ofeach segment preventing this degenerate solution from beingoptimal

Ablation studies demonstrate that both the lexi-con and the regularizer are crucial for good per-formance particularly in word segmentationmdashremoving either or both significantly harms per-formance In a final experiment we learn to modellanguage that describes images and we find thatconditioning on visual context improves segmen-tation performance in our model (compared to theperformance when the model does not have accessto the image) On the other hand in a baselinemodel that predicts boundaries based on entropyspikes in a character-LSTM making the imageavailable to the model has no impact on the qualityof the induced segments demonstrating again thevalue of explicitly including a word lexicon in thelanguage model

2 Model

We now describe the segmental neural languagemodel (SNLM) Refer to Figure 1 for an illustrationThe SNLM generates a character sequence x =x1 xn where each xi is a character in a finitecharacter set Σ Each sequence x is the concatena-tion of a sequence of segments s = s1 s|s|where |s| le n measures the length of the se-quence in segments and each segment si isin Σ+

is a sequence of characters si1 si|si| In-tuitively each si corresponds to one word Letπ(s1 si) represent the concatenation of thecharacters of the segments s1 to si discarding seg-mentation information thus x = π(s) For exam-ple if x = anapple the underlying segmentationmight be s = an apple (with s1 = an ands2 = apple) or s = a nap ple or any of the2|x|minus1 segmentation possibilities for x

The SNLM defines the distribution over x as themarginal distribution over all segmentations thatgive rise to x ie

p(x) =sum

sπ(s)=x

p(s) (1)

To define the probability of p(s) we use the chainrule rewriting this in terms of a product of theseries of conditional probabilities p(st | sltt) Theprocess stops when a special end-sequence segment〈S〉 is generated To ensure that the summation inEq 1 is tractable we assume the following

p(st | sltt) asymp p(st | π(sltt)) = p(st | xltt) (2)

which amounts to a conditional semi-Markovassumptionmdashie non-Markovian generation hap-

6431

nC a uy o l o ko ta

hellip

o

l

o

o

k

o

ltwgt

k

l

ltwgt l

l o

l o o k hellip

ap app

appl

appl

e

lo loo

look

l o o k

hellip

look

slo

oke

look

ed

Figure 1 Fragment of the segmental neural language model while evaluating the marginal likelihood of a sequenceAt the indicated time the model has generated the sequence Canyou and four possible continuations are shown

pens inside each segment but the segment genera-tion probability does not depend on memory of theprevious segmentation decisions only upon the se-quence of characters π(sltt) corresponding to theprefix character sequence xltt This assumptionhas been employed in a number of related modelsto permit the use of LSTMs to represent rich his-tory while retaining the convenience of dynamicprogramming inference algorithms (Wang et al2017 Ling et al 2017 Graves 2012)

21 Segment generation

We model p(st | xltt) as a mixture of two modelsone that generates the segment using a sequencemodel and the other that generates multi-charactersequences as a single event Both are conditionalon a common representation of the history as is themixture proportion

Representing history To represent xltt we usean LSTM encoder to read the sequence of charac-ters where each character type σ isin Σ has a learnedvector embedding vσ Thus the history represen-tation at time t is ht = LSTMenc(vx1 vxt)This corresponds to the standard history representa-tion for a character-level language model althoughin general we assume that our modelled data is notdelimited by whitespace

Character-by-character generation The firstcomponent model pchar(st | ht) generates st bysampling a sequence of characters from a LSTMlanguage model over Σ and a two extra specialsymbols an end-of-word symbol 〈W〉 isin Σ andthe end-of-sequence symbol 〈S〉 discussed above

The initial state of the LSTM is a learned trans-formation of ht the initial cell is 0 and differentparameters than the history encoding LSTM areused During generation each letter that is sam-pled (ie each sti) is fed back into the LSTM inthe usual way and the probability of the charac-ter sequence decomposes according to the chainrule The end-of-sequence symbol can never begenerated in the initial position

Lexical generation The second componentmodel plex(st | ht) samples full segments fromlexical memory Lexical memory is a key-valuememory containing M entries where each key kia vector is associated with a value vi isin Σ+ Thegeneration probability of st is defined as

hprimet = MLP(ht)

m = softmax(Khprimet + b)

plex(st | ht) =Msumi=1

mi[vi = st]

where [vi = st] is 1 if the ith value in memory isst and 0 otherwise and K is a matrix obtained bystacking the kgti rsquos This generation process assignszero probability to most strings but the alternatecharacter model can generate all of Σ+

In this work we fix the virsquos to be subsequencesof at least length 2 and up to a maximum lengthL that are observed at least F times in the trainingdata These values are tuned as hyperparameters(See Appendix C for details of the experiments)

Mixture proportion. The mixture proportion, g_t, determines how likely the character generator is to be used at time t (the lexicon is used with probability 1 − g_t). It is defined as g_t = σ(MLP(h_t)).

Total segment probability. The total generation probability of s_t is thus

$$p(s_t \mid \mathbf{x}_{<t}) = g_t \, p_{\mathrm{char}}(s_t \mid \mathbf{h}_t) + (1 - g_t) \, p_{\mathrm{lex}}(s_t \mid \mathbf{h}_t)$$
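For reference, the segment probability used throughout the rest of the paper is just this gated interpolation; a sketch with the trained components passed in as callables (all names here are stand-ins):

```python
import math

def segment_prob(segment, h_t, p_char, p_lex, gate_score):
    """p(s_t | x_{<t}): interpolate the character model and the lexical memory
    with the history-dependent mixture proportion g_t = sigma(MLP(h_t))."""
    g_t = 1.0 / (1.0 + math.exp(-gate_score(h_t)))   # sigma of the gate MLP output
    return g_t * p_char(segment, h_t) + (1.0 - g_t) * p_lex(segment, h_t)
```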

3 Inference

We are interested in two inference questions: first, given a sequence x, evaluate its (log) marginal likelihood; second, given x, find the most likely decomposition into segments s*.

Marginal likelihood. To efficiently compute the marginal likelihood, we use a variant of the forward algorithm for semi-Markov models (Yu, 2010), which incrementally computes a sequence of probabilities α_i, where α_i is the marginal likelihood of generating x_{≤i} and concluding a segment at time i. Although there are an exponential number of segmentations of x, these values can be computed using O(|x|) space and O(|x|^2) time as

$$\alpha_0 = 1, \qquad \alpha_t = \sum_{j=t-L}^{t-1} \alpha_j \, p(s = \mathbf{x}_{j:t} \mid \mathbf{x}_{<j}) \qquad (3)$$

By letting x_{t+1} = ⟨S⟩, we have p(x) = α_{t+1}.
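A sketch of this forward pass in log space, with the model wrapped as a callable seg_logprob(prefix, segment) returning log p(segment | prefix), a stand-in for the mixture defined in §2.1; max_len plays the role of L:

```python
import math

def marginal_log_likelihood(x, seg_logprob, max_len):
    """Semi-Markov forward algorithm (Eq. 3) in log space.
    x is the unsegmented character sequence; the end-of-sequence marker <S>
    is appended so that the returned value is log alpha_{|x|+1} = log p(x)."""
    xs = list(x) + ["<S>"]
    T = len(xs)
    log_alpha = [float("-inf")] * (T + 1)
    log_alpha[0] = 0.0                                   # alpha_0 = 1
    for t in range(1, T + 1):
        terms = [log_alpha[j] + seg_logprob(xs[:j], xs[j:t])
                 for j in range(max(0, t - max_len), t)
                 if log_alpha[j] > float("-inf")]
        if terms:                                        # log-sum-exp over predecessors
            m = max(terms)
            log_alpha[t] = m + math.log(sum(math.exp(v - m) for v in terms))
    return log_alpha[T]
```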

Most probable segmentation. The most probable segmentation of a sequence x can be computed by replacing the summation with a max operator in Eq. 3 and maintaining backpointers.
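The corresponding Viterbi pass, with the same interface, returns the segments of the most probable decomposition:

```python
def map_segmentation(x, seg_logprob, max_len):
    """Replace the sum in Eq. 3 with a max, keep backpointers, and walk them
    backwards to recover s*. The final segment absorbs the <S> marker."""
    xs = list(x) + ["<S>"]
    T = len(xs)
    best = [float("-inf")] * (T + 1)
    back = [0] * (T + 1)
    best[0] = 0.0
    for t in range(1, T + 1):
        for j in range(max(0, t - max_len), t):
            score = best[j] + seg_logprob(xs[:j], xs[j:t])
            if score > best[t]:
                best[t], back[t] = score, j
    segments, t = [], T
    while t > 0:                       # follow backpointers from the end
        segments.append("".join(xs[back[t]:t]))
        t = back[t]
    return list(reversed(segments))
```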

4 Expected length regularization

When the lexical memory contains all the substrings in the training data, the model easily overfits by copying the longest continuation from the memory. To prevent overfitting, we introduce a regularizer that penalizes based on the expectation of the exponentiated (by a hyperparameter β) length of each segment:

$$R(\mathbf{x}, \beta) = \sum_{\mathbf{s}\,:\,\pi(\mathbf{s}) = \mathbf{x}} p(\mathbf{s} \mid \mathbf{x}) \sum_{s \in \mathbf{s}} |s|^{\beta}$$

This can be understood as a regularizer based on the double exponential prior identified to be effective in previous work (Liang and Klein, 2009; Berg-Kirkpatrick et al., 2010). This expectation is a differentiable function of the model parameters. Because of the linearity of the penalty across segments, it can be computed efficiently using the above dynamic programming algorithm under the expectation semiring (Eisner, 2002). This is particularly efficient since the expectation semiring jointly computes the expectation and marginal likelihood in a single forward pass. For more details about computing gradients of expectations under distributions over structured objects with dynamic programs and semirings, see Li and Eisner (2009).
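On very short strings the regularizer can also be computed by brute force, which is a convenient check on the semiring-based computation; the sketch below enumerates every segmentation explicitly (exponential in |x|, and it ignores the ⟨S⟩ convention for brevity; seg_logprob is the same stand-in as above):

```python
import math
from itertools import combinations

def expected_length_penalty(x, seg_logprob, beta):
    """Brute-force R(x, beta): weight each segmentation by its posterior
    p(s | x) and average the per-segmentation penalty sum_{s in s} |s|^beta."""
    T = len(x)
    weights, penalties = [], []
    for k in range(T):                                   # number of internal boundaries
        for cuts in combinations(range(1, T), k):
            bounds = (0,) + cuts + (T,)
            segs = [x[a:b] for a, b in zip(bounds, bounds[1:])]
            logp = sum(seg_logprob(x[:a], s) for a, s in zip(bounds, segs))
            weights.append(math.exp(logp))               # proportional to p(s)
            penalties.append(sum(len(s) ** beta for s in segs))
    Z = sum(weights)                                     # proportional to p(x)
    return sum(w * r for w, r in zip(weights, penalties)) / Z
```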

4.1 Training Objective

The model parameters are trained by minimizing the penalized log likelihood of a training corpus D of unsegmented sentences:

$$\mathcal{L} = \sum_{\mathbf{x} \in D} \left[ -\log p(\mathbf{x}) + \lambda R(\mathbf{x}, \beta) \right]$$

5 Datasets

We evaluate our model on both English and Chinese segmentation. For both languages, we used standard datasets for word segmentation and language modeling. We also use MS-COCO to evaluate how the model can leverage conditioning context information. For all datasets we used train, validation, and test splits.² Since our model assumes a closed character set, we removed validation and test samples which contain characters that do not appear in the training set. In the English corpora, whitespace characters are removed. In Chinese, they are not present to begin with. Refer to Appendix A for dataset statistics.

5.1 English

Brent Corpus. The Brent corpus is a standard corpus used in statistical modeling of child language acquisition (Brent, 1999; Venkataraman, 2001).³ The corpus contains transcriptions of utterances directed at 13- to 23-month-old children. The corpus has two variants: an orthographic one (BR-text) and a phonemic one (BR-phono), where each character corresponds to a single English phoneme. As the Brent corpus does not have a standard train and test split, and we want to tune the parameters by measuring the fit to held-out data, we used the first 80% of the utterances for training, the next 10% for validation, and the rest for test.

²The data and splits used are available at https://s3.eu-west-2.amazonaws.com/k-kawakami/seg.zip

³https://childes.talkbank.org/derived


English Penn Treebank (PTB). We use the commonly used version of the PTB prepared by Mikolov et al. (2010). However, since we removed space symbols from the corpus, our cross entropy results cannot be compared to those usually reported on this dataset.

5.2 Chinese

Since Chinese orthography does not mark spaces between words, there have been a number of efforts to annotate word boundaries. We evaluate against two corpora that have been manually segmented according to different segmentation standards.

Beijing University Corpus (PKU). The Beijing University Corpus was one of the corpora used for the International Chinese Word Segmentation Bakeoff (Emerson, 2005).

Chinese Penn Treebank (CTB). We use the Penn Chinese Treebank Version 5.1 (Xue et al., 2005). It generally has a coarser segmentation than PKU (e.g., in CTB a full name consisting of a given name and family name is a single token), and it is a larger corpus.

5.3 Image Caption Dataset

To assess whether jointly learning about the meanings of words from non-linguistic context affects segmentation performance, we use image and caption pairs from the COCO caption dataset (Lin et al., 2014). We use 10,000 examples for both training and testing, and we only use one reference per image. The images are used as conditional context to predict captions. Refer to Appendix B for the dataset construction process.

6 Experiments

We compare our model to benchmark Bayesian models, which are currently the best known unsupervised word discovery models, as well as to a simple deterministic segmentation criterion based on surprisal peaks (Elman, 1990), on language modeling and segmentation performance. Although the Bayesian models have been shown to be able to discover plausible word-like units, we found that the hyperparameter settings that give the best language modeling performance with these models do not produce the good structures reported in previous work. This is problematic, since there are no objective criteria for choosing hyperparameters in a fully unsupervised manner when the model is applied to completely unknown languages or domains. Thus, our experiments are designed to assess how well the models infer word segmentations of unsegmented inputs when they are trained and tuned to maximize the likelihood of the held-out text.

DP/HDP Benchmarks. Among the most effective existing word segmentation models are those based on hierarchical Dirichlet process (HDP) models (Goldwater et al., 2009; Teh et al., 2006) and hierarchical Pitman–Yor processes (Mochihashi et al., 2009). As a representative of these, we use a simple bigram HDP model:

$$\begin{aligned}
\theta_{\cdot} &\sim \mathrm{DP}(\alpha_0, p_0) \\
\theta_{\cdot \mid s} &\sim \mathrm{DP}(\alpha_1, \theta_{\cdot}) \qquad \forall s \in \Sigma^* \\
s_{t+1} \mid s_t &\sim \mathrm{Categorical}(\theta_{\cdot \mid s_t})
\end{aligned}$$

The base distribution p_0 is defined over strings in Σ* ∪ {⟨S⟩} by deciding, with a specified probability, to end the utterance, a geometric length model, and a uniform probability over Σ at each position. Intuitively, it captures the preference for having short words in the lexicon. In addition to the HDP model, we also evaluate a simpler single Dirichlet process (DP) version of the model, in which the s_t's are generated directly as draws from Categorical(θ_·). We use an empirical Bayesian approach to select hyperparameters based on the likelihood assigned by the inferred posterior to a held-out validation set. Refer to Appendix D for details on inference.
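To make the base distribution concrete, here is a sketch of sampling a single draw from p0; the stopping probabilities are illustrative placeholders rather than values used in the paper:

```python
import random

def sample_from_p0(alphabet, p_end_utterance=0.01, p_continue=0.5):
    """Draw from the base distribution: either the end-of-utterance marker <S>,
    or a word whose length is geometric and whose characters are uniform over
    the alphabet (the bias towards short lexical items)."""
    if random.random() < p_end_utterance:
        return "<S>"
    chars = [random.choice(alphabet)]
    while random.random() < p_continue:          # geometric length model
        chars.append(random.choice(alphabet))
    return "".join(chars)
```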

Deterministic Baselines. Incremental word segmentation is inherently ambiguous (e.g., the letters the might be a single word, or they might be the beginning of the longer word theater). Nevertheless, several deterministic functions of prefixes have been proposed in the literature as strategies for discovering rudimentary word-like units, hypothesized to be useful for bootstrapping the lexical acquisition process or for improving a model's predictive accuracy. These range from surprisal criteria (Elman, 1990) to sophisticated language models that switch between models that capture intra- and inter-word dynamics based on deterministic functions of prefixes of characters (Chung et al., 2017; Shen et al., 2018).

In our experiments, we also include such deterministic segmentation results, using (1) the surprisal criterion of Elman (1990) and (2) a two-level hierarchical multiscale LSTM (Chung et al., 2017), which has been shown to predict boundaries in whitespace-containing character sequences at positions corresponding to word boundaries. As with all experiments in this paper, the BR corpora for this experiment do not contain spaces.

SNLM Model Configurations and Evaluation. LSTMs had 512 hidden units, with parameters learned using the Adam update rule (Kingma and Ba, 2015). We evaluated our models with bits-per-character (bpc) and segmentation accuracy (Brent, 1999; Venkataraman, 2001; Goldwater et al., 2009). Refer to Appendices C–F for details of model configurations and evaluation metrics.

For the image caption dataset, we extend the model with a standard attention mechanism in the backbone LSTM (LSTM_enc) to incorporate image context. For every character input, the model calculates attention over image features and uses it to predict the next characters. As image representations, we use features from the last convolutional layer of a pre-trained VGG19 model (Simonyan and Zisserman, 2014).

7 Results

In this section, we first do a careful comparison of segmentation performance on the phonemic Brent corpus (BR-phono) across several different segmentation baselines, and we find that our model obtains competitive segmentation performance. Additionally, ablation experiments demonstrate that both lexical memory and the proposed expected length regularization are necessary for inferring good segmentations. We then show that on other corpora we likewise obtain better segmentations than the baseline models. Finally, we also show that our model has superior performance, in terms of held-out perplexity, compared to a character-level LSTM language model. Thus, overall, our results show that we can obtain good segmentations on a variety of tasks, while still having very good language modeling performance.

Word Segmentation (BR-phono). Table 1 summarizes the segmentation results on the widely used BR-phono corpus, comparing it to a variety of baselines. Unigram DP, Bigram HDP, LSTM surprisal, and HMLSTM refer to the benchmark models explained in §6. The ablated versions of our model show that without the lexicon (−memory), without the expected length penalty (−length), and without either, our model fails to discover good segmentations. Furthermore, we draw attention to the difference in the performance of the HDP and DP models when using subjective settings of the hyperparameters and the empirical settings (likelihood). Finally, the deterministic baselines are interesting in two ways. First, LSTM surprisal is a remarkably good heuristic for segmenting text (although we will see below that its performance is much less good on other datasets). Second, despite careful tuning, the HMLSTM of Chung et al. (2017) fails to discover good segments, although in their paper they show that when spaces are present between words, HMLSTMs learn to switch between their internal models in response to them.

Furthermore, the priors used in the DP/HDP models were tuned to maximize the likelihood assigned to the validation set by the inferred posterior predictive distribution, in contrast to previous papers, which either set them subjectively or inferred them (Johnson and Goldwater, 2009). For example, the DP and HDP models with subjective priors obtained 53.8 and 72.3 F1 scores, respectively (Goldwater et al., 2009). However, when the hyperparameters are set to maximize held-out likelihood, these drop to 56.1 and 56.9. Another result on this dataset is the feature unigram model of Berg-Kirkpatrick et al. (2010), which obtains an 88.0 F1 score with hand-crafted features and by selecting the regularization strength to optimize segmentation performance. Once the features are removed, the model achieves a 71.5 F1 score when it is tuned on segmentation performance, and only 11.5 when it is tuned on held-out likelihood.

                                   P      R      F1
LSTM surprisal (Elman, 1990)      54.5   55.5   55.0
HMLSTM (Chung et al., 2017)        8.1   13.3   10.1
Unigram DP                        63.3   50.4   56.1
Bigram HDP                        53.0   61.4   56.9
SNLM (−memory, −length)           54.3   34.9   42.5
SNLM (+memory, −length)           52.4   36.8   43.3
SNLM (−memory, +length)           57.6   43.4   49.5
SNLM (+memory, +length)           81.3   77.5   79.3

Table 1: Summary of segmentation performance on the phoneme version of the Brent Corpus (BR-phono).

Word Segmentation (other corpora). Table 2 summarizes results on the BR-text (orthographic Brent corpus) and Chinese corpora. As in the previous section, all the models were trained to maximize held-out likelihood. Here we observe a similar pattern, with the SNLM outperforming the baseline models, despite the tasks being quite different from each other and from the BR-phono task.

                        P      R      F1
BR-text
  LSTM surprisal       36.4   49.0   41.7
  Unigram DP           64.9   55.7   60.0
  Bigram HDP           52.5   63.1   57.3
  SNLM                 68.7   78.9   73.5
PTB
  LSTM surprisal       27.3   36.5   31.2
  Unigram DP           51.0   49.1   50.0
  Bigram HDP           34.8   47.3   40.1
  SNLM                 54.1   60.1   56.9
CTB
  LSTM surprisal       41.6   25.6   31.7
  Unigram DP           61.8   49.6   55.0
  Bigram HDP           67.3   67.7   67.5
  SNLM                 78.1   81.5   79.8
PKU
  LSTM surprisal       38.1   23.0   28.7
  Unigram DP           60.2   48.2   53.6
  Bigram HDP           66.8   67.1   66.9
  SNLM                 75.0   71.2   73.1

Table 2: Summary of segmentation performance on other corpora.

Word Segmentation Qualitative Analysis. We show some representative examples of segmentations inferred by various models on the BR-text and PKU corpora in Table 3. As reported in Goldwater et al. (2009), we observe that the DP models tend to undersegment, keeping long, frequent sequences together (e.g., they failed to separate articles). HDPs do successfully prevent oversegmentation; however, we find that when trained to optimize held-out likelihood, they often insert unnecessary boundaries between words, such as yo u. Our model's performance is better, but it likewise shows a tendency to oversegment. Interestingly, we can observe a tendency to put boundaries between morphemes in morphologically complex lexical items, such as dumpty 's and go ing. Since morphemes are the minimal units that carry meaning in language, this segmentation, while incorrect, is at least plausible. Turning to the Chinese examples, we see that both baseline models fail to discover basic words such as 山间 (mountain) and 人们 (people).

Finally, we observe that none of the models successfully segment dates or numbers containing multiple digits (all oversegment). Since number types tend to be rare, they are usually not in the lexicon, meaning our model (and the HDP baselines) must generate them as character sequences.

Language Modeling Performance. The above results show that the SNLM infers good word segmentations. We now turn to the question of how well it predicts held-out data. Table 4 summarizes the results of the language modeling experiments. Again, we see that the SNLM outperforms the Bayesian models and a character LSTM. Although there are numerous extensions to LSTMs that improve language modeling performance, LSTMs remain a strong baseline (Melis et al., 2018).

One might object that, because of the lexicon, the SNLM has many more parameters than the character-level LSTM baseline model. However, unlike the parameters in the LSTM recurrence, which are used at every timestep, our memory parameters are accessed very sparsely. Furthermore, we observed that an LSTM with twice the hidden units did not improve on the baseline with 512 hidden units on either the phonemic or the orthographic version of the Brent corpus, but the lexicon could. This result suggests that more hidden units are useful if the model does not have enough capacity to fit larger datasets, but that the memory structure adds other dynamics which are not captured by large recurrent networks.

Multimodal Word Segmentation. Finally, we discuss results on word discovery with non-linguistic context (images). Although there is much evidence that neural networks can reliably learn to exploit additional relevant context to improve language modeling performance (e.g., machine translation and image captioning), it is still unclear whether the conditioning context helps to discover structure in the data. We turn to this question here. Table 5 summarizes the language modeling and segmentation performance of our model and a baseline character-LSTM language model on the COCO image caption dataset. We use the Elman entropy criterion to infer the segmentation points from the baseline LM, and the MAP segmentation under our model. Again, we find our model outperforms the baseline model in terms of both language modeling and word segmentation accuracy. Interestingly, we find that while conditioning on image context leads to reductions in perplexity in both models, in our model the presence of the image further improves segmentation accuracy.


BR-text

Reference    are you going to make him pretty this morning
Unigram DP   areyou goingto makehim pretty this morning
Bigram HDP   areyou go ingto make him p retty this mo rn ing
SNLM         are you go ing to make him pretty this morning

Reference    would you like to do humpty dumpty's button
Unigram DP   wouldyoul iketo do humpty dumpty 's button
Bigram HDP   would youlike to do humptyd umpty 's butt on
SNLM         would you like to do humpty dumpty 's button

PKU

Reference    笑声 掌声 欢呼声 在 山间 回荡 勾 起 了 人们 对 往事 的 回忆
Unigram DP   笑声 掌声 欢呼 声 在 山 间 回荡 勾 起了 人们对 往事 的 回忆
Bigram HDP   笑 声 掌声 欢 呼声 在 山 间 回 荡 勾 起了 人 们对 往事 的 回忆
SNLM         笑声 掌声 欢呼声 在 山间 回荡 勾起 了 人们 对 往事 的 回忆

Reference    不得 在 江河 电缆 保护区 内 抛锚 拖锚 炸鱼 挖沙
Unigram DP   不得 在 江河电缆 保护 区内抛锚 拖锚 炸鱼挖沙
Bigram HDP   不得 在 江 河 电缆 保护 区内 抛 锚拖 锚 炸鱼 挖沙
SNLM         不得 在 江河 电缆 保护区 内 抛锚 拖锚 炸鱼 挖沙

Table 3: Examples of predicted segmentations on English and Chinese.

             BR-text  BR-phono  PTB   CTB   PKU
Unigram DP   2.33     2.93      2.25  6.16  6.88
Bigram HDP   1.96     2.55      1.80  5.40  6.42
LSTM         2.03     2.62      1.65  4.94  6.20
SNLM         1.94     2.54      1.56  4.84  5.89

Table 4: Test language modeling performance (bpc).

This suggests that our model and its learning mechanism interact with the conditional context differently than the LSTM does.

To understand what kind of improvements in segmentation performance the image context leads to, we annotated the tokens in the references with part-of-speech (POS) tags and compared the relative improvements in recall between SNLM (−image) and SNLM (+image) among the five POS tags which appear more than 10,000 times. We observed improvements on ADJ (+4.5%), NOUN (+4.1%), and VERB (+3.1%). The improvements on the categories ADP (+0.5%) and DET (+0.3%) were more limited. The categories where we see the largest improvement in recall correspond to those that are likely a priori to correlate most reliably with observable features. Thus, this result is consistent with the hypothesis that the lexicon is successfully acquiring knowledge about how words idiosyncratically link to visual features.

Segmentation State-of-the-Art. The results reported are not the best-reported numbers on the

                   bpc↓   P↑     R↑     F1↑
Unigram DP         2.23   44.0   40.0   41.9
Bigram HDP         1.68   30.9   40.8   35.1
LSTM (−image)      1.55   31.3   38.2   34.4
SNLM (−image)      1.52   39.8   55.3   46.3
LSTM (+image)      1.42   31.7   39.1   35.0
SNLM (+image)      1.38   46.4   62.0   53.1

Table 5: Language modeling (bpc) and segmentation accuracy on the COCO dataset. +image indicates that the model has access to image context.

English phoneme or Chinese segmentation tasks. As we discussed in the introduction, previous work has focused on segmentation in isolation from language modeling performance. Models that obtain better segmentations include the adaptor grammars (F1: 87.0) of Johnson and Goldwater (2009) and the feature-unigram model (88.0) of Berg-Kirkpatrick et al. (2010). While these results are better in terms of segmentation, they are weak language models (the feature unigram model is effectively a unigram word model, the adaptor grammar model is effectively a phrasal unigram model; both are incapable of generalizing about substantially non-local dependencies). Additionally, the features and grammars used in prior work reflect certain English-specific design considerations (e.g., syllable structure in the case of adaptor grammars and phonotactic equivalence classes in the feature unigram model), which make them questionable models if the goal is to


explore what models and biases enable word discovery in general. For Chinese, the best nonparametric models perform better at segmentation (Zhao and Kit, 2008; Mochihashi et al., 2009), but again they are weaker language models than neural models. The neural model of Sun and Deng (2018) is similar to our model without lexical memory or length regularization; it obtains 80.2 F1 on the PKU dataset; however, it uses gold segmentation data during training and hyperparameter selection,⁴ whereas our approach requires no gold standard segmentation data.

8 Related Work

Learning to discover and represent temporally extended structures in a sequence is a fundamental problem in many fields. For example, in language processing, unsupervised learning of multiple levels of linguistic structure, such as morphemes (Snyder and Barzilay, 2008), words (Goldwater et al., 2009; Mochihashi et al., 2009; Wang et al., 2014), and phrases (Klein and Manning, 2001), has been investigated. Recently, speech recognition has benefited from techniques that enable the discovery of subword units (Chan et al., 2017; Wang et al., 2017); however, in that work the optimally discovered character sequences look quite unlike orthographic words. In fact, the model proposed by Wang et al. (2017) is essentially our model without a lexicon or the expected length regularization, i.e., (−memory, −length), which we have shown performs quite poorly in terms of segmentation accuracy. Finally, some prior work has also sought to discover lexical units directly from speech based on speech-internal statistical regularities (Kamper et al., 2016), as well as jointly with grounding (Chrupała et al., 2017).

9 Conclusion

Word discovery is a fundamental problem in language acquisition. While work studying the problem in isolation has provided valuable insights (showing both what data is sufficient for word discovery and with which models), this paper shows that neural models offer the flexibility and performance to productively study the various facets of the problem in a more unified model. While this work unifies several components that had previously been

⁴https://github.com/Edward-Sun/SLM/blob/d37ad735a7b1d5af430b96677c2ecf37a65f59b7/codes/run.py#L329

studied in isolation, our model assumes access to phonetic categories. The development of these categories likely interacts with the development of the lexicon and the acquisition of semantics (Feldman et al., 2013; Fourtassi and Dupoux, 2014), and thus subsequent work should seek to unify more aspects of the acquisition problem.

Acknowledgments

We thank Mark Johnson, Sharon Goldwater, and Emmanuel Dupoux, as well as many colleagues at DeepMind, for their insightful comments and suggestions for improving this work and the resulting paper.

References

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proc. NAACL.

Michael R. Brent. 1999. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34(1):71–105.

Michael R. Brent and Timothy A. Cartwright. 1996. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61(1):93–125.

William Chan, Yu Zhang, Quoc Le, and Navdeep Jaitly. 2017. Latent sequence decompositions. In Proc. ICLR.

Grzegorz Chrupała, Lieke Gelderloos, and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical multiscale recurrent neural networks. In Proc. ICLR.

Jason Eisner. 2002. Parameter estimation for probabilistic finite-state transducers. In Proc. ACL.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Thomas Emerson. 2005. The second international Chinese word segmentation bakeoff. In Proc. SIGHAN Workshop.

Naomi H. Feldman, Thomas L. Griffiths, Sharon Goldwater, and James L. Morgan. 2013. A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4):751–778.

Abdellah Fourtassi and Emmanuel Dupoux. 2014. A rudimentary lexicon and semantics help bootstrap phoneme acquisition. In Proc. EMNLP.


Lieke Gelderloos and Grzegorz Chrupała. 2016. From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning. In Proc. COLING.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Mark Johnson, Katherine Demuth, Michael Frank, and Bevan K. Jones. 2010. Synergies in learning words and their referents. In Proc. NIPS.

Mark Johnson and Sharon Goldwater. 2009. Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proc. NAACL, pages 317–325.

Ákos Kádár, Marc-Alexandre Côté, Grzegorz Chrupała, and Afra Alishahi. 2018. Revisiting the hierarchical multiscale LSTM. In Proc. COLING.

Herman Kamper, Aren Jansen, and Sharon Goldwater. 2016. Unsupervised word segmentation and lexicon discovery using acoustic word embeddings. IEEE Transactions on Audio, Speech and Language Processing, 24(4):669–679.

Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2017. Learning to create and reuse words in open-vocabulary neural language modeling. In Proc. ACL.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. ICLR.

Dan Klein and Christopher D. Manning. 2001. Distributional phrase structure induction. In Workshop Proc. ACL.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proc. EMNLP.

Percy Liang and Dan Klein. 2009. Online EM for unsupervised models. In Proc. NAACL.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proc. ECCV, pages 740–755.

Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kociský, Andrew Senior, Fumin Wang, and Phil Blunsom. 2017. Latent predictor networks for code generation. In Proc. ACL.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In Proc. ICLR.

Sebastian J. Mielke and Jason Eisner. 2018. Spell once, summon anywhere: A two-level open-vocabulary language model. In Proc. NAACL.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proc. Interspeech.

Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman–Yor language modeling.

Steven Pinker. 1984. Language Learnability and Language Development. Harvard University Press.

Okko Räsänen and Heikki Rasilo. 2015. A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122(4):792–829.

Jenny R. Saffran, Richard N. Aslin, and Elissa L. Newport. 1996. Statistical learning by 8-month-old infants. Science, 274(5294):1926–1928.

Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018. Neural language modeling by jointly learning syntax and lexicon. In Proc. ICLR.

K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

Benjamin Snyder and Regina Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In Proc. ACL.

Zhiqing Sun and Zhi-Hong Deng. 2018. Unsupervised neural word segmentation for Chinese via segmental language modeling.

Yee-Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Anand Venkataraman. 2001. A statistical model for word discovery in transcribed speech. Computational Linguistics, 27(3):351–372.

Chong Wang, Yining Wan, Po-Sen Huang, Abdelrahman Mohammad, Dengyong Zhou, and Li Deng. 2017. Sequence modeling via segmentations. In Proc. ICML.

Xiaolin Wang, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2014. Empirical study of unsupervised Chinese word segmentation methods for SMT on large-scale corpora.


Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Marta Palmer. 2005. The Penn Chinese treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Shun-Zheng Yu. 2010. Hidden semi-Markov models. Artificial Intelligence, 174(2):215–243.

Hai Zhao and Chunyu Kit. 2008. An empirical comparison of goodness measures for unsupervised Chinese word segmentation with a unified framework. In Proc. IJCNLP.


A Dataset statistics

Table 6 summarizes dataset statistics.

B Image Caption Dataset Construction

We use 8,000, 2,000, and 10,000 images for the train, development, and test sets, in order of the integer ids specifying images in cocoapi,⁵ and use the first annotation provided for each image. We will make the pairs of image id and annotation id available at https://s3.eu-west-2.amazonaws.com/k-kawakami/seg.zip

C SNLM Model Configuration

For each RNN-based model, we used 512 dimensions for the character embeddings, and the LSTMs have 512 hidden units. All the parameters, including character projection parameters, are randomly sampled from a uniform distribution from −0.08 to 0.08. The initial hidden and memory states of the LSTMs are initialized with zero. A dropout rate of 0.5 was used for all but the recurrent connections.

To restrict the size of the memory, we stored substrings which appeared at least F times in the training corpora, and tuned F with grid search. The maximum length of subsequences L was tuned on the held-out likelihood using a grid search. Table 7 summarizes the parameters for each dataset. Note that we did not tune the hyperparameters on segmentation quality, to ensure that the models are trained in a purely unsupervised manner, assuming no reference segmentations are available.

D DP/HDP Inference

By integrating out the draws from the DPs, it is possible to do inference using Gibbs sampling directly in the space of segmentation decisions. We use 1,000 iterations with annealing to find an approximation of the MAP segmentation, and then use the corresponding posterior predictive distribution to estimate the held-out likelihood assigned by the model, marginalizing the segmentations using appropriate dynamic programs. The evaluated segmentation was the most probable segmentation according to the posterior predictive distribution.

In the original Bayesian segmentation work, the hyperparameters (i.e., α_0, α_1, and the components of p_0) were selected subjectively. To make the comparison with our neural models fairer, we instead used an empirical approach and set them using the

⁵https://github.com/cocodataset/cocoapi

held-out likelihood of the validation set. However, since this disadvantages the DP/HDP models in terms of segmentation, we also report the original results on the BR corpora.

E Learning

The models were trained with the Adam update rule (Kingma and Ba, 2015) with a learning rate of 0.01. The learning rate is divided by 4 if there is no improvement on development data. The maximum norm of the gradients was clipped at 10.

F Evaluation Metrics

Language Modeling. We evaluated our models with bits-per-character (bpc), a standard evaluation metric for character-level language models. Following the definition in Graves (2013), bits-per-character is the average value of −log2 p(x_t | x_{<t}) over the whole test set:

$$\mathrm{bpc} = -\frac{1}{|\mathbf{x}|} \log_2 p(\mathbf{x})$$

where |x| is the length of the corpus in characters. The bpc is reported on the test set.
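Since models typically report natural-log likelihoods, the conversion to bpc is just a change of base; a small helper under that assumption (the function name is ours):

```python
import math

def bits_per_character(total_log_prob_nats, num_chars):
    """bpc = -(1/|x|) log2 p(x), given the summed natural-log likelihood of the
    test corpus and its length in characters."""
    return -total_log_prob_nats / (num_chars * math.log(2))
```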

Segmentation. We also evaluated segmentation quality in terms of precision, recall, and F1 of word tokens (Brent, 1999; Venkataraman, 2001; Goldwater et al., 2009). To get credit for a word, the models must correctly identify both the left and right boundaries. For example, consider the following pair of a reference segmentation and a prediction:

Reference do you see a boy

Prediction doyou see a boy

Then 4 words are discovered in the prediction, whereas the reference has 5 words; 3 words in the prediction match the reference. In this case, we report scores as precision = 75.0 (3/4), recall = 60.0 (3/5), and F1, the harmonic mean of precision and recall, 66.7 (2/3). To facilitate comparison with previous work, segmentation results are reported on the union of the training, validation, and test sets.
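The scoring reduces to comparing character-offset spans of the two segmentations. The sketch below reproduces the example above (token_f1 is our own name, not from the paper):

```python
def token_f1(reference, prediction):
    """Token precision/recall/F1: a predicted word is correct only if both of
    its boundaries match the reference segmentation."""
    def spans(words):
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out
    ref, pred = spans(reference.split()), spans(prediction.split())
    correct = len(ref & pred)
    precision = correct / len(pred)
    recall = correct / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# token_f1("do you see a boy", "doyou see a boy") -> (0.75, 0.6, 0.666...)
```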


           Sentences               Char Types          Word Types              Characters            Avg. Word Length
           Train   Valid  Test     Train Valid Test    Train   Valid  Test     Train  Valid  Test    Train Valid Test
BR-text     7832     979   979        30    30    29    1237     473   475      129k    16k   16k     3.82  4.06  3.83
BR-phono    7832     978   978        51    51    50    1183     457   462      104k    13k   13k     2.86  2.97  2.83
PTB        42068    3370  3761        50    50    48   10000    6022  6049      5.1M   400k  450k     4.44  4.37  4.41
CTB        50734     349   345       160    76    76   60095    1769  1810      3.1M    18k   22k     4.84  5.07  5.14
PKU        17149    1841  1790        90    84    87   52539   13103 11665      2.6M   247k  241k     4.93  4.94  4.85
COCO        8000    2000 10000        50    42    48    4390    2260  5072      417k   104k  520k     4.00  3.99  3.99

Table 6: Summary of dataset statistics.

           max len (L)   min freq (F)   λ
BR-text    10            10             7.5e-4
BR-phono   10            10             9.5e-4
PTB        10            100            5.0e-5
CTB        5             25             1.0e-2
PKU        5             25             9.0e-3
COCO       10            100            2.0e-4

Table 7: Hyperparameter values used.

Page 3: Learning to Discover, Ground and Use Words with Segmental

6431

nC a uy o l o ko ta

hellip

o

l

o

o

k

o

ltwgt

k

l

ltwgt l

l o

l o o k hellip

ap app

appl

appl

e

lo loo

look

l o o k

hellip

look

slo

oke

look

ed

Figure 1 Fragment of the segmental neural language model while evaluating the marginal likelihood of a sequenceAt the indicated time the model has generated the sequence Canyou and four possible continuations are shown

pens inside each segment but the segment genera-tion probability does not depend on memory of theprevious segmentation decisions only upon the se-quence of characters π(sltt) corresponding to theprefix character sequence xltt This assumptionhas been employed in a number of related modelsto permit the use of LSTMs to represent rich his-tory while retaining the convenience of dynamicprogramming inference algorithms (Wang et al2017 Ling et al 2017 Graves 2012)

21 Segment generation

We model p(st | xltt) as a mixture of two modelsone that generates the segment using a sequencemodel and the other that generates multi-charactersequences as a single event Both are conditionalon a common representation of the history as is themixture proportion

Representing history To represent xltt we usean LSTM encoder to read the sequence of charac-ters where each character type σ isin Σ has a learnedvector embedding vσ Thus the history represen-tation at time t is ht = LSTMenc(vx1 vxt)This corresponds to the standard history representa-tion for a character-level language model althoughin general we assume that our modelled data is notdelimited by whitespace

Character-by-character generation The firstcomponent model pchar(st | ht) generates st bysampling a sequence of characters from a LSTMlanguage model over Σ and a two extra specialsymbols an end-of-word symbol 〈W〉 isin Σ andthe end-of-sequence symbol 〈S〉 discussed above

The initial state of the LSTM is a learned trans-formation of ht the initial cell is 0 and differentparameters than the history encoding LSTM areused During generation each letter that is sam-pled (ie each sti) is fed back into the LSTM inthe usual way and the probability of the charac-ter sequence decomposes according to the chainrule The end-of-sequence symbol can never begenerated in the initial position

Lexical generation The second componentmodel plex(st | ht) samples full segments fromlexical memory Lexical memory is a key-valuememory containing M entries where each key kia vector is associated with a value vi isin Σ+ Thegeneration probability of st is defined as

hprimet = MLP(ht)

m = softmax(Khprimet + b)

plex(st | ht) =Msumi=1

mi[vi = st]

where [vi = st] is 1 if the ith value in memory isst and 0 otherwise and K is a matrix obtained bystacking the kgti rsquos This generation process assignszero probability to most strings but the alternatecharacter model can generate all of Σ+

In this work we fix the virsquos to be subsequencesof at least length 2 and up to a maximum lengthL that are observed at least F times in the trainingdata These values are tuned as hyperparameters(See Appendix C for details of the experiments)

Mixture proportion The mixture proportion gtdetermines how likely the character generator is to

6432

be used at time t (the lexicon is used with probabil-ity 1minus gt) It is defined by as gt = σ(MLP(ht))

Total segment probability The total generationprobability of st is thus

p(st | xltt) = gtpchar(st | ht)+(1minus gt)plex(st | ht)

3 Inference

We are interested in two inference questions firstgiven a sequence x evaluate its (log) marginallikelihood second given x find the most likelydecomposition into segments slowast

Marginal likelihood To efficiently compute themarginal likelihood we use a variant of the forwardalgorithm for semi-Markov models (Yu 2010)which incrementally computes a sequence of prob-abilities αi where αi is the marginal likelihood ofgenerating xlei and concluding a segment at timei Although there are an exponential number ofsegmentations of x these values can be computedusing O(|x|) space and O(|x|2) time as

α0 = 1 αt =

tminus1sumj=tminusL

αjp(s = xjt | xltj)

(3)

By letting xt+1 = 〈S〉 then p(x) = αt+1

Most probable segmentation The most proba-ble segmentation of a sequence x can be computedby replacing the summation with a max operatorin Eq 3 and maintaining backpointers

4 Expected length regularization

When the lexical memory contains all the sub-strings in the training data the model easily over-fits by copying the longest continuation from thememory To prevent overfitting we introduce a reg-ularizer that penalizes based on the expectation ofthe exponentiated (by a hyperparameter β) lengthof each segment

R(x β) =sum

sπ(s)=x

p(s | x)sumsisins|s|β

This can be understood as a regularizer based onthe double exponential prior identified to be ef-fective in previous work (Liang and Klein 2009Berg-Kirkpatrick et al 2010) This expectation

is a differentiable function of the model parame-ters Because of the linearity of the penalty acrosssegments it can be computed efficiently using theabove dynamic programming algorithm under theexpectation semiring (Eisner 2002) This is par-ticularly efficient since the expectation semiringjointly computes the expectation and marginal like-lihood in a single forward pass For more detailsabout computing gradients of expectations underdistributions over structured objects with dynamicprograms and semirings see Li and Eisner (2009)

41 Training ObjectiveThe model parameters are trained by minimizingthe penalized log likelihood of a training corpus Dof unsegmented sentences

L =sumxisinD

[minus log p(x) + λR(x β)]

5 Datasets

We evaluate our model on both English and Chi-nese segmentation For both languages we usedstandard datasets for word segmentation and lan-guage modeling We also use MS-COCO to evalu-ate how the model can leverage conditioning con-text information For all datasets we used trainvalidation and test splits2 Since our model assumesa closed character set we removed validation andtest samples which contain characters that do notappear in the training set In the English corporawhitespace characters are removed In Chinesethey are not present to begin with Refer to Ap-pendix A for dataset statistics

51 EnglishBrent Corpus The Brent corpus is a standardcorpus used in statistical modeling of child lan-guage acquisition (Brent 1999 Venkataraman2001)3 The corpus contains transcriptions of utter-ances directed at 13- to 23-month-old children Thecorpus has two variants an orthographic one (BR-text) and a phonemic one (BR-phono) where eachcharacter corresponds to a single English phonemeAs the Brent corpus does not have a standard trainand test split and we want to tune the parametersby measuring the fit to held-out data we used thefirst 80 of the utterances for training and the next10 for validation and the rest for test

2The data and splits used are available athttpss3eu-west-2amazonawscomk-kawakamisegzip

3httpschildestalkbankorgderived

6433

English Penn Treebank (PTB) We use the com-monly used version of the PTB prepared byMikolov et al (2010) However since we removedspace symbols from the corpus our cross entropyresults cannot be compared to those usually re-ported on this dataset

52 Chinese

Since Chinese orthography does not mark spacesbetween words there have been a number of effortsto annotate word boundaries We evaluate againsttwo corpora that have been manually segmentedaccording different segmentation standards

Beijing University Corpus (PKU) The BeijingUniversity Corpus was one of the corpora usedfor the International Chinese Word SegmentationBakeoff (Emerson 2005)

Chinese Penn Treebank (CTB) We use thePenn Chinese Treebank Version 51 (Xue et al2005) It generally has a coarser segmentation thanPKU (eg in CTB a full name consisting of agiven name and family name is a single token)and it is a larger corpus

53 Image Caption Dataset

To assess whether jointly learning about meaningsof words from non-linguistic context affects seg-mentation performance we use image and captionpairs from the COCO caption dataset (Lin et al2014) We use 10000 examples for both trainingand testing and we only use one reference per im-age The images are used to be conditional contextto predict captions Refer to Appendix B for thedataset construction process

6 Experiments

We compare our model to benchmark Bayesianmodels which are currently the best known unsu-pervised word discovery models as well as to asimple deterministic segmentation criterion basedon surprisal peaks (Elman 1990) on language mod-eling and segmentation performance Althoughthe Bayeisan models are shown to able to discoverplausible word-like units we found that a set of hy-perparameters that provides best performance withsuch model on language modeling does not pro-duce good structures as reported in previous worksThis is problematic since there is no objective cri-teria to find hyperparameters in fully unsupervisedmanner when the model is applied to completely

unknown languages or domains Thus our experi-ments are designed to assess how well the modelsinfers word segmentations of unsegmented inputswhen they are trained and tuned to maximize thelikelihood of the held-out text

DPHDP Benchmarks Among the most effec-tive existing word segmentation models are thosebased on hierarchical Dirichlet process (HDP) mod-els (Goldwater et al 2009 Teh et al 2006) and hi-erarchical PitmanndashYor processes (Mochihashi et al2009) As a representative of these we use a simplebigram HDP model

θmiddot sim DP(α0 p0)

θmiddot|s sim DP(α1 θmiddot) foralls isin Σlowast

st+1 | st sim Categorical(θmiddot|st)

The base distribution p0 is defined over strings inΣlowastcup〈S〉 by deciding with a specified probabilityto end the utterance a geometric length model anda uniform probability over Σ at a each position In-tuitively it captures the preference for having shortwords in the lexicon In addition to the HDP modelwe also evaluate a simpler single Dirichlet process(DP) version of the model in which the strsquos aregenerated directly as draws from Categorical(θmiddot)We use an empirical Bayesian approach to selecthyperparameters based on the likelihood assignedby the inferred posterior to a held-out validationset Refer to Appendix D for details on inference

Deterministic Baselines Incremental word seg-mentation is inherently ambiguous (eg the let-ters the might be a single word or they might bethe beginning of the longer word theater) Never-theless several deterministic functions of prefixeshave been proposed in the literature as strategies fordiscovering rudimentary word-like units hypothe-sized for being useful for bootstrapping the lexicalacquisition process or for improving a modelrsquos pre-dictive accuracy These range from surprisal crite-ria (Elman 1990) to sophisticated language modelsthat switch between models that capture intra- andinter-word dynamics based on deterministic func-tions of prefixes of characters (Chung et al 2017Shen et al 2018)

In our experiments we also include such deter-ministic segmentation results using (1) the surprisalcriterion of Elman (1990) and (2) a two-level hi-erarchical multiscale LSTM (Chung et al 2017)which has been shown to predict boundaries in

6434

whitespace-containing character sequences at posi-tions corresponding to word boundaries As withall experiments in this paper the BR-corpora forthis experiment do not contain spaces

SNLM Model configurations and EvaluationLSTMs had 512 hidden units with parameterslearned using the Adam update rule (Kingma andBa 2015) We evaluated our models with bits-per-character (bpc) and segmentation accuracy (Brent1999 Venkataraman 2001 Goldwater et al 2009)Refer to Appendices CndashF for details of model con-figurations and evaluation metrics

For the image caption dataset we extend themodel with a standard attention mechanism in thebackbone LSTM (LSTMenc) to incorporate imagecontext For every character-input the model calcu-lates attentions over image features and use them topredict the next characters As for image represen-tations we use features from the last convolutionlayer of a pre-trained VGG19 model (Simonyanand Zisserman 2014)

7 Results

In this section we first do a careful comparison ofsegmentation performance on the phonemic Brentcorpus (BR-phono) across several different segmen-tation baselines and we find that our model obtainscompetitive segmentation performance Addition-ally ablation experiments demonstrate that bothlexical memory and the proposed expected lengthregularization are necessary for inferring good seg-mentations We then show that also on other cor-pora we likewise obtain segmentations better thanbaseline models Finally we also show that ourmodel has superior performance in terms of held-out perplexity compared to a character-level LSTMlanguage model Thus overall our results showthat we can obtain good segmentations on a vari-ety of tasks while still having very good languagemodeling performance

Word Segmentation (BR-phono) Table 1 sum-marizes the segmentation results on the widelyused BR-phono corpus comparing it to a varietyof baselines Unigram DP Bigram HDP LSTMsuprisal and HMLSTM refer to the benchmarkmodels explained in sect6 The ablated versions of ourmodel show that without the lexicon (minusmemory)without the expected length penalty (minuslength) andwithout either our model fails to discover good seg-mentations Furthermore we draw attention to the

difference in the performance of the HDP and DPmodels when using subjective settings of the hyper-parameters and the empirical settings (likelihood)Finally the deterministic baselines are interestingin two ways First LSTM surprisal is a remarkablygood heuristic for segmenting text (although wewill see below that its performance is much lessgood on other datasets) Second despite carefultuning the HMLSTM of Chung et al (2017) failsto discover good segments although in their paperthey show that when spaces are present betweenHMLSTMs learn to switch between their internalmodels in response to them

Furthermore the priors used in the DPHDPmodels were tuned to maximize the likelihood as-signed to the validation set by the inferred poste-rior predictive distribution in contrast to previouspapers which either set them subjectively or in-ferred them (Johnson and Goldwater 2009) Forexample the DP and HDP model with subjectivepriors obtained 538 and 723 F1 scores respec-tively (Goldwater et al 2009) However when thehyperparameters are set to maximize held-out like-lihood this drops obtained 561 and 569 Anotherresult on this dataset is the feature unigram modelof Berg-Kirkpatrick et al (2010) which obtainsan 880 F1 score with hand-crafted features andby selecting the regularization strength to optimizesegmentation performance Once the features areremoved the model achieved a 715 F1 score whenit is tuned on segmentation performance and only115 when it is tuned on held-out likelihood

P R F1

LSTM surprisal (Elman 1990) 545 555 550HMLSTM (Chung et al 2017) 81 133 101

Unigram DP 633 504 561Bigram HDP 530 614 569SNLM (minusmemory minuslength) 543 349 425SNLM (+memory minuslength) 524 368 433SNLM (minusmemory +length) 576 434 495SNLM (+memory +length) 813 775 793

Table 1 Summary of segmentation performance onphoneme version of the Brent Corpus (BR-phono)

Word Segmentation (other corpora) Table 2summarizes results on the BR-text (orthographicBrent corpus) and Chinese corpora As in the pre-vious section all the models were trained to maxi-

6435

mize held-out likelihood Here we observe a simi-lar pattern with the SNLM outperforming the base-line models despite the tasks being quite differentfrom each other and from the BR-phono task

P R F1

BR-text

LSTM surprisal 364 490 417Unigram DP 649 557 600Bigram HDP 525 631 573SNLM 687 789 735

PTB

LSTM surprisal 273 365 312Unigram DP 510 491 500Bigram HDP 348 473 401SNLM 541 601 569

CTB

LSTM surprisal 416 256 317Unigram DP 618 496 550Bigram HDP 673 677 675SNLM 781 815 798

PKU

LSTM surprisal 381 230 287Unigram DP 602 482 536Bigram HDP 668 671 669SNLM 750 712 731

Table 2 Summary of segmentation performance onother corpora

Word Segmentation Qualitative Analysis Weshow some representative examples of segmenta-tions inferred by various models on the BR-text andPKU corpora in Table 3 As reported in Goldwateret al (2009) we observe that the DP models tendto undersegment keep long frequent sequences to-gether (eg they failed to separate articles) HDPsdo successfully prevent oversegmentation how-ever we find that when trained to optimize held-out likelihood they often insert unnecessary bound-aries between words such as yo u Our modelrsquos per-formance is better but it likewise shows a tendencyto oversegment Interestingly we can observe a ten-dency tends to put boundaries between morphemesin morphologically complex lexical items such asdumpty rsquos and go ing Since morphemes are theminimal units that carry meaning in language thissegmentation while incorrect is at least plasuibleTurning to the Chinese examples we see that bothbaseline models fail to discover basic words suchas山间 (mountain) and人们 (human)

Finally we observe that none of the models suc-cessfully segment dates or numbers containing mul-

tiple digits (all oversegment) Since number typestend to be rare they are usually not in the lexiconmeaning our model (and the HDP baselines) mustgenerate them as character sequences

Language Modeling Performance The aboveresults show that the SNLM infers good word seg-mentations We now turn to the question of howwell it predicts held-out data Table 4 summa-rizes the results of the language modeling exper-iments Again we see that SNLM outperformsthe Bayesian models and a character LSTM Al-though there are numerous extensions to LSTMs toimprove language modeling performance LSTMsremain a strong baseline (Melis et al 2018)

One might object that because of the lexiconthe SNLM has many more parameters than thecharacter-level LSTM baseline model Howeverunlike parameters in LSTM recurrence which areused every timestep our memory parameters areaccessed very sparsely Furthermore we observedthat an LSTM with twice the hidden units did notimprove the baseline with 512 hidden units on bothphonemic and orthographic versions of Brent cor-pus but the lexicon could This result suggests morehidden units are useful if the model does not haveenough capacity to fit larger datasets but that thememory structure adds other dynamics which arenot captured by large recurrent networks

Multimodal Word Segmentation Finally wediscuss results on word discovery with non-linguistic context (image) Although there is muchevidence that neural networks can reliably learnto exploit additional relevant context to improvelanguage modeling performance (eg machinetranslation and image captioning) it is still unclearwhether the conditioning context help to discoverstructure in the data We turn to this question hereTable 5 summarizes language modeling and seg-mentation performance of our model and a baselinecharacter-LSTM language model on the COCO im-age caption dataset We use the Elman Entropycriterion to infer the segmentation points from thebaseline LM and the MAP segmentation underour model Again we find our model outperformsthe baseline model in terms of both language mod-eling and word segmentation accuracy Interest-ingly we find while conditioning on image contextleads to reductions in perplexity in both modelsin our model the presence of the image further im-proves segmentation accuracy This suggests that

6436

Examples

BR-text

Reference are you going to make him pretty this morningUnigram DP areyou goingto makehim pretty this morningBigram HDP areyou go ingto make him p retty this mo rn ingSNLM are you go ing to make him pretty this morning

Reference would you like to do humpty dumptyrsquos buttonUnigram DP wouldyoul iketo do humpty dumpty rsquos buttonBigram HDP would youlike to do humptyd umpty rsquos butt onSNLM would you like to do humpty dumpty rsquos button

PKU

Reference 笑声 掌声 欢呼声 在 山间 回荡 勾 起 了 人们 对 往事 的 回忆 Unigram DP 笑声 掌声 欢呼 声 在 山 间 回荡 勾 起了 人们对 往事 的 回忆 Bigram HDP 笑 声 掌声 欢 呼声 在 山 间 回 荡 勾 起了 人 们对 往事 的 回忆 SNLM 笑声 掌声 欢呼声 在 山间 回荡 勾起 了 人们 对 往事 的 回忆

Reference 不得 在 江河 电缆 保护区 内 抛锚 拖锚 炸鱼 挖沙 Unigram DP 不得 在 江河电缆 保护 区内抛锚 拖锚 炸鱼挖沙 Bigram HDP 不得 在 江 河 电缆 保护 区内 抛 锚拖 锚 炸鱼 挖沙 SNLM 不得 在 江河 电缆 保护区 内 抛锚 拖锚 炸鱼 挖沙

Table 3 Examples of predicted segmentations on English and Chinese

BR-text BR-phono PTB CTB PKU

Unigram DP 233 293 225 616 688Bigram HDP 196 255 180 540 642LSTM 203 262 165 494 620

SNLM 194 254 156 484 589

Table 4 Test language modeling performance (bpc)

our model and its learning mechanism interact withthe conditional context differently than the LSTMdoes

To understand what kind of improvements insegmentation performance the image context leadsto we annotated the tokens in the references withpart-of-speech (POS) tags and compared relativeimprovements on recall between SNLM (minusimage)and SNLM (+image) among the five POS tagswhich appear more than 10000 times We ob-served improvements on ADJ (+45) NOUN(+41) VERB (+31) The improvements onthe categories ADP (+05) and DET (+03)are were more limited The categories where wesee the largest improvement in recall correspondto those that are likely a priori to correlate mostreliably with observable features Thus this resultis consistent with a hypothesis that the lexican issuccessfully acquiring knowledge about how wordsidiosyncratically link to visual features

Segmentation State-of-the-Art The results re-ported are not the best-reported numbers on the En-

bpcdarr P uarr R uarr F1uarr

Unigram DP 223 440 400 419Bigram HDP 168 309 408 351LSTM (minusimage) 155 313 382 344SNLM (minusimage) 152 398 553 463

LSTM (+image) 142 317 391 350SNLM (+image) 138 464 620 531

Table 5 Language modeling (bpc) and segmentationaccuracy on COCO dataset +image indicates that themodel has access to image context

glish phoneme or Chinese segmentation tasks Aswe discussed in the introduction previous work hasfocused on segmentation in isolation from languagemodeling performance Models that obtain bettersegmentations include the adaptor grammars (F1870) of Johnson and Goldwater (2009) and thefeature-unigram model (880) of Berg-Kirkpatricket al (2010) While these results are better in termsof segmentation they are weak language models(the feature unigram model is effectively a unigramword model the adaptor grammar model is effec-tively phrasal unigram model both are incapable ofgeneralizing about substantially non-local depen-dencies) Additionally the features and grammarsused in prior work reflect certain English-specificdesign considerations (eg syllable structure in thecase of adaptor grammars and phonotactic equiva-lence classes in the feature unigram model) whichmake them questionable models if the goal is to ex-

6437

plore what models and biases enable word discov-ery in general For Chinese the best nonparametricmodels perform better at segmentation (Zhao andKit 2008 Mochihashi et al 2009) but again theyare weaker language models than neural modelsThe neural model of Sun and Deng (2018) is similarto our model without lexical memory or length reg-ularization it obtains 802 F1 on the PKU datasethowever it uses gold segmentation data duringtraining and hyperparameter selection4 whereasour approach requires no gold standard segmenta-tion data

8 Related Work

Learning to discover and represent temporally ex-tended structures in a sequence is a fundamentalproblem in many fields For example in languageprocessing unsupervised learning of multiple lev-els of linguistic structures such as morphemes (Sny-der and Barzilay 2008) words (Goldwater et al2009 Mochihashi et al 2009 Wang et al 2014)and phrases (Klein and Manning 2001) have beeninvestigated Recently speech recognition has ben-efited from techniques that enable the discoveryof subword units (Chan et al 2017 Wang et al2017) however in that work the optimally dis-covered character sequences look quite unlike or-thographic words In fact the model proposed byWang et al (2017) is essentially our model with-out a lexicon or the expected length regularizationie (minusmemory minuslength) which we have shownperforms quite poorly in terms of segmentation ac-curacy Finally some prior work has also sought todiscover lexical units directly from speech basedon speech-internal statistical regularities (Kam-per et al 2016) as well as jointly with ground-ing (Chrupała et al 2017)

9 Conclusion

Word discovery is a fundamental problem in language acquisition. While work studying the problem in isolation has provided valuable insights (showing both what data is sufficient for word discovery and with which models), this paper shows that neural models offer the flexibility and performance to productively study the various facets of the problem in a more unified model. While this work unifies several components that had previously been studied in isolation, our model assumes access to phonetic categories. The development of these categories likely interacts with the development of the lexicon and acquisition of semantics (Feldman et al., 2013; Fourtassi and Dupoux, 2014), and thus subsequent work should seek to unify more aspects of the acquisition problem.

Acknowledgments

We thank Mark Johnson, Sharon Goldwater, and Emmanuel Dupoux, as well as many colleagues at DeepMind, for their insightful comments and suggestions for improving this work and the resulting paper.

References

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proc. NAACL.

Michael R. Brent. 1999. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34(1):71–105.

Michael R. Brent and Timothy A. Cartwright. 1996. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61(1):93–125.

William Chan, Yu Zhang, Quoc Le, and Navdeep Jaitly. 2017. Latent sequence decompositions. In Proc. ICLR.

Grzegorz Chrupała, Lieke Gelderloos, and Afra Alishahi. 2017. Representations of language in a model of visually grounded speech signal.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2017. Hierarchical multiscale recurrent neural networks. In Proc. ICLR.

Jason Eisner. 2002. Parameter estimation for probabilistic finite-state transducers. In Proc. ACL.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Thomas Emerson. 2005. The second international Chinese word segmentation bakeoff. In Proc. SIGHAN Workshop.

Naomi H. Feldman, Thomas L. Griffiths, Sharon Goldwater, and James L. Morgan. 2013. A role for the developing lexicon in phonetic category acquisition. Psychological Review, 120(4):751–778.

Abdellah Fourtassi and Emmanuel Dupoux. 2014. A rudimentary lexicon and semantics help bootstrap phoneme acquisition. In Proc. EMNLP.

Lieke Gelderloos and Grzegorz Chrupała. 2016. From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning. In Proc. COLING.

Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Mark Johnson, Katherine Demuth, Michael Frank, and Bevan K. Jones. 2010. Synergies in learning words and their referents. In Proc. NIPS.

Mark Johnson and Sharon Goldwater. 2009. Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proc. NAACL, pages 317–325.

Ákos Kádár, Marc-Alexandre Côté, Grzegorz Chrupała, and Afra Alishahi. 2018. Revisiting the hierarchical multiscale LSTM. In Proc. COLING.

Herman Kamper, Aren Jansen, and Sharon Goldwater. 2016. Unsupervised word segmentation and lexicon discovery using acoustic word embeddings. IEEE Transactions on Audio, Speech and Language Processing, 24(4):669–679.

Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2017. Learning to create and reuse words in open-vocabulary neural language modeling. In Proc. ACL.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. ICLR.

Dan Klein and Christopher D. Manning. 2001. Distributional phrase structure induction. In Proc. ACL Workshop.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proc. EMNLP.

Percy Liang and Dan Klein. 2009. Online EM for unsupervised models. In Proc. NAACL.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proc. ECCV, pages 740–755.

Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kociský, Andrew Senior, Fumin Wang, and Phil Blunsom. 2017. Latent predictor networks for code generation. In Proc. ACL.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In Proc. ICLR.

Sebastian J. Mielke and Jason Eisner. 2018. Spell once, summon anywhere: A two-level open-vocabulary language model. In Proc. NAACL.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proc. Interspeech.

Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman–Yor language modeling.

Steven Pinker. 1984. Language Learnability and Language Development. Harvard University Press.

Okko Räsänen and Heikki Rasilo. 2015. A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122(4):792–829.

Jenny R. Saffran, Richard N. Aslin, and Elissa L. Newport. 1996. Statistical learning by 8-month-old infants. Science, 274(5294):1926–1928.

Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018. Neural language modeling by jointly learning syntax and lexicon. In Proc. ICLR.

K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

Benjamin Snyder and Regina Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In Proc. ACL.

Zhiqing Sun and Zhi-Hong Deng. 2018. Unsupervised neural word segmentation for Chinese via segmental language modeling.

Yee-Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Anand Venkataraman. 2001. A statistical model for word discovery in transcribed speech. Computational Linguistics, 27(3):351–372.

Chong Wang, Yining Wan, Po-Sen Huang, Abdelrahman Mohammad, Dengyong Zhou, and Li Deng. 2017. Sequence modeling via segmentations. In Proc. ICML.

Xiaolin Wang, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2014. Empirical study of unsupervised Chinese word segmentation methods for SMT on large-scale corpora.

Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Marta Palmer. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Shun-Zheng Yu. 2010. Hidden semi-Markov models. Artificial Intelligence, 174(2):215–243.

Hai Zhao and Chunyu Kit. 2008. An empirical comparison of goodness measures for unsupervised Chinese word segmentation with a unified framework. In Proc. IJCNLP.

A Dataset Statistics

Table 6 summarizes dataset statistics.

B Image Caption Dataset Construction

We use 8,000, 2,000, and 10,000 images for the train, development, and test sets, respectively, in order of the integer ids specifying images in cocoapi (https://github.com/cocodataset/cocoapi), and use the first annotation provided for each image. We will make pairs of image id and annotation id available from https://s3.eu-west-2.amazonaws.com/k-kawakami/seg.zip.
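The following is a minimal sketch of this construction using the pycocotools API; the annotation file path is an assumption, and the id ordering and split sizes follow the description above.

    from pycocotools.coco import COCO

    # Assumed location of the standard COCO captions annotation file.
    coco = COCO("annotations/captions_train2014.json")

    # Order images by their integer ids, then split 8,000 / 2,000 / 10,000.
    img_ids = sorted(coco.getImgIds())
    train_ids = img_ids[:8000]
    dev_ids = img_ids[8000:10000]
    test_ids = img_ids[10000:20000]

    def first_caption(image_id):
        # Use the first caption annotation provided for the image.
        ann_ids = coco.getAnnIds(imgIds=[image_id])
        return coco.loadAnns(ann_ids)[0]["caption"]

    train_pairs = [(i, first_caption(i)) for i in train_ids]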

C SNLM Model Configuration

For each RNN-based model, we used 512 dimensions for the character embeddings, and the LSTMs have 512 hidden units. All the parameters, including character projection parameters, are randomly sampled from a uniform distribution from −0.08 to 0.08. The initial hidden and memory states of the LSTMs are initialized with zero. A dropout rate of 0.5 was used for all but the recurrent connections.

To restrict the size of the memory, we stored substrings which appeared at least F times in the training corpora and tuned F with a grid search. The maximum length of subsequences L was tuned on the held-out likelihood using a grid search. Tab. 7 summarizes the parameters for each dataset. Note that we did not tune the hyperparameters on segmentation quality, to ensure that the models are trained in a purely unsupervised manner, assuming no reference segmentations are available.
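A minimal sketch of how such a memory could be assembled from the training data, assuming the memory is simply the set of substrings of length at most L that occur at least F times; the function name and usage are illustrative.

    from collections import Counter

    def build_lexical_memory(sentences, max_len, min_freq):
        # Count every substring of length <= max_len in the training sentences
        # and keep those that occur at least min_freq times.
        counts = Counter()
        for x in sentences:
            for i in range(len(x)):
                for j in range(i + 1, min(i + max_len, len(x)) + 1):
                    counts[x[i:j]] += 1
        return {s for s, c in counts.items() if c >= min_freq}

    # Illustrative call with the PTB settings from Tab. 7 (L = 10, F = 100):
    # memory = build_lexical_memory(train_sentences, max_len=10, min_freq=100)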

D DP/HDP Inference

By integrating out the draws from the DPs, it is possible to do inference using Gibbs sampling directly in the space of segmentation decisions. We use 1,000 iterations with annealing to find an approximation of the MAP segmentation, and then use the corresponding posterior predictive distribution to estimate the held-out likelihood assigned by the model, marginalizing the segmentations using appropriate dynamic programs. The evaluated segmentation was the most probable segmentation according to the posterior predictive distribution.
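For concreteness, under the unigram DP model the posterior predictive probability of a word given the counts in the current sampled segmentation has the usual Chinese restaurant process form; the sketch below is only illustrative and elides the bigram/HDP case.

    def dp_posterior_predictive(word, counts, total, alpha0, p0):
        # Predictive probability of `word` under a unigram DP with concentration
        # alpha0 and base distribution p0, given `total` observed word tokens.
        return (counts.get(word, 0) + alpha0 * p0(word)) / (total + alpha0)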

In the original Bayesian segmentation work, the hyperparameters (i.e., α0, α1, and the components of p0) were selected subjectively. To make the comparison with our neural models fairer, we instead used an empirical approach and set them using the held-out likelihood of the validation set. However, since this disadvantages the DP/HDP models in terms of segmentation, we also report the original results on the BR corpora.

E Learning

The models were trained with the Adam update rule (Kingma and Ba, 2015) with a learning rate of 0.01. The learning rate is divided by 4 if there is no improvement on development data. The maximum norm of the gradients was clipped at 1.0.
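A minimal PyTorch-style sketch of this optimization setup, assuming a model, training batches, and a development-loss function that are not shown; the epoch count is illustrative.

    import torch

    # `model`, `train_batches`, and `dev_loss` are assumed to exist elsewhere.
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    best_dev = float("inf")
    num_epochs = 20  # illustrative

    for epoch in range(num_epochs):
        for batch in train_batches:
            loss = model(batch)  # penalized negative log likelihood of the batch
            optimizer.zero_grad()
            loss.backward()
            # Clip the maximum gradient norm, as described above.
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
        current = dev_loss(model)
        if current < best_dev:
            best_dev = current
        else:
            # Divide the learning rate by 4 when development loss stops improving.
            for group in optimizer.param_groups:
                group["lr"] /= 4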

F Evaluation Metrics

Language Modeling We evaluated our models with bits-per-character (bpc), a standard evaluation metric for character-level language models. Following the definition in Graves (2013), bits-per-character is the average value of −log2 p(xt | x<t) over the whole test set,

bpc = −(1/|x|) log2 p(x),

where |x| is the length of the corpus in characters. The bpc is reported on the test set.
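A small sketch of this computation, assuming the model's total log probability of the corpus is available in natural log units; names are illustrative.

    import math

    def bits_per_character(total_log_prob_nats, num_chars):
        # bpc = -(1/|x|) * log2 p(x); convert from natural log to log base 2.
        return -total_log_prob_nats / (num_chars * math.log(2))

    # Example: a 1,000-character corpus with log p(x) = -1400 nats gives ~2.02 bpc.
    print(bits_per_character(-1400.0, 1000))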

Segmentation We also evaluated segmentation quality in terms of precision, recall, and F1 of word tokens (Brent, 1999; Venkataraman, 2001; Goldwater et al., 2009). To get credit for a word, the models must correctly identify both the left and right boundaries. For example, consider the following pair of a reference segmentation and a prediction:

Reference: do you see a boy
Prediction: doyou see a boy

Here 4 words are discovered in the prediction, where the reference has 5 words, and 3 words in the prediction match the reference. In this case, we report scores as precision = 75.0 (3/4), recall = 60.0 (3/5), and F1, the harmonic mean of precision and recall, 66.7 (2/3). To facilitate comparison with previous work, segmentation results are reported on the union of the training, validation, and test sets.
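The sketch below computes these token-level scores exactly as in the example above; it assumes segmentations are given as lists of tokens over the same underlying character sequence.

    def spans(words):
        # Character spans of each token in the concatenated, unsegmented string.
        out, pos = [], 0
        for w in words:
            out.append((pos, pos + len(w)))
            pos += len(w)
        return out

    def segmentation_prf(reference, prediction):
        # A predicted token is correct only if both its boundaries match a
        # reference token, i.e., the spans are identical.
        ref, pred = set(spans(reference)), set(spans(prediction))
        correct = len(ref & pred)
        precision = correct / len(pred)
        recall = correct / len(ref)
        f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
        return precision, recall, f1

    # The example above: precision 3/4, recall 3/5, F1 2/3.
    print(segmentation_prf("do you see a boy".split(), "doyou see a boy".split()))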


              Sentences               Char Types         Word Types               Characters            Average Word Length
           Train  Valid   Test     Train Valid Test    Train  Valid   Test      Train  Valid  Test      Train Valid  Test

BR-text     7832    979    979       30    30    29     1237    473    475      129k    16k   16k       3.82  4.06  3.83
BR-phono    7832    978    978       51    51    50     1183    457    462      104k    13k   13k       2.86  2.97  2.83
PTB        42068   3370   3761       50    50    48    10000   6022   6049      5.1M   400k  450k       4.44  4.37  4.41
CTB        50734    349    345      160    76    76    60095   1769   1810      3.1M    18k   22k       4.84  5.07  5.14
PKU        17149   1841   1790       90    84    87    52539  13103  11665      2.6M   247k  241k       4.93  4.94  4.85
COCO        8000   2000  10000       50    42    48     4390   2260   5072      417k   104k  520k       4.00  3.99  3.99

Table 6: Summary of dataset statistics.

           max len (L)   min freq (F)   λ

BR-text        10             10        7.5e-4
BR-phono       10             10        9.5e-4
PTB            10            100        5.0e-5
CTB             5             25        1.0e-2
PKU             5             25        9.0e-3
COCO           10            100        2.0e-4

Table 7: Hyperparameter values used.





8 Related Work

Learning to discover and represent temporally ex-tended structures in a sequence is a fundamentalproblem in many fields For example in languageprocessing unsupervised learning of multiple lev-els of linguistic structures such as morphemes (Sny-der and Barzilay 2008) words (Goldwater et al2009 Mochihashi et al 2009 Wang et al 2014)and phrases (Klein and Manning 2001) have beeninvestigated Recently speech recognition has ben-efited from techniques that enable the discoveryof subword units (Chan et al 2017 Wang et al2017) however in that work the optimally dis-covered character sequences look quite unlike or-thographic words In fact the model proposed byWang et al (2017) is essentially our model with-out a lexicon or the expected length regularizationie (minusmemory minuslength) which we have shownperforms quite poorly in terms of segmentation ac-curacy Finally some prior work has also sought todiscover lexical units directly from speech basedon speech-internal statistical regularities (Kam-per et al 2016) as well as jointly with ground-ing (Chrupała et al 2017)

9 Conclusion

Word discovery is a fundamental problem in lan-guage acquisition While work studying the prob-lem in isolation has provided valuable insights(showing both what data is sufficient for word dis-covery with which models) this paper shows thatneural models offer the flexibility and performanceto productively study the various facets of the prob-lem in a more unified model While this work uni-fies several components that had previously been

4httpsgithubcomEdward-SunSLMblobd37ad735a7b1d5af430b96677c2ecf37a65f59b7codesrunpyL329

studied in isolation our model assumes access tophonetic categories The development of thesecategories likely interact with the development ofthe lexicon and acquisition of semantics (Feldmanet al 2013 Fourtassi and Dupoux 2014) and thussubsequent work should seek to unify more aspectsof the acquisition problem

AcknowledgmentsWe thank Mark Johnson Sharon Goldwater andEmmanuel Dupoux as well as many colleagues atDeepMind for their insightful comments and sug-gestions for improving this work and the resultingpaper

ReferencesTaylor Berg-Kirkpatrick Alexandre Bouchard-Cocircteacute

John DeNero and Dan Klein 2010 Painless unsu-pervised learning with features In Proc NAACL

Michael R Brent 1999 An efficient probabilisticallysound algorithm for segmentation and word discov-ery Machine Learning 34(1)71ndash105

Michael R Brent and Timothy A Cartwright 1996 Dis-tributional regularity and phonotactic constraints areuseful for segmentation Cognition 61(1)93ndash125

William Chan Yu Zhang Quoc Le and Navdeep Jaitly2017 Latent sequence decompositions In ProcICLR

Grzegorz Chrupała Lieke Gelderloos and Afra Al-ishahi 2017 Representations of language in amodel of visually grounded speech signal

Junyoung Chung Sungjin Ahn and Yoshua Bengio2017 Hierarchical multiscale recurrent neural net-works In Proc ICLR

Jason Eisner 2002 Parameter estimation for proba-bilistic finite-state transducers In Proc ACL

Jeffrey L Elman 1990 Finding structure in time Cog-nitive science 14(2)179ndash211

Thomas Emerson 2005 The second international Chi-nese word segmentation bakeoff In Proc SIGHANWorkshop

Naomi H Feldman Thomas L Griffiths Sharon Gold-water and James L Morgan 2013 A role for thedeveloping lexicon in phonetic category acquisitionPsychological Review 120(4)751ndash778

Abdellah Fourtassi and Emmanuel Dupoux 2014 Arudimentary lexicon and semantics help bootstrapphoneme acquisition In Proc EMNLP

6438

Lieke Gelderloos and Grzegorz Chrupała 2016 Fromphonemes to images levels of representation in a re-current neural model of visually-grounded languagelearning In Proc COLING

Sharon Goldwater Thomas L Griffiths and Mark John-son 2009 A Bayesian framework for word segmen-tation Exploring the effects of context Cognition112(1)21ndash54

Alex Graves 2012 Sequence transduction withrecurrent neural networks arXiv preprintarXiv12113711

Alex Graves 2013 Generating sequences withrecurrent neural networks arXiv preprintarXiv13080850

Mark Johnson Katherine Demuth Michael Frank andBevan K Jones 2010 Synergies in learning wordsand their referents In Proc NIPS

Mark Johnson and Sharon Goldwater 2009 Improvingnonparameteric bayesian inference experiments onunsupervised word segmentation with adaptor gram-mars In Proc NAACL pages 317ndash325

Aacutekos Kaacutedaacuter Marc-Alexandre Cocircteacute Grzegorz Chru-pała and Afra Alishahi 2018 Revisiting the hier-archical multiscale LSTM In Proc COLING

Herman Kamper Aren Jansen and Sharon Goldwater2016 Unsupervised word segmentation and lexi-con induction discovery using acoustic word embed-dings IEEE Transactions on Audio Speech andLanguage Processing 24(4)669ndash679

Kazuya Kawakami Chris Dyer and Phil Blunsom2017 Learning to create and reuse words in open-vocabulary neural language modeling In ProcACL

Diederik Kingma and Jimmy Ba 2015 Adam Amethod for stochastic optimization In Proc ICLR

Dan Klein and Christopher D Manning 2001 Distribu-tional phrase structure induction In Workshop ProcACL

Zhifei Li and Jason Eisner 2009 First-and second-order expectation semirings with applications tominimum-risk training on translation forests InProc EMNLP

Percy Liang and Dan Klein 2009 Online EM for un-supervised models In Proc NAACL

Tsung-Yi Lin Michael Maire Serge Belongie JamesHays Pietro Perona Deva Ramanan Piotr Dollaacuterand C Lawrence Zitnick 2014 Microsoft COCOCommon objects in context In Proc ECCV pages740ndash755

Wang Ling Edward Grefenstette Karl Moritz Her-mann Tomaacuteš Kociskyacute Andrew Senior FuminWang and Phil Blunsom 2017 Latent predictor net-works for code generation In Proc ACL

Gaacutebor Melis Chris Dyer and Phil Blunsom 2018 Onthe state of the art of evaluation in neural languagemodels In Proc ICLR

Sebastian J Mielke and Jason Eisner 2018 Spell oncesummon anywhere A two-level open-vocabularylanguage model In Proc NAACL

Tomas Mikolov Martin Karafiaacutet Lukas Burget JanCernocky and Sanjeev Khudanpur 2010 Recurrentneural network based language model In Proc In-terspeech

Daichi Mochihashi Takeshi Yamada and NaonoriUeda 2009 Bayesian unsupervised word segmen-tation with nested PitmanndashYor language modeling

Steven Pinker 1984 Language learnability and lan-guage development Harvard University Press

Okko Raumlsaumlnen and Heikki Rasilo 2015 A jointmodel of word segmentation and meaning acquisi-tion through cross-situational learning Psychologi-cal Review 122(4)792ndash829

Jenny R Saffran Richard N Aslin and Elissa L New-port 1996 Statistical learning by 8-month-old in-fants Science 274(5294)1926ndash1928

Yikang Shen Zhouhan Lin Chin-Wei Huang andAaron Courville 2018 Neural language modelingby jointly learning syntax and lexicon In ProcICLR

K Simonyan and A Zisserman 2014 Very deep con-volutional networks for large-scale image recogni-tion CoRR abs14091556

Benjamin Snyder and Regina Barzilay 2008 Unsuper-vised multilingual learning for morphological seg-mentation In Proc ACL

Zhiqing Sun and Zhi-Hong Deng 2018 Unsupervisedneural word segmentation for Chinese via segmentallanguage modeling

Yee-Whye Teh Michael I Jordan Matthew J Bealand Daivd M Blei 2006 Hierarchical Dirichlet pro-cesses Journal of the American Statistical Associa-tion 101(476)1566ndash1581

Anand Venkataraman 2001 A statistical model forword discovery in transcribed speech Computa-tional Linguistics 27(3)351ndash372

Chong Wang Yining Wan Po-Sen Huang Abdelrah-man Mohammad Dengyong Zhou and Li Deng2017 Sequence modeling via segmentations InProc ICML

Xiaolin Wang Masao Utiyama Andrew Finch and Ei-ichiro Sumita 2014 Empirical study of unsuper-vised Chinese word segmentation methods for SMTon large-scale corpora

6439

Naiwen Xue Fei Xia Fu-Dong Chiou and MartaPalmer 2005 The Penn Chinese treebank Phrasestructure annotation of a large corpus Natural lan-guage engineering 11(2)207ndash238

Shun-Zheng Yu 2010 Hidden semi-Markov modelsArtificial Intelligence 174(2)215ndash243

Hai Zhao and Chunyu Kit 2008 An empirical com-parison of goodness measures for unsupervised Chi-nese word segmentation with a unified frameworkIn Proc IJCNLP

6440

A Dataset statistics

Table 6 summarizes dataset statistics

B Image Caption Dataset Construction

We use 8000 2000 and 10000 images fortrain development and test set in order of in-teger ids specifying image in cocoapi5 and usefirst annotation provided for each image Wewill make pairs of image id and annotationid available from httpss3eu-west-2amazonawscomk-kawakamisegzip

C SNLM Model Configuration

For each RNN based model we used 512 dimen-sions for the character embeddings and the LSTMshave 512 hidden units All the parameters includ-ing character projection parameters are randomlysampled from uniform distribution from minus008 to008 The initial hidden and memory state of theLSTMs are initialized with zero A dropout rate of05 was used for all but the recurrent connections

To restrict the size of memory we stored sub-strings which appeared F -times in the training cor-pora and tuned F with grid search The maximumlength of subsequencesLwas tuned on the held-outlikelihood using a grid search Tab 7 summarizesthe parameters for each dataset Note that we didnot tune the hyperparameters on segmentation qual-ity to ensure that the models are trained in a purelyunsupervised manner assuming no reference seg-mentations are available

D DPHDP Inference

By integrating out the draws from the DPrsquos it ispossible to do inference using Gibbs sampling di-rectly in the space of segmentation decisions Weuse 1000 iterations with annealing to find an ap-proximation of the MAP segmentation and thenuse the corresponding posterior predictive distribu-tion to estimate the held-out likelihood assignedby the model marginalizing the segmentations us-ing appropriate dynamic programs The evaluatedsegmentation was the most probable segmentationaccording to the posterior predictive distribution

In the original Bayesian segmentation work thehyperparameters (ie α0 α1 and the componentsof p0) were selected subjectively To make com-parison with our neural models fairer we insteadused an empirical approach and set them using the

5httpsgithubcomcocodatasetcocoapi

held-out likelihood of the validation set Howeversince this disadvantages the DPHDP models interms of segmentation we also report the originalresults on the BR corpora

E Learning

The models were trained with the Adam updaterule (Kingma and Ba 2015) with a learning rate of001 The learning rate is divided by 4 if there is noimprovement on development data The maximumnorm of the gradients was clipped at 10

F Evaluation Metrics

Language Modeling We evaluated our modelswith bits-per-character (bpc) a standard evalua-tion metric for character-level language modelsFollowing the definition in Graves (2013) bits-per-character is the average value ofminus log2 p(xt | xltt)over the whole test set

bpc = minus 1

|x|log2 p(x)

where |x| is the length of the corpus in charactersThe bpc is reported on the test set

Segmentation We also evaluated segmentationquality in terms of precision recall and F1 of wordtokens (Brent 1999 Venkataraman 2001 Gold-water et al 2009) To get credit for a word themodels must correctly identify both the left andright boundaries For example if there is a pair ofa reference segmentation and a prediction

Reference do you see a boy

Prediction doyou see a boy

then 4 words are discovered in the prediction wherethe reference has 5 words 3 words in the predictionmatch with the reference In this case we reportscores as precision = 750 (34) recall = 600 (35)and F1 the harmonic mean of precision and recall667 (23) To facilitate comparison with previ-ous work segmentation results are reported on theunion of the training validation and test sets

6441

Sentence Char Types Word Types Characters Average Word Length

Train Valid Test Train Valid Test Train Valid Test Train Valid Test Train Valid Test

BR-text 7832 979 979 30 30 29 1237 473 475 129k 16k 16k 382 406 383BR-phono 7832 978 978 51 51 50 1183 457 462 104k 13k 13k 286 297 283PTB 42068 3370 3761 50 50 48 10000 6022 6049 51M 400k 450k 444 437 441CTB 50734 349 345 160 76 76 60095 1769 1810 31M 18k 22k 484 507 514PKU 17149 1841 1790 90 84 87 52539 13103 11665 26M 247k 241k 493 494 485COCO 8000 2000 10000 50 42 48 4390 2260 5072 417k 104k 520k 400 399 399

Table 6 Summary of Dataset Statistics

max len (L) min freq (F) λ

BR-text 10 10 75e-4BR-phono 10 10 95e-4PTB 10 100 50e-5CTB 5 25 10e-2PKU 5 25 90e-3COCO 10 100 20e-4

Table 7 Hyperparameter values used

Page 6: Learning to Discover, Ground and Use Words with Segmental

6434

whitespace-containing character sequences at posi-tions corresponding to word boundaries As withall experiments in this paper the BR-corpora forthis experiment do not contain spaces

SNLM Model configurations and EvaluationLSTMs had 512 hidden units with parameterslearned using the Adam update rule (Kingma andBa 2015) We evaluated our models with bits-per-character (bpc) and segmentation accuracy (Brent1999 Venkataraman 2001 Goldwater et al 2009)Refer to Appendices CndashF for details of model con-figurations and evaluation metrics

For the image caption dataset we extend themodel with a standard attention mechanism in thebackbone LSTM (LSTMenc) to incorporate imagecontext For every character-input the model calcu-lates attentions over image features and use them topredict the next characters As for image represen-tations we use features from the last convolutionlayer of a pre-trained VGG19 model (Simonyanand Zisserman 2014)

7 Results

In this section we first do a careful comparison ofsegmentation performance on the phonemic Brentcorpus (BR-phono) across several different segmen-tation baselines and we find that our model obtainscompetitive segmentation performance Addition-ally ablation experiments demonstrate that bothlexical memory and the proposed expected lengthregularization are necessary for inferring good seg-mentations We then show that also on other cor-pora we likewise obtain segmentations better thanbaseline models Finally we also show that ourmodel has superior performance in terms of held-out perplexity compared to a character-level LSTMlanguage model Thus overall our results showthat we can obtain good segmentations on a vari-ety of tasks while still having very good languagemodeling performance

Word Segmentation (BR-phono) Table 1 sum-marizes the segmentation results on the widelyused BR-phono corpus comparing it to a varietyof baselines Unigram DP Bigram HDP LSTMsuprisal and HMLSTM refer to the benchmarkmodels explained in sect6 The ablated versions of ourmodel show that without the lexicon (minusmemory)without the expected length penalty (minuslength) andwithout either our model fails to discover good seg-mentations Furthermore we draw attention to the

difference in the performance of the HDP and DPmodels when using subjective settings of the hyper-parameters and the empirical settings (likelihood)Finally the deterministic baselines are interestingin two ways First LSTM surprisal is a remarkablygood heuristic for segmenting text (although wewill see below that its performance is much lessgood on other datasets) Second despite carefultuning the HMLSTM of Chung et al (2017) failsto discover good segments although in their paperthey show that when spaces are present betweenHMLSTMs learn to switch between their internalmodels in response to them

Furthermore the priors used in the DPHDPmodels were tuned to maximize the likelihood as-signed to the validation set by the inferred poste-rior predictive distribution in contrast to previouspapers which either set them subjectively or in-ferred them (Johnson and Goldwater 2009) Forexample the DP and HDP model with subjectivepriors obtained 538 and 723 F1 scores respec-tively (Goldwater et al 2009) However when thehyperparameters are set to maximize held-out like-lihood this drops obtained 561 and 569 Anotherresult on this dataset is the feature unigram modelof Berg-Kirkpatrick et al (2010) which obtainsan 880 F1 score with hand-crafted features andby selecting the regularization strength to optimizesegmentation performance Once the features areremoved the model achieved a 715 F1 score whenit is tuned on segmentation performance and only115 when it is tuned on held-out likelihood

P R F1

LSTM surprisal (Elman 1990) 545 555 550HMLSTM (Chung et al 2017) 81 133 101

Unigram DP 633 504 561Bigram HDP 530 614 569SNLM (minusmemory minuslength) 543 349 425SNLM (+memory minuslength) 524 368 433SNLM (minusmemory +length) 576 434 495SNLM (+memory +length) 813 775 793

Table 1 Summary of segmentation performance onphoneme version of the Brent Corpus (BR-phono)

Word Segmentation (other corpora) Table 2summarizes results on the BR-text (orthographicBrent corpus) and Chinese corpora As in the pre-vious section all the models were trained to maxi-

6435

mize held-out likelihood Here we observe a simi-lar pattern with the SNLM outperforming the base-line models despite the tasks being quite differentfrom each other and from the BR-phono task

P R F1

BR-text

LSTM surprisal 364 490 417Unigram DP 649 557 600Bigram HDP 525 631 573SNLM 687 789 735

PTB

LSTM surprisal 273 365 312Unigram DP 510 491 500Bigram HDP 348 473 401SNLM 541 601 569

CTB

LSTM surprisal 416 256 317Unigram DP 618 496 550Bigram HDP 673 677 675SNLM 781 815 798

PKU

LSTM surprisal 381 230 287Unigram DP 602 482 536Bigram HDP 668 671 669SNLM 750 712 731

Table 2 Summary of segmentation performance onother corpora

Word Segmentation Qualitative Analysis Weshow some representative examples of segmenta-tions inferred by various models on the BR-text andPKU corpora in Table 3 As reported in Goldwateret al (2009) we observe that the DP models tendto undersegment keep long frequent sequences to-gether (eg they failed to separate articles) HDPsdo successfully prevent oversegmentation how-ever we find that when trained to optimize held-out likelihood they often insert unnecessary bound-aries between words such as yo u Our modelrsquos per-formance is better but it likewise shows a tendencyto oversegment Interestingly we can observe a ten-dency tends to put boundaries between morphemesin morphologically complex lexical items such asdumpty rsquos and go ing Since morphemes are theminimal units that carry meaning in language thissegmentation while incorrect is at least plasuibleTurning to the Chinese examples we see that bothbaseline models fail to discover basic words suchas山间 (mountain) and人们 (human)

Finally we observe that none of the models suc-cessfully segment dates or numbers containing mul-

tiple digits (all oversegment) Since number typestend to be rare they are usually not in the lexiconmeaning our model (and the HDP baselines) mustgenerate them as character sequences

Language Modeling Performance The aboveresults show that the SNLM infers good word seg-mentations We now turn to the question of howwell it predicts held-out data Table 4 summa-rizes the results of the language modeling exper-iments Again we see that SNLM outperformsthe Bayesian models and a character LSTM Al-though there are numerous extensions to LSTMs toimprove language modeling performance LSTMsremain a strong baseline (Melis et al 2018)

One might object that because of the lexiconthe SNLM has many more parameters than thecharacter-level LSTM baseline model Howeverunlike parameters in LSTM recurrence which areused every timestep our memory parameters areaccessed very sparsely Furthermore we observedthat an LSTM with twice the hidden units did notimprove the baseline with 512 hidden units on bothphonemic and orthographic versions of Brent cor-pus but the lexicon could This result suggests morehidden units are useful if the model does not haveenough capacity to fit larger datasets but that thememory structure adds other dynamics which arenot captured by large recurrent networks

Multimodal Word Segmentation Finally wediscuss results on word discovery with non-linguistic context (image) Although there is muchevidence that neural networks can reliably learnto exploit additional relevant context to improvelanguage modeling performance (eg machinetranslation and image captioning) it is still unclearwhether the conditioning context help to discoverstructure in the data We turn to this question hereTable 5 summarizes language modeling and seg-mentation performance of our model and a baselinecharacter-LSTM language model on the COCO im-age caption dataset We use the Elman Entropycriterion to infer the segmentation points from thebaseline LM and the MAP segmentation underour model Again we find our model outperformsthe baseline model in terms of both language mod-eling and word segmentation accuracy Interest-ingly we find while conditioning on image contextleads to reductions in perplexity in both modelsin our model the presence of the image further im-proves segmentation accuracy This suggests that

6436

Examples

BR-text

Reference are you going to make him pretty this morningUnigram DP areyou goingto makehim pretty this morningBigram HDP areyou go ingto make him p retty this mo rn ingSNLM are you go ing to make him pretty this morning

Reference would you like to do humpty dumptyrsquos buttonUnigram DP wouldyoul iketo do humpty dumpty rsquos buttonBigram HDP would youlike to do humptyd umpty rsquos butt onSNLM would you like to do humpty dumpty rsquos button

PKU

Reference 笑声 掌声 欢呼声 在 山间 回荡 勾 起 了 人们 对 往事 的 回忆 Unigram DP 笑声 掌声 欢呼 声 在 山 间 回荡 勾 起了 人们对 往事 的 回忆 Bigram HDP 笑 声 掌声 欢 呼声 在 山 间 回 荡 勾 起了 人 们对 往事 的 回忆 SNLM 笑声 掌声 欢呼声 在 山间 回荡 勾起 了 人们 对 往事 的 回忆

Reference 不得 在 江河 电缆 保护区 内 抛锚 拖锚 炸鱼 挖沙 Unigram DP 不得 在 江河电缆 保护 区内抛锚 拖锚 炸鱼挖沙 Bigram HDP 不得 在 江 河 电缆 保护 区内 抛 锚拖 锚 炸鱼 挖沙 SNLM 不得 在 江河 电缆 保护区 内 抛锚 拖锚 炸鱼 挖沙

Table 3 Examples of predicted segmentations on English and Chinese

BR-text BR-phono PTB CTB PKU

Unigram DP 233 293 225 616 688Bigram HDP 196 255 180 540 642LSTM 203 262 165 494 620

SNLM 194 254 156 484 589

Table 4 Test language modeling performance (bpc)

our model and its learning mechanism interact withthe conditional context differently than the LSTMdoes

To understand what kind of improvements insegmentation performance the image context leadsto we annotated the tokens in the references withpart-of-speech (POS) tags and compared relativeimprovements on recall between SNLM (minusimage)and SNLM (+image) among the five POS tagswhich appear more than 10000 times We ob-served improvements on ADJ (+45) NOUN(+41) VERB (+31) The improvements onthe categories ADP (+05) and DET (+03)are were more limited The categories where wesee the largest improvement in recall correspondto those that are likely a priori to correlate mostreliably with observable features Thus this resultis consistent with a hypothesis that the lexican issuccessfully acquiring knowledge about how wordsidiosyncratically link to visual features

Segmentation State-of-the-Art The results re-ported are not the best-reported numbers on the En-

bpcdarr P uarr R uarr F1uarr

Unigram DP 223 440 400 419Bigram HDP 168 309 408 351LSTM (minusimage) 155 313 382 344SNLM (minusimage) 152 398 553 463

LSTM (+image) 142 317 391 350SNLM (+image) 138 464 620 531

Table 5 Language modeling (bpc) and segmentationaccuracy on COCO dataset +image indicates that themodel has access to image context

glish phoneme or Chinese segmentation tasks Aswe discussed in the introduction previous work hasfocused on segmentation in isolation from languagemodeling performance Models that obtain bettersegmentations include the adaptor grammars (F1870) of Johnson and Goldwater (2009) and thefeature-unigram model (880) of Berg-Kirkpatricket al (2010) While these results are better in termsof segmentation they are weak language models(the feature unigram model is effectively a unigramword model the adaptor grammar model is effec-tively phrasal unigram model both are incapable ofgeneralizing about substantially non-local depen-dencies) Additionally the features and grammarsused in prior work reflect certain English-specificdesign considerations (eg syllable structure in thecase of adaptor grammars and phonotactic equiva-lence classes in the feature unigram model) whichmake them questionable models if the goal is to ex-

6437

plore what models and biases enable word discov-ery in general For Chinese the best nonparametricmodels perform better at segmentation (Zhao andKit 2008 Mochihashi et al 2009) but again theyare weaker language models than neural modelsThe neural model of Sun and Deng (2018) is similarto our model without lexical memory or length reg-ularization it obtains 802 F1 on the PKU datasethowever it uses gold segmentation data duringtraining and hyperparameter selection4 whereasour approach requires no gold standard segmenta-tion data

8 Related Work

Learning to discover and represent temporally ex-tended structures in a sequence is a fundamentalproblem in many fields For example in languageprocessing unsupervised learning of multiple lev-els of linguistic structures such as morphemes (Sny-der and Barzilay 2008) words (Goldwater et al2009 Mochihashi et al 2009 Wang et al 2014)and phrases (Klein and Manning 2001) have beeninvestigated Recently speech recognition has ben-efited from techniques that enable the discoveryof subword units (Chan et al 2017 Wang et al2017) however in that work the optimally dis-covered character sequences look quite unlike or-thographic words In fact the model proposed byWang et al (2017) is essentially our model with-out a lexicon or the expected length regularizationie (minusmemory minuslength) which we have shownperforms quite poorly in terms of segmentation ac-curacy Finally some prior work has also sought todiscover lexical units directly from speech basedon speech-internal statistical regularities (Kam-per et al 2016) as well as jointly with ground-ing (Chrupała et al 2017)

9 Conclusion

Word discovery is a fundamental problem in lan-guage acquisition While work studying the prob-lem in isolation has provided valuable insights(showing both what data is sufficient for word dis-covery with which models) this paper shows thatneural models offer the flexibility and performanceto productively study the various facets of the prob-lem in a more unified model While this work uni-fies several components that had previously been

4httpsgithubcomEdward-SunSLMblobd37ad735a7b1d5af430b96677c2ecf37a65f59b7codesrunpyL329

studied in isolation our model assumes access tophonetic categories The development of thesecategories likely interact with the development ofthe lexicon and acquisition of semantics (Feldmanet al 2013 Fourtassi and Dupoux 2014) and thussubsequent work should seek to unify more aspectsof the acquisition problem

AcknowledgmentsWe thank Mark Johnson Sharon Goldwater andEmmanuel Dupoux as well as many colleagues atDeepMind for their insightful comments and sug-gestions for improving this work and the resultingpaper

ReferencesTaylor Berg-Kirkpatrick Alexandre Bouchard-Cocircteacute

John DeNero and Dan Klein 2010 Painless unsu-pervised learning with features In Proc NAACL

Michael R Brent 1999 An efficient probabilisticallysound algorithm for segmentation and word discov-ery Machine Learning 34(1)71ndash105

Michael R Brent and Timothy A Cartwright 1996 Dis-tributional regularity and phonotactic constraints areuseful for segmentation Cognition 61(1)93ndash125

William Chan Yu Zhang Quoc Le and Navdeep Jaitly2017 Latent sequence decompositions In ProcICLR

Grzegorz Chrupała Lieke Gelderloos and Afra Al-ishahi 2017 Representations of language in amodel of visually grounded speech signal

Junyoung Chung Sungjin Ahn and Yoshua Bengio2017 Hierarchical multiscale recurrent neural net-works In Proc ICLR

Jason Eisner 2002 Parameter estimation for proba-bilistic finite-state transducers In Proc ACL

Jeffrey L Elman 1990 Finding structure in time Cog-nitive science 14(2)179ndash211

Thomas Emerson 2005 The second international Chi-nese word segmentation bakeoff In Proc SIGHANWorkshop

Naomi H Feldman Thomas L Griffiths Sharon Gold-water and James L Morgan 2013 A role for thedeveloping lexicon in phonetic category acquisitionPsychological Review 120(4)751ndash778

Abdellah Fourtassi and Emmanuel Dupoux 2014 Arudimentary lexicon and semantics help bootstrapphoneme acquisition In Proc EMNLP

6438

Lieke Gelderloos and Grzegorz Chrupała 2016 Fromphonemes to images levels of representation in a re-current neural model of visually-grounded languagelearning In Proc COLING

Sharon Goldwater Thomas L Griffiths and Mark John-son 2009 A Bayesian framework for word segmen-tation Exploring the effects of context Cognition112(1)21ndash54

Alex Graves 2012 Sequence transduction withrecurrent neural networks arXiv preprintarXiv12113711

Alex Graves 2013 Generating sequences withrecurrent neural networks arXiv preprintarXiv13080850

Mark Johnson Katherine Demuth Michael Frank andBevan K Jones 2010 Synergies in learning wordsand their referents In Proc NIPS

Mark Johnson and Sharon Goldwater 2009 Improvingnonparameteric bayesian inference experiments onunsupervised word segmentation with adaptor gram-mars In Proc NAACL pages 317ndash325

Aacutekos Kaacutedaacuter Marc-Alexandre Cocircteacute Grzegorz Chru-pała and Afra Alishahi 2018 Revisiting the hier-archical multiscale LSTM In Proc COLING

Herman Kamper Aren Jansen and Sharon Goldwater2016 Unsupervised word segmentation and lexi-con induction discovery using acoustic word embed-dings IEEE Transactions on Audio Speech andLanguage Processing 24(4)669ndash679

Kazuya Kawakami Chris Dyer and Phil Blunsom2017 Learning to create and reuse words in open-vocabulary neural language modeling In ProcACL

Diederik Kingma and Jimmy Ba 2015 Adam Amethod for stochastic optimization In Proc ICLR

Dan Klein and Christopher D Manning 2001 Distribu-tional phrase structure induction In Workshop ProcACL

Zhifei Li and Jason Eisner 2009 First-and second-order expectation semirings with applications tominimum-risk training on translation forests InProc EMNLP

Percy Liang and Dan Klein 2009 Online EM for un-supervised models In Proc NAACL

Tsung-Yi Lin Michael Maire Serge Belongie JamesHays Pietro Perona Deva Ramanan Piotr Dollaacuterand C Lawrence Zitnick 2014 Microsoft COCOCommon objects in context In Proc ECCV pages740ndash755

Wang Ling Edward Grefenstette Karl Moritz Her-mann Tomaacuteš Kociskyacute Andrew Senior FuminWang and Phil Blunsom 2017 Latent predictor net-works for code generation In Proc ACL

Gaacutebor Melis Chris Dyer and Phil Blunsom 2018 Onthe state of the art of evaluation in neural languagemodels In Proc ICLR

Sebastian J Mielke and Jason Eisner 2018 Spell oncesummon anywhere A two-level open-vocabularylanguage model In Proc NAACL

Tomas Mikolov Martin Karafiaacutet Lukas Burget JanCernocky and Sanjeev Khudanpur 2010 Recurrentneural network based language model In Proc In-terspeech

Daichi Mochihashi Takeshi Yamada and NaonoriUeda 2009 Bayesian unsupervised word segmen-tation with nested PitmanndashYor language modeling

Steven Pinker 1984 Language learnability and lan-guage development Harvard University Press

Okko Raumlsaumlnen and Heikki Rasilo 2015 A jointmodel of word segmentation and meaning acquisi-tion through cross-situational learning Psychologi-cal Review 122(4)792ndash829

Jenny R Saffran Richard N Aslin and Elissa L New-port 1996 Statistical learning by 8-month-old in-fants Science 274(5294)1926ndash1928

Yikang Shen Zhouhan Lin Chin-Wei Huang andAaron Courville 2018 Neural language modelingby jointly learning syntax and lexicon In ProcICLR

K Simonyan and A Zisserman 2014 Very deep con-volutional networks for large-scale image recogni-tion CoRR abs14091556

Benjamin Snyder and Regina Barzilay 2008 Unsuper-vised multilingual learning for morphological seg-mentation In Proc ACL

Zhiqing Sun and Zhi-Hong Deng 2018 Unsupervisedneural word segmentation for Chinese via segmentallanguage modeling

Yee-Whye Teh Michael I Jordan Matthew J Bealand Daivd M Blei 2006 Hierarchical Dirichlet pro-cesses Journal of the American Statistical Associa-tion 101(476)1566ndash1581

Anand Venkataraman 2001 A statistical model forword discovery in transcribed speech Computa-tional Linguistics 27(3)351ndash372

Chong Wang Yining Wan Po-Sen Huang Abdelrah-man Mohammad Dengyong Zhou and Li Deng2017 Sequence modeling via segmentations InProc ICML

Xiaolin Wang Masao Utiyama Andrew Finch and Ei-ichiro Sumita 2014 Empirical study of unsuper-vised Chinese word segmentation methods for SMTon large-scale corpora

6439

Naiwen Xue Fei Xia Fu-Dong Chiou and MartaPalmer 2005 The Penn Chinese treebank Phrasestructure annotation of a large corpus Natural lan-guage engineering 11(2)207ndash238

Shun-Zheng Yu 2010 Hidden semi-Markov modelsArtificial Intelligence 174(2)215ndash243

Hai Zhao and Chunyu Kit 2008 An empirical com-parison of goodness measures for unsupervised Chi-nese word segmentation with a unified frameworkIn Proc IJCNLP

6440

A Dataset statistics

Table 6 summarizes dataset statistics

B Image Caption Dataset Construction

We use 8000 2000 and 10000 images fortrain development and test set in order of in-teger ids specifying image in cocoapi5 and usefirst annotation provided for each image Wewill make pairs of image id and annotationid available from httpss3eu-west-2amazonawscomk-kawakamisegzip

C SNLM Model Configuration

For each RNN based model we used 512 dimen-sions for the character embeddings and the LSTMshave 512 hidden units All the parameters includ-ing character projection parameters are randomlysampled from uniform distribution from minus008 to008 The initial hidden and memory state of theLSTMs are initialized with zero A dropout rate of05 was used for all but the recurrent connections

To restrict the size of memory we stored sub-strings which appeared F -times in the training cor-pora and tuned F with grid search The maximumlength of subsequencesLwas tuned on the held-outlikelihood using a grid search Tab 7 summarizesthe parameters for each dataset Note that we didnot tune the hyperparameters on segmentation qual-ity to ensure that the models are trained in a purelyunsupervised manner assuming no reference seg-mentations are available

D DPHDP Inference

By integrating out the draws from the DPrsquos it ispossible to do inference using Gibbs sampling di-rectly in the space of segmentation decisions Weuse 1000 iterations with annealing to find an ap-proximation of the MAP segmentation and thenuse the corresponding posterior predictive distribu-tion to estimate the held-out likelihood assignedby the model marginalizing the segmentations us-ing appropriate dynamic programs The evaluatedsegmentation was the most probable segmentationaccording to the posterior predictive distribution

In the original Bayesian segmentation work thehyperparameters (ie α0 α1 and the componentsof p0) were selected subjectively To make com-parison with our neural models fairer we insteadused an empirical approach and set them using the

5httpsgithubcomcocodatasetcocoapi

held-out likelihood of the validation set Howeversince this disadvantages the DPHDP models interms of segmentation we also report the originalresults on the BR corpora

E Learning

The models were trained with the Adam updaterule (Kingma and Ba 2015) with a learning rate of001 The learning rate is divided by 4 if there is noimprovement on development data The maximumnorm of the gradients was clipped at 10

F Evaluation Metrics

Language Modeling We evaluated our modelswith bits-per-character (bpc) a standard evalua-tion metric for character-level language modelsFollowing the definition in Graves (2013) bits-per-character is the average value ofminus log2 p(xt | xltt)over the whole test set

bpc = minus 1

|x|log2 p(x)

where |x| is the length of the corpus in charactersThe bpc is reported on the test set

Segmentation We also evaluated segmentationquality in terms of precision recall and F1 of wordtokens (Brent 1999 Venkataraman 2001 Gold-water et al 2009) To get credit for a word themodels must correctly identify both the left andright boundaries For example if there is a pair ofa reference segmentation and a prediction

Reference do you see a boy

Prediction doyou see a boy

then 4 words are discovered in the prediction wherethe reference has 5 words 3 words in the predictionmatch with the reference In this case we reportscores as precision = 750 (34) recall = 600 (35)and F1 the harmonic mean of precision and recall667 (23) To facilitate comparison with previ-ous work segmentation results are reported on theunion of the training validation and test sets

6441

Sentence Char Types Word Types Characters Average Word Length

Train Valid Test Train Valid Test Train Valid Test Train Valid Test Train Valid Test

BR-text 7832 979 979 30 30 29 1237 473 475 129k 16k 16k 382 406 383BR-phono 7832 978 978 51 51 50 1183 457 462 104k 13k 13k 286 297 283PTB 42068 3370 3761 50 50 48 10000 6022 6049 51M 400k 450k 444 437 441CTB 50734 349 345 160 76 76 60095 1769 1810 31M 18k 22k 484 507 514PKU 17149 1841 1790 90 84 87 52539 13103 11665 26M 247k 241k 493 494 485COCO 8000 2000 10000 50 42 48 4390 2260 5072 417k 104k 520k 400 399 399

Table 6 Summary of Dataset Statistics

max len (L) min freq (F) λ

BR-text 10 10 75e-4BR-phono 10 10 95e-4PTB 10 100 50e-5CTB 5 25 10e-2PKU 5 25 90e-3COCO 10 100 20e-4

Table 7 Hyperparameter values used

Page 7: Learning to Discover, Ground and Use Words with Segmental

6435

mize held-out likelihood Here we observe a simi-lar pattern with the SNLM outperforming the base-line models despite the tasks being quite differentfrom each other and from the BR-phono task

P R F1

BR-text

LSTM surprisal 364 490 417Unigram DP 649 557 600Bigram HDP 525 631 573SNLM 687 789 735

PTB

LSTM surprisal 273 365 312Unigram DP 510 491 500Bigram HDP 348 473 401SNLM 541 601 569

CTB

LSTM surprisal 416 256 317Unigram DP 618 496 550Bigram HDP 673 677 675SNLM 781 815 798

PKU

LSTM surprisal 381 230 287Unigram DP 602 482 536Bigram HDP 668 671 669SNLM 750 712 731

Table 2 Summary of segmentation performance onother corpora

Word Segmentation Qualitative Analysis Weshow some representative examples of segmenta-tions inferred by various models on the BR-text andPKU corpora in Table 3 As reported in Goldwateret al (2009) we observe that the DP models tendto undersegment keep long frequent sequences to-gether (eg they failed to separate articles) HDPsdo successfully prevent oversegmentation how-ever we find that when trained to optimize held-out likelihood they often insert unnecessary bound-aries between words such as yo u Our modelrsquos per-formance is better but it likewise shows a tendencyto oversegment Interestingly we can observe a ten-dency tends to put boundaries between morphemesin morphologically complex lexical items such asdumpty rsquos and go ing Since morphemes are theminimal units that carry meaning in language thissegmentation while incorrect is at least plasuibleTurning to the Chinese examples we see that bothbaseline models fail to discover basic words suchas山间 (mountain) and人们 (human)

Finally we observe that none of the models suc-cessfully segment dates or numbers containing mul-

tiple digits (all oversegment) Since number typestend to be rare they are usually not in the lexiconmeaning our model (and the HDP baselines) mustgenerate them as character sequences

Language Modeling Performance The aboveresults show that the SNLM infers good word seg-mentations We now turn to the question of howwell it predicts held-out data Table 4 summa-rizes the results of the language modeling exper-iments Again we see that SNLM outperformsthe Bayesian models and a character LSTM Al-though there are numerous extensions to LSTMs toimprove language modeling performance LSTMsremain a strong baseline (Melis et al 2018)

One might object that because of the lexiconthe SNLM has many more parameters than thecharacter-level LSTM baseline model Howeverunlike parameters in LSTM recurrence which areused every timestep our memory parameters areaccessed very sparsely Furthermore we observedthat an LSTM with twice the hidden units did notimprove the baseline with 512 hidden units on bothphonemic and orthographic versions of Brent cor-pus but the lexicon could This result suggests morehidden units are useful if the model does not haveenough capacity to fit larger datasets but that thememory structure adds other dynamics which arenot captured by large recurrent networks

Multimodal Word Segmentation Finally wediscuss results on word discovery with non-linguistic context (image) Although there is muchevidence that neural networks can reliably learnto exploit additional relevant context to improvelanguage modeling performance (eg machinetranslation and image captioning) it is still unclearwhether the conditioning context help to discoverstructure in the data We turn to this question hereTable 5 summarizes language modeling and seg-mentation performance of our model and a baselinecharacter-LSTM language model on the COCO im-age caption dataset We use the Elman Entropycriterion to infer the segmentation points from thebaseline LM and the MAP segmentation underour model Again we find our model outperformsthe baseline model in terms of both language mod-eling and word segmentation accuracy Interest-ingly we find while conditioning on image contextleads to reductions in perplexity in both modelsin our model the presence of the image further im-proves segmentation accuracy This suggests that

6436

Examples

BR-text

Reference are you going to make him pretty this morningUnigram DP areyou goingto makehim pretty this morningBigram HDP areyou go ingto make him p retty this mo rn ingSNLM are you go ing to make him pretty this morning

Reference would you like to do humpty dumptyrsquos buttonUnigram DP wouldyoul iketo do humpty dumpty rsquos buttonBigram HDP would youlike to do humptyd umpty rsquos butt onSNLM would you like to do humpty dumpty rsquos button

PKU

Reference 笑声 掌声 欢呼声 在 山间 回荡 勾 起 了 人们 对 往事 的 回忆 Unigram DP 笑声 掌声 欢呼 声 在 山 间 回荡 勾 起了 人们对 往事 的 回忆 Bigram HDP 笑 声 掌声 欢 呼声 在 山 间 回 荡 勾 起了 人 们对 往事 的 回忆 SNLM 笑声 掌声 欢呼声 在 山间 回荡 勾起 了 人们 对 往事 的 回忆

Reference 不得 在 江河 电缆 保护区 内 抛锚 拖锚 炸鱼 挖沙 Unigram DP 不得 在 江河电缆 保护 区内抛锚 拖锚 炸鱼挖沙 Bigram HDP 不得 在 江 河 电缆 保护 区内 抛 锚拖 锚 炸鱼 挖沙 SNLM 不得 在 江河 电缆 保护区 内 抛锚 拖锚 炸鱼 挖沙

Table 3 Examples of predicted segmentations on English and Chinese

BR-text BR-phono PTB CTB PKU

Unigram DP 233 293 225 616 688Bigram HDP 196 255 180 540 642LSTM 203 262 165 494 620

SNLM 194 254 156 484 589

Table 4 Test language modeling performance (bpc)

our model and its learning mechanism interact withthe conditional context differently than the LSTMdoes

To understand what kind of improvements insegmentation performance the image context leadsto we annotated the tokens in the references withpart-of-speech (POS) tags and compared relativeimprovements on recall between SNLM (minusimage)and SNLM (+image) among the five POS tagswhich appear more than 10000 times We ob-served improvements on ADJ (+45) NOUN(+41) VERB (+31) The improvements onthe categories ADP (+05) and DET (+03)are were more limited The categories where wesee the largest improvement in recall correspondto those that are likely a priori to correlate mostreliably with observable features Thus this resultis consistent with a hypothesis that the lexican issuccessfully acquiring knowledge about how wordsidiosyncratically link to visual features

Segmentation State-of-the-Art The results re-ported are not the best-reported numbers on the En-

bpcdarr P uarr R uarr F1uarr

Unigram DP 223 440 400 419Bigram HDP 168 309 408 351LSTM (minusimage) 155 313 382 344SNLM (minusimage) 152 398 553 463

LSTM (+image) 142 317 391 350SNLM (+image) 138 464 620 531

Table 5 Language modeling (bpc) and segmentationaccuracy on COCO dataset +image indicates that themodel has access to image context

glish phoneme or Chinese segmentation tasks Aswe discussed in the introduction previous work hasfocused on segmentation in isolation from languagemodeling performance Models that obtain bettersegmentations include the adaptor grammars (F1870) of Johnson and Goldwater (2009) and thefeature-unigram model (880) of Berg-Kirkpatricket al (2010) While these results are better in termsof segmentation they are weak language models(the feature unigram model is effectively a unigramword model the adaptor grammar model is effec-tively phrasal unigram model both are incapable ofgeneralizing about substantially non-local depen-dencies) Additionally the features and grammarsused in prior work reflect certain English-specificdesign considerations (eg syllable structure in thecase of adaptor grammars and phonotactic equiva-lence classes in the feature unigram model) whichmake them questionable models if the goal is to ex-

6437

plore what models and biases enable word discov-ery in general For Chinese the best nonparametricmodels perform better at segmentation (Zhao andKit 2008 Mochihashi et al 2009) but again theyare weaker language models than neural modelsThe neural model of Sun and Deng (2018) is similarto our model without lexical memory or length reg-ularization it obtains 802 F1 on the PKU datasethowever it uses gold segmentation data duringtraining and hyperparameter selection4 whereasour approach requires no gold standard segmenta-tion data

8 Related Work

Learning to discover and represent temporally ex-tended structures in a sequence is a fundamentalproblem in many fields For example in languageprocessing unsupervised learning of multiple lev-els of linguistic structures such as morphemes (Sny-der and Barzilay 2008) words (Goldwater et al2009 Mochihashi et al 2009 Wang et al 2014)and phrases (Klein and Manning 2001) have beeninvestigated Recently speech recognition has ben-efited from techniques that enable the discoveryof subword units (Chan et al 2017 Wang et al2017) however in that work the optimally dis-covered character sequences look quite unlike or-thographic words In fact the model proposed byWang et al (2017) is essentially our model with-out a lexicon or the expected length regularizationie (minusmemory minuslength) which we have shownperforms quite poorly in terms of segmentation ac-curacy Finally some prior work has also sought todiscover lexical units directly from speech basedon speech-internal statistical regularities (Kam-per et al 2016) as well as jointly with ground-ing (Chrupała et al 2017)

9 Conclusion

Word discovery is a fundamental problem in lan-guage acquisition While work studying the prob-lem in isolation has provided valuable insights(showing both what data is sufficient for word dis-covery with which models) this paper shows thatneural models offer the flexibility and performanceto productively study the various facets of the prob-lem in a more unified model While this work uni-fies several components that had previously been

4httpsgithubcomEdward-SunSLMblobd37ad735a7b1d5af430b96677c2ecf37a65f59b7codesrunpyL329

studied in isolation our model assumes access tophonetic categories The development of thesecategories likely interact with the development ofthe lexicon and acquisition of semantics (Feldmanet al 2013 Fourtassi and Dupoux 2014) and thussubsequent work should seek to unify more aspectsof the acquisition problem

AcknowledgmentsWe thank Mark Johnson Sharon Goldwater andEmmanuel Dupoux as well as many colleagues atDeepMind for their insightful comments and sug-gestions for improving this work and the resultingpaper

ReferencesTaylor Berg-Kirkpatrick Alexandre Bouchard-Cocircteacute

John DeNero and Dan Klein 2010 Painless unsu-pervised learning with features In Proc NAACL

Michael R Brent 1999 An efficient probabilisticallysound algorithm for segmentation and word discov-ery Machine Learning 34(1)71ndash105

Michael R Brent and Timothy A Cartwright 1996 Dis-tributional regularity and phonotactic constraints areuseful for segmentation Cognition 61(1)93ndash125

William Chan Yu Zhang Quoc Le and Navdeep Jaitly2017 Latent sequence decompositions In ProcICLR

Grzegorz Chrupała Lieke Gelderloos and Afra Al-ishahi 2017 Representations of language in amodel of visually grounded speech signal

Junyoung Chung Sungjin Ahn and Yoshua Bengio2017 Hierarchical multiscale recurrent neural net-works In Proc ICLR

Jason Eisner 2002 Parameter estimation for proba-bilistic finite-state transducers In Proc ACL

Jeffrey L Elman 1990 Finding structure in time Cog-nitive science 14(2)179ndash211

Thomas Emerson 2005 The second international Chi-nese word segmentation bakeoff In Proc SIGHANWorkshop

Naomi H Feldman Thomas L Griffiths Sharon Gold-water and James L Morgan 2013 A role for thedeveloping lexicon in phonetic category acquisitionPsychological Review 120(4)751ndash778

Abdellah Fourtassi and Emmanuel Dupoux 2014 Arudimentary lexicon and semantics help bootstrapphoneme acquisition In Proc EMNLP

6438

Lieke Gelderloos and Grzegorz Chrupała 2016 Fromphonemes to images levels of representation in a re-current neural model of visually-grounded languagelearning In Proc COLING

Sharon Goldwater Thomas L Griffiths and Mark John-son 2009 A Bayesian framework for word segmen-tation Exploring the effects of context Cognition112(1)21ndash54

Alex Graves 2012 Sequence transduction withrecurrent neural networks arXiv preprintarXiv12113711

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Mark Johnson, Katherine Demuth, Michael Frank, and Bevan K. Jones. 2010. Synergies in learning words and their referents. In Proc. NIPS.

Mark Johnson and Sharon Goldwater. 2009. Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. In Proc. NAACL, pages 317–325.

Ákos Kádár, Marc-Alexandre Côté, Grzegorz Chrupała, and Afra Alishahi. 2018. Revisiting the hierarchical multiscale LSTM. In Proc. COLING.

Herman Kamper, Aren Jansen, and Sharon Goldwater. 2016. Unsupervised word segmentation and lexicon induction discovery using acoustic word embeddings. IEEE Transactions on Audio, Speech and Language Processing, 24(4):669–679.

Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2017. Learning to create and reuse words in open-vocabulary neural language modeling. In Proc. ACL.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. ICLR.

Dan Klein and Christopher D. Manning. 2001. Distributional phrase structure induction. In Workshop Proc. ACL.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proc. EMNLP.

Percy Liang and Dan Klein. 2009. Online EM for unsupervised models. In Proc. NAACL.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Proc. ECCV, pages 740–755.

Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kociský, Andrew Senior, Fumin Wang, and Phil Blunsom. 2017. Latent predictor networks for code generation. In Proc. ACL.

Gábor Melis, Chris Dyer, and Phil Blunsom. 2018. On the state of the art of evaluation in neural language models. In Proc. ICLR.

Sebastian J. Mielke and Jason Eisner. 2018. Spell once, summon anywhere: A two-level open-vocabulary language model. In Proc. NAACL.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Proc. Interspeech.

Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman–Yor language modeling.

Steven Pinker. 1984. Language learnability and language development. Harvard University Press.

Okko Räsänen and Heikki Rasilo. 2015. A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122(4):792–829.

Jenny R. Saffran, Richard N. Aslin, and Elissa L. Newport. 1996. Statistical learning by 8-month-old infants. Science, 274(5294):1926–1928.

Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron Courville. 2018. Neural language modeling by jointly learning syntax and lexicon. In Proc. ICLR.

K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

Benjamin Snyder and Regina Barzilay. 2008. Unsupervised multilingual learning for morphological segmentation. In Proc. ACL.

Zhiqing Sun and Zhi-Hong Deng. 2018. Unsupervised neural word segmentation for Chinese via segmental language modeling.

Yee-Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. 2006. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Anand Venkataraman. 2001. A statistical model for word discovery in transcribed speech. Computational Linguistics, 27(3):351–372.

Chong Wang, Yining Wan, Po-Sen Huang, Abdelrahman Mohammad, Dengyong Zhou, and Li Deng. 2017. Sequence modeling via segmentations. In Proc. ICML.

Xiaolin Wang, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. 2014. Empirical study of unsupervised Chinese word segmentation methods for SMT on large-scale corpora.

Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Marta Palmer. 2005. The Penn Chinese treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Shun-Zheng Yu. 2010. Hidden semi-Markov models. Artificial Intelligence, 174(2):215–243.

Hai Zhao and Chunyu Kit. 2008. An empirical comparison of goodness measures for unsupervised Chinese word segmentation with a unified framework. In Proc. IJCNLP.


A Dataset statistics

Table 6 summarizes the dataset statistics.

B Image Caption Dataset Construction

We use 8,000, 2,000, and 10,000 images for the train, development, and test sets, taken in order of the integer ids that identify images in cocoapi,⁵ and we use the first annotation provided for each image. We will make the pairs of image id and annotation id available from https://s3.eu-west-2.amazonaws.com/k-kawakami/seg.zip.

⁵ https://github.com/cocodataset/cocoapi
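A sketch of one plausible reading of this procedure with pycocotools is given below; the annotation file name and the tie-breaking by lowest annotation id are our assumptions, not details taken from the released data.

```python
from pycocotools.coco import COCO

# Hypothetical annotation file; the split is defined purely by ascending
# integer image ids, as described above.
coco = COCO("annotations/captions_train2014.json")

img_ids = sorted(coco.getImgIds())
train_ids = img_ids[:8000]
dev_ids = img_ids[8000:10000]
test_ids = img_ids[10000:20000]

def first_caption(image_id):
    """Return the first (lowest-id) caption annotation for an image."""
    ann_ids = sorted(coco.getAnnIds(imgIds=[image_id]))
    return coco.loadAnns(ann_ids)[0]["caption"]

# Pairs of (image id, caption) used as conditioning context and target text.
train_pairs = [(i, first_caption(i)) for i in train_ids]
```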

C SNLM Model Configuration

For each RNN-based model, we used 512 dimensions for the character embeddings, and the LSTMs have 512 hidden units. All the parameters, including the character projection parameters, are randomly sampled from a uniform distribution from −0.08 to 0.08. The initial hidden and memory states of the LSTMs are initialized with zero. A dropout rate of 0.5 was used for all but the recurrent connections.
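For concreteness, a minimal PyTorch sketch of this character-level configuration follows; it is illustrative only (class and variable names are ours, and it shows the embedding, LSTM, initialization, and dropout choices above rather than the released SNLM implementation).

```python
import torch.nn as nn

class CharLSTM(nn.Module):
    """Character-level LSTM with the configuration described above."""
    def __init__(self, vocab_size, dim=512, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)       # 512-dim char embeddings
        self.lstm = nn.LSTM(dim, dim, batch_first=True)  # 512 hidden units
        self.drop = nn.Dropout(dropout)                  # non-recurrent connections only
        self.proj = nn.Linear(dim, vocab_size)           # character projection
        for p in self.parameters():                      # uniform init in [-0.08, 0.08]
            nn.init.uniform_(p, -0.08, 0.08)

    def forward(self, chars, state=None):
        # Hidden and memory states default to zeros when `state` is None.
        h, state = self.lstm(self.drop(self.embed(chars)), state)
        return self.proj(self.drop(h)), state
```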

To restrict the size of the memory, we stored substrings which appeared at least F times in the training corpora and tuned F with grid search. The maximum length of subsequences L was tuned on the held-out likelihood using a grid search. Tab. 7 summarizes the parameters for each dataset. Note that we did not tune the hyperparameters on segmentation quality, to ensure that the models are trained in a purely unsupervised manner, assuming no reference segmentations are available.
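A minimal sketch of this restriction is shown below; function and variable names are ours, since the exact substring-extraction code is not specified.

```python
from collections import Counter

def build_lexical_memory(corpus, max_len, min_freq):
    """Collect character substrings of length <= max_len that occur at least
    min_freq times in the training corpus (the memory-size restriction above)."""
    counts = Counter()
    for sentence in corpus:              # each sentence is a string of characters
        n = len(sentence)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                counts[sentence[i:j]] += 1
    return {segment for segment, c in counts.items() if c >= min_freq}

# e.g. for PTB, Table 7 gives max_len=10 and min_freq=100:
# memory = build_lexical_memory(train_sentences, max_len=10, min_freq=100)
```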

D DP/HDP Inference

By integrating out the draws from the DPs, it is possible to do inference using Gibbs sampling directly in the space of segmentation decisions. We use 1,000 iterations with annealing to find an approximation of the MAP segmentation, and then use the corresponding posterior predictive distribution to estimate the held-out likelihood assigned by the model, marginalizing the segmentations using appropriate dynamic programs. The evaluated segmentation was the most probable segmentation according to the posterior predictive distribution.
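For reference, after the DP is integrated out, the posterior predictive that scores a candidate word $w$ in the unigram model takes the standard Chinese-restaurant-process form used by Goldwater et al. (2009); the bigram HDP conditions analogously on the preceding word type:

$$p(w \mid \mathbf{w}_{-}) = \frac{n^{-}_{w} + \alpha_0\, p_0(w)}{n^{-} + \alpha_0},$$

where $n^{-}_{w}$ is the count of $w$ among the other word tokens in the current segmentation and $n^{-}$ is their total number.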

In the original Bayesian segmentation work, the hyperparameters (i.e., α0, α1, and the components of p0) were selected subjectively. To make the comparison with our neural models fairer, we instead used an empirical approach and set them using the held-out likelihood of the validation set. However, since this disadvantages the DP/HDP models in terms of segmentation, we also report the original results on the BR corpora.

E Learning

The models were trained with the Adam update rule (Kingma and Ba, 2015) with a learning rate of 0.01. The learning rate is divided by 4 if there is no improvement on development data. The maximum norm of the gradients was clipped at 1.0.
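A minimal training-loop sketch of this schedule follows, assuming a `model` that returns a per-batch negative log-likelihood and a `dev_bpc` evaluation routine; both interfaces, and the epoch budget, are hypothetical.

```python
import torch

def train(model, train_batches, dev_bpc, max_epochs=50):
    """Sketch of the optimization schedule in Appendix E."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # learning rate from the text
    best = float("inf")
    for _ in range(max_epochs):
        for batch in train_batches:
            loss = model(batch)                    # negative log-likelihood
            optimizer.zero_grad()
            loss.backward()
            # clip the maximum gradient norm (value as stated above)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
        dev = dev_bpc(model)                       # held-out bits-per-character
        if dev >= best:                            # no improvement: lr <- lr / 4
            for group in optimizer.param_groups:
                group["lr"] /= 4
        best = min(best, dev)
    return model
```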

F Evaluation Metrics

Language Modeling. We evaluated our models with bits-per-character (bpc), a standard evaluation metric for character-level language models. Following the definition in Graves (2013), bits-per-character is the average value of $-\log_2 p(x_t \mid \boldsymbol{x}_{<t})$ over the whole test set,

$$\mathrm{bpc} = -\frac{1}{|\boldsymbol{x}|}\log_2 p(\boldsymbol{x}),$$

where $|\boldsymbol{x}|$ is the length of the corpus in characters. The bpc is reported on the test set.
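As a small helper for this conversion (names and the nats-based input are our assumptions), the metric can be computed from a corpus log-likelihood as follows.

```python
import math

def bits_per_character(log_prob_nats, num_chars):
    """bpc = -(1/|x|) * log2 p(x); log_prob_nats is the natural-log
    probability the model assigns to the whole character sequence."""
    return -log_prob_nats / (num_chars * math.log(2))

# Example: a 500,000-character test set scored at ln p(x) = -700,000 nats
# gives bits_per_character(-700_000, 500_000) ~= 2.02 bpc.
```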

Segmentation. We also evaluated segmentation quality in terms of precision, recall, and F1 of word tokens (Brent, 1999; Venkataraman, 2001; Goldwater et al., 2009). To get credit for a word, the models must correctly identify both the left and right boundaries. For example, consider the following pair of a reference segmentation and a prediction:

Reference: do you see a boy
Prediction: doyou see a boy

Here 4 words are discovered in the prediction, where the reference has 5 words, and 3 words in the prediction match the reference. In this case we report scores as precision = 75.0 (3/4), recall = 60.0 (3/5), and F1, the harmonic mean of precision and recall, = 66.7 (2/3). To facilitate comparison with previous work, segmentation results are reported on the union of the training, validation, and test sets.
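This token-level scoring can be made precise by converting each segmentation into character-offset spans and counting exact matches; the sketch below (with names of our choosing) reproduces the worked example above.

```python
def token_spans(words):
    """Convert a segmentation (list of words) to character-offset spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def segmentation_prf(reference, prediction):
    """Token precision/recall/F1: a token counts only if both of its
    boundaries match the reference segmentation."""
    ref, pred = token_spans(reference), token_spans(prediction)
    correct = len(ref & pred)
    p = correct / len(pred)
    r = correct / len(ref)
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

# segmentation_prf("do you see a boy".split(), "doyou see a boy".split())
# -> (0.75, 0.6, 0.666...)
```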


             Sentences               Char Types        Word Types              Characters             Average Word Length
             Train   Valid   Test    Train Valid Test  Train   Valid   Test    Train   Valid   Test   Train  Valid  Test
BR-text      7832    979     979     30    30    29    1237    473     475     129k    16k     16k    3.82   4.06   3.83
BR-phono     7832    978     978     51    51    50    1183    457     462     104k    13k     13k    2.86   2.97   2.83
PTB          42068   3370    3761    50    50    48    10000   6022    6049    5.1M    400k    450k   4.44   4.37   4.41
CTB          50734   349     345     160   76    76    60095   1769    1810    3.1M    18k     22k    4.84   5.07   5.14
PKU          17149   1841    1790    90    84    87    52539   13103   11665   2.6M    247k    241k   4.93   4.94   4.85
COCO         8000    2000    10000   50    42    48    4390    2260    5072    417k    104k    520k   4.00   3.99   3.99

Table 6: Summary of dataset statistics.

             max len (L)   min freq (F)   λ
BR-text      10            10             7.5e-4
BR-phono     10            10             9.5e-4
PTB          10            100            5.0e-5
CTB          5             25             1.0e-2
PKU          5             25             9.0e-3
COCO         10            100            2.0e-4

Table 7: Hyperparameter values used.

