Children’s Production of Unfamiliar Word Sequences Is Predicted by Positional Variability and Latent Classes in a
Large Sample of Child-Directed Speech
Danielle Matthews,a Colin Bannardb
aDepartment of Psychology, University of Sheffield; bDepartment of Linguistics, University of Texas at Austin
Received 23 May 2009; received in revised form 18 November 2009; accepted 23 November 2009
Abstract
We explore whether children’s willingness to produce unfamiliar sequences of words reflects their
experience with similar lexical patterns. We asked children to repeat unfamiliar sequences that were
identical to familiar phrases (e.g., A piece of toast) but for one word (e.g., a novel instantiation of
A piece of X, like A piece of brick). We explore two predictions—motivated by findings in the statistical
learning literature—that children are likely to have detected an opportunity to substitute alternative
words into the final position of a four-word sequence if (a) it is difficult to predict the fourth
word given the first three words and (b) the words observed in the final position are distributionally
similar. Twenty-eight 2-year-olds and thirty-one 3-year-olds were significantly more likely to cor-
rectly repeat unfamiliar variants of patterns for which these properties held. The results illustrate
how children’s developing language is shaped by linguistic experience.
Keywords: Cognitive development; Language acquisition; Statistical learning; Syntax; Corpus
analysis; Information theory; Latent classes; Usage-based models of language
1. Introduction
Faced with a stream of speech sounds and gestures, most infants begin to identify the
units of their language and discover the potential for recombining them within the first
2 years. Quite how this is achieved is one of the most challenging questions in cognitive sci-
ence. In the last decade, a very large literature has explored a number of skills that might be
useful. It has been reported that children can use basic ‘‘statistical learning’’ mechanisms to
take such crucial developmental steps as segmenting the input into ‘‘word-like’’ units
Correspondence should be sent to Danielle Matthews, Department of Psychology, University of Sheffield,
Western Bank, Sheffield S10 2TP United Kingdom. E-mail: [email protected]
Cognitive Science 34 (2010) 465–488
Copyright © 2010 Cognitive Science Society, Inc. All rights reserved.
ISSN: 0364-0213 print / 1551-6709 online
DOI: 10.1111/j.1551-6709.2009.01091.x
(e.g., Saffran, Aslin, & Newport, 1996), assigning sounds to ‘‘categories’’ based on their
co-occurrence with other sounds (Gomez & Lakusta, 2004) and identifying nonadjacent
dependencies (Gomez, 2002; Gomez & Maye, 2005). This research has been conducted
using artificial stimuli—sequences of meaning-free sounds from which the infants are able
to extract language-like structure using simple pattern detection. The use of such artificial
stimuli is valuable in isolating specific input characteristics and learning mechanisms. How-
ever, it remains unclear whether these same mechanisms would be at work in a natural
learning context. Natural language is of course far noisier than artificial stimuli and rarely
displays patterns or statistical structure with the same clear consistency. Crucially, while
infants seem to be able to observe patterns in synthetic data from a very young age, it is not
clear that they are able to utilize these skills in communicative contexts until sometime later
in development. There is thus some work to be done to bridge the gap between these
extremely valuable findings and real language development (see Pelucchi, Hay, & Saffran,
2009, and Johnson & Tyler, in press, on word segmentation in natural language).
In this paper, we report on a study that examines children’s grammar learning by per-
forming a statistical analysis of a large sample of real input data and using this to make pre-
dictions about children’s ability to produce particular sequences of words in a sentence
repetition task. The sentence repetition task allows us to test young children, on the cusp of
multiword speech, with a procedure that has been tried and tested by many researchers from
differing theoretical backgrounds (e.g., Bannard & Matthews, 2008; Potter & Lombardi,
1990; Valian & Aubry, 2005). Using real English of course has some disadvantages, namely
that it can be challenging to find sufficient stimuli (where the properties of interest are
uncorrelated) while also controlling for other factors that would be presumed to affect pro-
duction (e.g., word frequency, phonological complexity). However, we think that it is a vital
complement to the artificial grammar learning work, and one of our objectives in this study
is to show that it is possible to control for many potential confounds via computational anal-
ysis of the input data and the use of appropriate methods for statistical analysis of the chil-
dren’s responses.
The aim of this study is to test whether the detailed statistics of the input are reflected in
children’s developing grammatical representations. We asked children to repeat unfamiliar
sequences of words that were identical to familiar phrases but for one word (e.g., a novel
instantiation of a frequent pattern like A piece of X, such as A piece of brick). These variants
were unattested in a large child language corpus and thus likely to be novel to most young
children or, at the least, unpracticed. We hypothesized that children’s ability to repeat such
unattested sequences would reflect their exposure to the relevant pattern in the given lexical
form. We thus rely on the assumption that children build lexically specific representations.
This assumption has been supported in a recent study (Bannard & Matthews, 2008) where
we found that 2- and 3-year-old children were significantly better at repeating the shared
first three words of frequently occurring multiword sequences than matched, infrequent
sequences (e.g., better at repeating ‘‘sit in your’’ when saying ‘‘sit in your chair’’ than when
saying ‘‘sit in your truck’’). It is worth noting that lexical patterns of the kind we are study-
ing here have been given a central role in so-called usage-based theories of development
(e.g., Tomasello, 2003; Goldberg, 2006), where they are sometimes referred to as
‘‘constructions.’’ Because of the long history of the term construction in the linguistic
literature and some minor differences in how the term is applied even within the usage-
based literature, we prefer to use the terms schema or pattern in this article, but we
nonetheless consider the phenomenon we are discussing as entirely consistent with such
an approach.
So how might the statistics of the input affect children’s ability to produce unfamiliar
sequences of words that are similar to well-known phrases? One recurrent idea in the
literature on the learning of linguistic patterns is that children will be affected by what
has been called type frequency. The idea here is that children will identify a pattern in
the input where some invariant structure is combined with a wide range of other mate-
rial. For example, Gomez (2002) found that the ability of 18-month-olds to detect a
nonadjacent dependency between two sounds was predicted by the extent to which the
intervening element was varied in the artificial language they were exposed to. This idea
has also been popular in the study of morphology and its development (e.g., Bybee,
1985; Kempe, Brooks, Mironova, Pershukova, & Fedorova, 2007). Similar mechanisms
have been proposed for the learning of basic lexical patterns of the kind we are discuss-
ing here (e.g., Braine, 1976; Edelman, 2007; Freudenthal, Pine, Aguado-Orea, & Gobet,
2007; Lieven, Pine, & Baldwin, 1997; Pine & Lieven, 1997). Tomasello (2003) has
argued that children form the most basic of productive constructions through a process
of schematization. This is achieved when children hear repeated uses of one form (e.g.,
‘‘Throw’’) along with varied use of another form (e.g., noun phrases referring to what-
ever is thrown: ‘‘Throw the ball,’’ ‘‘Throw teddy,’’ and ‘‘Throw your bottle’’) in similar
contexts. The outcome is a linguistic construction that contains a minimum of one lexi-
cal item and one ‘‘slot’’ (Throw X).
Type frequency can thus be used to quantify how appropriate it is to generalize over a set
of similar utterances. One problem with type frequency, however, is that it does not take
into account the frequency distribution of the words filling a given slot. For example, if a
child hears the sequence ‘‘Throw your bottle’’ 118 times and ‘‘Throw the ball’’ and
‘‘Throw teddy’’ only once each, then we might not expect the same degree of productivity
with a potential ‘‘Throw X’’ construction as if all three sequences had been heard 40 times
each (although the type frequency would have been three in both cases). In the former
‘‘unequal’’ case, the child will always expect to hear ‘‘your bottle’’ after ‘‘throw’’ and thus
might not detect any potential for productivity. In the latter ‘‘equal’’ case, the child will be
uncertain as to which of three possible options will occur and therefore might be more likely
to form a productive slot. The intuitive difference between these situations can be quantified
with a measure of the entropy (Shannon & Weaver, 1949) of the slot, an index of the uncer-
tainty about which of all the possible words that could fill a slot is most likely to occur (see
also Hale, 2006; Keller, 2004; Levy, 2008; Moscoso del Prado Martín, Kostić, & Baayen,
2004; Moscoso del Prado Martín, Kostić, & Filipović-Djurdjević, unpublished data). This
entropy, which we will refer to as slot entropy, can be calculated as follows, where X is a
slot, each x is a word that appears in that slot, and p(x) is the probability of seeing each x in
that position:

H(X) = -\sum_{x \in X} p(x) \log_2 p(x)

In the above example, then, the entropy in the unequal case is 0.14 and in the equal case it
is 1.58.
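The slot entropy calculation for the two hypothetical ''Throw X'' distributions discussed above can be sketched as follows (a minimal illustration using the counts given in the text; these are not corpus values):

```python
import math

def slot_entropy(counts):
    """Shannon entropy (in bits) over the words filling a slot,
    given raw frequency counts for each filler word."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# The "unequal" case: "your bottle" heard 118 times,
# "the ball" and "teddy" once each.
unequal = slot_entropy({"your bottle": 118, "the ball": 1, "teddy": 1})

# The "equal" case: all three fillers heard 40 times each.
equal = slot_entropy({"your bottle": 40, "the ball": 40, "teddy": 40})

print(round(unequal, 2))  # 0.14
print(round(equal, 2))    # 1.58
```

Note that type frequency is three in both cases; only the entropy distinguishes them.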
Following the same reasoning as for type frequency, children should be more competent at
producing an unfamiliar sequence when it is an instantiation of a pattern for which a con-
crete alternative is maximally unpredictable (a pattern with high slot entropy). For example,
given two highly frequent utterances, ‘‘Back in the box’’ and ‘‘Let’s have a look,’’ that dif-
fer in the slot entropy for the final word position (in the corpus we used, the slot entropy for
‘‘Back in the X’’ was 5.31, for ‘‘Let’s have a X’’ it was 1.24), children should be more
likely to accept unfamiliar versions of the sequence that has the greater slot entropy than the
sequence with lower slot entropy (e.g., the unfamiliar sequence ‘‘Back in the town’’ should
be easier to produce than the unfamiliar sequence ‘‘Let’s have a think’’). Thus, the degree
to which children will be willing to extract and utilize an invariant pattern will depend on
the entropy of its slot(s).
We predict, then, that a child will extract a productive pattern (identify a frame and a slot)
where there is high entropy. However, the problem is not as simple as determining where
there is and is not a slot. Children also face the problem of predicting what is allowed to
appear there—forming expectations about not only the exact words seen in a particular
position but also the kind of words to be seen. That is, children should have expectations
concerning whether a given word or phrase will be seen in a particular position based on its
similarity to the words that have been seen there before. Our target sequences were designed
to investigate the effect of latent classes—grouping of similar words—on children’s devel-
oping knowledge. The idea that speakers have knowledge about how words are similar to
other words is of course very widely accepted in linguistic theory—it is the basis for syntac-
tic categories. How exactly they determine this similarity is, however, not so clear. One way
in which words are similar to other words is in the similarity of the words or concepts to
which they are used to refer. However, although we know that human infants are remarkably
good at generalizing across stimuli that are similar (e.g., Shepard, 1987), gauging effects of
semantic similarity is notoriously difficult because of the lack of a widely accepted theory
of mental representation and semantic cognition.
Another way in which words display similarities and dissimilarities is in their distribution
relative to other words (Harris, 1964). Learners also seem to be able to exploit this informa-
tion. For example, it has been shown in an artificial grammar learning study (Gomez &
Lakusta, 2004) that children are able to infer similarity between words from the contexts in
which they occur (see also Monaghan & Christiansen, 2008 for an extensive investigation
of how children might cluster words together using a number of probabilistic phonological
cues).
In this study, we do not attempt to distinguish between these two sources of similarity.
We employ distributional information and operationalize similarity between words by cal-
culating the overlap in their contexts as they occur in a corpus of child-directed speech.
However, we cannot be sure whether this measure is the basis that children use to infer simi-
larity. It has long been acknowledged that distributional and semantic similarity are likely to
be highly intercorrelated, and that words that have similar meanings will occur in similar
contexts (see Landauer & Dumais, 1997 for a broad overview of the distributional approach
to meaning). In this experiment, we are concerned simply with whether the children exploit
the similarity in inferring lexical patterns from the input, and not with the origin of their
detection of that similarity.
Our second prediction, then, is that children will be more likely to detect the potential for
productivity in a four-word sequence and be better at repeating novel instantiations of it
when the relevant position has tended to be filled with (semantically or distributionally) sim-
ilar items. We measure the similarity of the items that have been seen to go into particular
slots by looking at how similar the contexts in which they appear are. For all words found in
our slots we look at the words that occur two words before and two words after the item in a
large corpus of child-directed speech. We record the number of times that each word in the
vocabulary occurs within this window. This then gives us a co-occurrence vector for each
word, with each entry in the vector representing a dimension in a multidimensional space
(where the dimensions are the vocabulary of the language). The similarity between any two
words is then taken to be the cosine of the angle between those two vectors (a value between
0 and 1 with higher values indicating greater similarity). In order to calculate the overall
cohesiveness of a slot (i.e., the homogeneity or the semantic density of the words previously
seen to fill it), we obtained the mean pairwise distance of each word that occurred in that
slot from each other word that occurred there. We call this measure slot semantic density
and calculate it for the final position slot, X, of each sequence containing N different words
as follows:

\mathrm{SemanticDensity}(X) = \frac{1}{N^2 - N} \sum_{x \in X} \sum_{\substack{y \in X \\ y \neq x}} \cos(x, y)
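The pipeline just described (±2-word co-occurrence vectors, cosine similarity, mean pairwise similarity over slot fillers) can be sketched as follows. The three-utterance mini-corpus is invented for illustration; the study used a 1.72-million-word corpus:

```python
import math
from collections import Counter

def cooccurrence_vectors(corpus, targets, window=2):
    """For each target word, count the words occurring within `window`
    positions of it; returns one Counter (sparse vector) per target."""
    vectors = {t: Counter() for t in targets}
    for sent in corpus:
        for i, w in enumerate(sent):
            if w in vectors:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vectors[w][sent[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors (0 to 1)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def semantic_density(vectors):
    """Mean pairwise cosine similarity over the words seen in a slot."""
    words = list(vectors)
    n = len(words)
    total = sum(cosine(vectors[x], vectors[y])
                for x in words for y in words if x != y)
    return total / (n * n - n)

# Invented child-directed mini-corpus (illustration only).
corpus = [s.split() for s in [
    "put it back in the box now",
    "put it back in the case now",
    "it is back in the cupboard again",
]]
vecs = cooccurrence_vectors(corpus, {"box", "case", "cupboard"})
print(semantic_density(vecs))
```

Here ''box'' and ''case'' share identical contexts (cosine 1.0), while ''cupboard'' overlaps only partially, so the density falls between those extremes.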
If children are sensitive to the semantic density of a slot, then they might find it easier to
produce unfamiliar versions of a four-word sequence if the final slot has both high entropy
and high semantic density. For example, given two highly frequent utterances with high slot
entropy, ‘‘Back in the box’’ and ‘‘A piece of toast,’’ that differ in the semantic density for
the final word position (for ‘‘Back in the X’’ the semantic density is 0.63, for ‘‘A piece of
X’’ the semantic density is 0.39), children might be more likely to accept unfamiliar ver-
sions of the sequence that has the greater semantic density than the sequence with lower
semantic density.
Of course, whether such an effect of semantic density holds may depend on the nature of
the final word in the unfamiliar sequence. Thus, variants of ‘‘Back in the box’’ might only
be easy to repeat if the final word is semantically similar to other words attested in that slot
(e.g., containers like ‘‘case’’ or ‘‘fridge’’). In order to test this we selected items that were
semantically similar (‘‘case’’) or dissimilar (‘‘town’’) to words seen in the relevant position
in the corpus (see the Method section for details). We refer to the former kind of word as
semantically ‘‘typical’’ and the latter kind as semantically ‘‘atypical.’’
We thus predicted that unfamiliar sequences would be easier to repeat if they were ver-
sions of a construction with high slot entropy and high semantic density and if the final word
were semantically typical for that slot. Our predictions for these unfamiliar sequences rest
on the expectation that the child should not have often uttered them before (if at all) and that
they should accordingly be processed as generalizations. To further investigate this pro-
posal, we also tested familiar sequences that could in principle be retrieved directly from
memory. Our predictions here were more speculative. We have previously found (Bannard
& Matthews, 2008) that children are better at repeating sequences of words that they have
frequently encountered before, and it is not clear how having formed a generalization over
similar sequences might affect their facility with such familiar instances. One might predict
that highly frequent stored sequences will be unaffected by the presence of related items.
On the other hand, the possibility of generalizations might actually inhibit the production of
familiar word sequences, so that high-frequency items that instantiate low-entropy patterns
might be expected to be more fluently produced than their high-entropy counterparts.1 The
effect of semantic density on well-integrated familiar sequences could also plausibly be ben-
eficial or detrimental, as having many semantically similar neighbors could presumably
either inhibit or enhance production of the sequence (cf. Magnuson, Mirman, & Strauss,
2007). Note that, as the final word of a familiar sequence is likely to be semantically typical,
we did not vary this factor for familiar sequences.
To summarize, in the current study we analyzed the properties (slot entropy and semantic
density) of four-word schemas that had a lexically specified three-word stem plus a final slot
(we henceforth refer to these as schemas) as observed in a large database of British English
child-directed speech. We tested how these properties affected children’s ability to repro-
duce unfamiliar variants of these schemas and checked whether these effects were mediated
by the semantics of the final word in the unfamiliar target. We also checked whether these
same properties would affect the repetition of familiar sequences (although these could not
be fully matched to the unfamiliar sequences for all control variables). We tested children’s
ability to comprehend and produce the 27 sequences given in Table 1 by playing them
recordings and asking them to repeat them.
2. Method
2.1. Participants
Fifty-nine normally developing, monolingual, British English-speaking children were
included in the study (32 boys). There were twenty-eight 2-year-olds (range 2.3–2.10, mean
age 2.7) and thirty-one 3-year-olds (range 3.1–3.7, mean age 3.4). A further 18 children
were tested but not included because of fussiness or inaudible responding. The children
were tested in a university laboratory in the United Kingdom or in a quiet room in their day
care center.
2.2. Materials and design
The stimuli for each child consisted of nine triplets of four-word sequences.
These sequences were selected using a child language corpus, the largest available to us,
containing the speech directed to one child, Brian, between the ages of 2 and 5 years
recorded in Manchester, UK (Max Planck Child Language Corpus: 1.72 million words of
maternal speech). We chose to look at four-word sequences because previous studies have
demonstrated that these are sufficiently long to elicit variance in participants’ performance
in a repetition task (Bannard & Matthews, 2008; Valian & Aubry, 2005). We extracted all
repeated sequences of words from the corpus using the method described in Yamamoto and
Church (2001) and discarded all sequences that formed a question (as children might be
tempted to answer a question rather than repeat it). Applying this filter meant that our most
frequent item was ‘‘I don’t know what’’ which occurred 260 times (a natural log frequency
of 5.56). Our log frequency range was then taken to be 0–5.56.
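As a rough sketch, the extraction and schema-identification steps can be approximated with plain n-gram counting (the suffix-array method of Yamamoto and Church (2001) is far more efficient on a corpus of this size, but for four-word sequences the output is equivalent; the mini-corpus below is invented for illustration):

```python
from collections import Counter, defaultdict

def repeated_fourgrams(utterances):
    """Count all four-word sequences in a corpus and keep those
    that are repeated (occur more than once)."""
    counts = Counter()
    for utt in utterances:
        words = utt.split()
        for i in range(len(words) - 3):
            counts[tuple(words[i:i + 4])] += 1
    return {ng: c for ng, c in counts.items() if c > 1}

def group_by_stem(fourgrams):
    """Group four-grams by their first three words (the schema stem),
    mapping each stem to the frequency distribution over fourth words."""
    schemas = defaultdict(Counter)
    for (w1, w2, w3, w4), c in fourgrams.items():
        schemas[(w1, w2, w3)][w4] += c
    return schemas

# Invented mini-corpus (illustration only).
utts = ["put it back in the box",
        "put it back in the box",
        "back in the case please",
        "now back in the cupboard"]
fourgrams = repeated_fourgrams(utts)
schemas = group_by_stem(fourgrams)
```

The fourth-word distribution held in each `schemas` entry is exactly what the slot entropy and semantic density measures are computed over.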
We next identified all sequences of four words that began with the same first three words
(had the same schema). We identified all schemas for which at least one instantiation was in
the top two-thirds of the log frequency range (so that we would have at least one familiar
example for later use). For each of these schemas, we calculated the slot entropy and slot
semantic density for the fourth word position, as outlined in the introduction. We then
ordered these schemas according to slot entropy and identified items that spanned the range
of observed values.

Table 1
Stimulus sequences and their properties

Sequence | Familiarity | Slot Entropy | Semantic Density | Typicality of Fourth Word
Out of the water | High | 6.17 | 0.58 | Typical
Out of the liquid | Low | 6.17 | 0.58 | Typical
Out of the pudding | Low | 6.17 | 0.58 | Atypical
Back in the box | High | 5.31 | 0.63 | Typical
Back in the case | Low | 5.31 | 0.64 | Typical
Back in the town | Low | 5.31 | 0.64 | Atypical
A piece of toast | High | 5.16 | 0.39 | Typical
A piece of meat | Low | 5.16 | 0.39 | Typical
A piece of brick | Low | 5.16 | 0.39 | Atypical
Have a nice day | High | 4.37 | 0.46 | Typical
Have a nice hour | Low | 4.37 | 0.46 | Typical
Have a nice meal | Low | 4.37 | 0.46 | Atypical
It’s time for lunch | High | 3.78 | 0.40 | Typical
It’s time for soup | Low | 3.78 | 0.40 | Typical
It’s time for drums | Low | 3.78 | 0.40 | Atypical
A bowl of cornflakes | High | 2.83 | 0.37 | Typical
A bowl of biscuits | Low | 2.83 | 0.37 | Typical
A bowl of flowers | Low | 2.83 | 0.37 | Atypical
What a funny noise | High | 2.11 | 0.46 | Typical
What a funny sound | Low | 2.11 | 0.46 | Typical
What a funny cup | Low | 2.11 | 0.46 | Atypical
You bumped your head | High | 2.10 | 0.60 | Typical
You bumped your leg | Low | 2.10 | 0.60 | Typical
You bumped your toy | Low | 2.10 | 0.60 | Atypical
Let’s have a look | High | 1.24 | 0.46 | Typical
Let’s have a see | Low | 1.23 | 0.46 | Typical
Let’s have a think | Low | 1.23 | 0.46 | Atypical

The second key factor that we wish to explore in this paper is the impact
of semantic density, and thus it was important that we cross this with slot entropy in our
stimuli. For this purpose we put the items into bands of high, medium, and low slot entropy
(for the purposes of item selection only; slot entropy was treated as a continuous variable in
all our analyses) and for each band we selected schemas that spanned the range of possible
semantic density values as much as possible. Our need to meet all of these criteria
meant that we had little freedom in selecting the stimuli. Thus, it was not possible to select
schemas of a particular syntactic type or types. The effect on learning that we hypothesize
the factors of slot entropy and semantic density to have might be expected to interact with
the child’s developing knowledge of syntactic types or categories (they might, for example,
expect differing degrees of semantic flexibility for a slot in a noun phrase than in a verb
phrase). Nonetheless, we would predict that their effect should be seen across syntactic
types. We therefore chose to select the items that maximized the spread of our key predictors,
leaving the impact of syntactic type to be considered in our statistical analysis. The distribution
of the items across our key predictor variables can be seen in Fig. 1. Our items reflect multi-
ple syntactic types. One might, for example, divide the stimulus set into prepositional phrases
(back in the X, out of the X), noun phrases (a bowl of X, a piece of X), and sentences (you
bumped your X, what a funny X, let’s have a X, have a nice X, it’s time for X). While cer-
tain syntactic types appear to cluster together here (e.g., the prepositional phrases), there is
no absolute correlation between syntactic type and our factors. We will later explore the
impact of this grouping (which is detailed again in Appendix B for convenience of refer-
ence) in our data analysis.
Fig. 1. Distribution of test items.
Having identified our schemas, we then obtained, for each schema, one familiar sequence
(seen in the corpus with reasonable frequency) and two unfamiliar sequences (plausible
sequences that were nonetheless unseen in our corpus). The familiar sequence was obtained from
the top two-thirds of the overall log frequency range of four word sequences. However, it is
important to note that it was not always the most frequent instantiation of the schema, and
that the schemas were rarely dominated by any one sequence (the highest frequency
instantiations of each of our schemas accounted for a mean of 36% of instances). On average,
our selected high-frequency items accounted for 31% of the instantiations of the schema.
In order to create two unfamiliar items for each schema, we used the WordNet lexical
database v2.1 (Fellbaum, 1998; WordNet is an IS-A hierarchy, in the sense that an apple IS A
fruit, created in the psychology department at Princeton University that represents semantic
relations between English words) to identify one word that was highly similar to the final
word of the selected familiar sequence and one that was semantically dissimilar from it
(all nouns cited in the appropriate sense in The Oxford English Dictionary, 1989). Within
WordNet, our unseen typical words were in all cases a maximum of five nodes away from
the seen words (the threshold on similar pairs proposed by Hirst & St-Onge, 1998). In two
cases, the unseen word was a direct hypernym of the seen word (water => liquid) or vice
versa (noise => sound). In another two cases, the two words were linked by a direct hyper-
nym of both words (box => container <= case; day => time unit <= hour), and in all other
cases except one (lunch => meal => nutriment <= dish <= soup) they were linked via a node
that was an immediate hypernym of one of the pair (e.g., toast => bread => baked goods =>
food <= meat). We refer to the former, similar items as ‘‘typical’’ and the latter as ‘‘atypi-
cal.’’ In order to verify the typicality or otherwise of these words for each given schema, we
obtained human judgments as to their similarity to the words seen in the schema over the
corpus (see Appendix S3 for details). For all but one of the schemas the typical word was
judged to be more similar (on average) to the items seen in the schema over the corpus than
was the atypical word. Pairs of typical and atypical words were matched for their length in
syllables and, as far as possible, their frequencies (see Appendix A).
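The node-distance criterion can be illustrated with a toy IS-A graph and a breadth-first search for path length. The hierarchy below is a hand-coded fragment mirroring the toast/meat example in the text, not WordNet itself; the study used the real WordNet v2.1 graph:

```python
from collections import deque

# Hand-coded IS-A fragment (child -> parent), for illustration only.
hypernym = {
    "toast": "bread",
    "bread": "baked_goods",
    "baked_goods": "food",
    "meat": "food",
}

def isa_distance(a, b, hypernym):
    """Number of edges between two nodes in the (undirected) IS-A graph,
    found by breadth-first search; None if unconnected."""
    adj = {}
    for child, parent in hypernym.items():
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

# toast => bread => baked_goods => food <= meat: four steps,
# within the five-node threshold used for "typical" items.
print(isa_distance("toast", "meat", hypernym))  # 4
```

A word exceeding the threshold from every attested filler would, on this scheme, count as semantically ''atypical'' for the slot.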
As mentioned above, for each of the nine schemas, we attempted to control for differ-
ences in the fourth word frequencies as far as possible (the first three words were identical).
However, it was not possible to match the frequency of the final word, bigram, or trigram of
the unfamiliar items with the familiar items. Similarly it was not possible to control the fre-
quency of component words, bigrams, or trigrams across different schemas. As we would
expect these component frequencies to affect children’s ability to repeat sequences, we fac-
tored their effect out by including them as predictors in all regression models. The 10 fre-
quency counts for each four-word sequence (i.e., the frequency of the four-word sequence
and its four component words, three component bigrams, and two component trigrams) are
given in Appendix A.
To allow us to evaluate the impact of all these separate frequencies without introducing
multicollinearity into our models, we reduced the counts to orthogonal dimensions using
principal components analysis. We did this separately for the familiar and unfamiliar
items as they were intended to be used in separate analyses. We retained all factors with
eigenvalues greater than 1, which left us with four components for the unfamiliar items
(accounting for 95% of the total variance), and three components for the familiar items
(accounting for 93% of total variance). A fuller description of this procedure and a discus-
sion of the loadings for the selected components can be found in Appendix S1.
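This reduction can be sketched with plain numpy: standardize the frequency counts, take the eigendecomposition of their correlation matrix, and retain the components whose eigenvalues exceed 1 (the Kaiser criterion implied by the text). The random counts below are stand-ins for the real frequency table in Appendix A:

```python
import numpy as np

def pca_kaiser(X):
    """Project the standardized columns of X onto the principal components
    of their correlation matrix whose eigenvalues exceed 1.
    Returns (component scores, explained-variance ratios)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]            # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                         # Kaiser criterion
    scores = Z @ eigvecs[:, keep]                # orthogonal components
    return scores, eigvals[keep] / eigvals.sum()

rng = np.random.default_rng(0)
# Stand-in for 18 items x 10 correlated log-frequency counts.
base = rng.normal(size=(18, 3))
X = base @ rng.normal(size=(3, 10)) + 0.3 * rng.normal(size=(18, 10))
scores, ratios = pca_kaiser(X)
print(scores.shape, ratios.sum())
```

The retained scores are mutually uncorrelated, so they can enter a regression together without multicollinearity.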
To summarize, this procedure gave us, for each schema, one familiar (high-frequency)
sequence and two unfamiliar (unseen) sequences. One of the unfamiliar sequences had a
final word that was semantically similar to the familiar item observed in this position in our
corpus (unfamiliar, typical) and the other had a dissimilar final word (unfamiliar, atypical).
The final 27 stimulus sequences and their properties are presented in Table 1.
All sequences were read by a female British English speaker with normal declarative
intonation and recorded in a soundproof booth onto a computer disk with a sampling fre-
quency of 44,100 Hz using SoundStudio v.3 (Freeverse, New York, NY, USA). To ensure
that the first three words of all matched sequences were identical, we took one sequence as a
base and created the matched pair by splicing in the final word using the open-source
Audacity software v.1.2.4. We used randomly selected familiar sequences, unfamiliar typical
sequences, and unfamiliar atypical sequences as bases for a third of the items each.
To ensure that sequences of the same schema type were not encountered in close succes-
sion, test items were presented in three blocks of nine items with each block containing one
of the variants of a schema in one of two fixed orders (one the reverse of the other), such
that each of the three sequences belonging to the same schema was always nine items apart.
All three blocks contained an equal number of familiar and unfamiliar sequences and typical
and atypical items. These blocks were presented in six orders, with order of presentation
counterbalanced for each age group.
2.3. Procedure
The experimenter, E, sat with the child at a table in front of a computer (the child either
sat alone or on a parent’s knee). E produced a picture of a tree with several stars in the
branches and explained they would cover each star with a parrot sticker. E explained that, to
get the stickers, they needed to listen to what the computer would say and then say the same
thing. Every time they did so, part of a cartoon parrot would appear on the computer. Once
they could see the whole parrot (which appeared every three trials), they would get a parrot
sticker. E proposed to have a go first. She then clicked on a mouse to play the first of six
example sequences and repeated the sequence. She repeated this for the next two example
sequences, at which point a full parrot was visible and so E awarded herself a sticker before
offering the child a turn. The final three example sequences were used for the child to prac-
tice the procedure. E helped the child or replayed the practice sound files once each if neces-
sary. Each time the child had attempted to repeat three sequences s/he was given a sticker.
E then played the test sequences in exactly the same manner except that no help was given,
no sound files were replayed, and E did not help the child repeat anything. If the child did
not spontaneously repeat a sequence after a reasonable delay, E prompted the child once
(saying "Can you say that?"). If the child did not then respond, or if anything other than this
prompt came between the stimulus sequence and the repetition, the response was excluded.
Responses were also excluded if the child did not hear the stimulus sequences (e.g., if the
474 D. Matthews, C. Bannard / Cognitive Science 34 (2010)
child spoke unexpectedly as the sound file played). In total, 148 of a possible 1,593
responses were excluded. The procedure continued until all 27 sentences were repeated.
Responses were recorded onto computer disk using Audacity recording software.
2.4. Transcription and error coding
Each word in each sequence was coded for the presence or absence of the errors in
Table 2. (The use of such criteria was found in previous studies to improve coder accuracy
in comparison to a procedure where coders directly coded the accuracy of each whole three-
word stem as correct or incorrect.) If the child did not make a single error on the first three
words of the sequence, this sequence was coded as correctly repeated; otherwise it was
incorrect. We did not consider errors made on the fourth word as our focus here was on the
child’s competence with the schema and we wished to minimize the impact of the phonetic
details of the novel item. If a child did not respond to an item, it was discarded along with
the other items in that schema. Two research assistants blind to the hypotheses of the experi-
ment transcribed and coded all the children’s responses from audio files. Agreement
between these coders was good (Agreement: 82%, Cohen’s kappa = 0.62). A third research
assistant, also blind to the hypotheses of the experiment, checked all cases in which the first
two coders did not yield identical coding for each word, listened to the relevant response,
and resolved the discrepancy.
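The agreement statistics just reported can be computed directly from the two coders' word-level codes. Below is a minimal sketch of Cohen's kappa in Python; the codings are invented for illustration and are not the study's transcription data.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two coders."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement expected from each coder's marginal code frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Invented word-level codes (1 = error present, 0 = absent) for two coders.
coder_a = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
coder_b = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1]
print(round(cohens_kappa(coder_a, coder_b), 2))
```

Raw agreement here is 80%, but kappa discounts the agreement expected by chance given how often each coder used each code, which is why the study reports both figures.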
3. Results
All of the children attempted to repeat the vast majority of items (1,445 observations in
total). The 2-year-olds repeated the first three words of 21% of the unfamiliar sequences and
30% of the familiar sequences correctly. The 3-year-olds repeated the first three words of
49% of the unfamiliar sequences and 54% of the familiar sequences correctly. As noted in
the method, this apparent frequency effect may stem from the frequency of the four-word
sequences or their component words, bigrams, or trigrams (because these counts are highly
correlated).

Table 2
Error codes used for children's responses

Code              Error
Repetition        Whole word or one syllable of the word is repeated.
Deletion          Whole word is missing.
Insertion         Insertion of a word or isolated phonetic material before the target word.
Substitution      Target word substituted for a different word.
Mispronunciation  Target word is missing a phoneme, has a phoneme inserted, or is a
                  morphological variant of the target word (e.g., "bump" instead of
                  "bumped" in "you bumped your head"). Missing phonemes that yielded
                  a pronunciation compatible with adult speech and regional dialect
                  (e.g., "back int box," which is acceptable in northern England) were
                  not scored as errors. The pronunciation of "the" as "de" was also
                  accepted.

We will not discuss frequency effects here but rather include in all models the
four frequency scores derived through principal components analysis (see Appendix S1).
Because of the need to factor out these confounds before the effect of our factors of interest
can be usefully observed, we do not present raw data here.
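The use of principal components as frequency controls can be illustrated in miniature. The sketch below rotates two correlated variables into uncorrelated components via a closed-form 2x2 PCA; the actual analysis (Appendix S1) used four correlated frequency counts, and the log frequencies here are invented.

```python
import math

def principal_components_2d(xs, ys):
    """PCA for two correlated variables: returns scores on the two components."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]  # centered variables
    cy = [y - my for y in ys]
    sxx = sum(a * a for a in cx) / n
    syy = sum(b * b for b in cy) / n
    sxy = sum(a * b for a, b in zip(cx, cy)) / n
    # Rotation angle that diagonalizes the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(cx, cy)]
    pc2 = [-s * a + c * b for a, b in zip(cx, cy)]
    return pc1, pc2

# Invented, highly correlated log frequencies of trigrams and four-word strings.
tri = [5.1, 4.2, 3.9, 6.0, 4.8, 3.5]
quad = [4.9, 4.0, 3.7, 5.8, 4.5, 3.4]
pc1, pc2 = principal_components_2d(tri, quad)
corr_pc = sum(a * b for a, b in zip(pc1, pc2))
print(round(corr_pc, 10))  # components are uncorrelated by construction
```

Entering such components, rather than the raw counts, as predictors avoids the multicollinearity that would otherwise make the individual frequency coefficients uninterpretable.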
To investigate the relationship between correct repetition of the first three words of a
sequence and the factors of current interest, we fitted mixed effects logistic regression mod-
els to the data using Laplace approximation (Baayen, 2008; Baayen, Davidson, & Bates,
2008; Dixon, 2008; Gelman & Hill, 2007; Jaeger, 2008). The outcome variable in all models
was whether the first three words of a sequence were correctly repeated (1) or not (0). Child
(N = 59) was added to all models as a random effect on the intercept in order to account for
individual differences. We also ran models with extra random effects for the nine schema
types and the 27 final words of each sequence, but the variance for these factors was always
extremely low—standard deviation always <0.001. We therefore did not include the schema
and item variables in our reported analyses. Including these random effects did not change
the statistical outcome of the results, and models with item and/or schema included as random
effects provided a substantially poorer fit to the data (a substantially higher AIC score)
than models including our selected fixed effect predictors. Taken together, these findings
indicate that item differences other than the manipulated or controlled variables had minimal
impact on the children's performance. Finally, for all models, we tried introducing the
syntactic type of the schema into the model as a random effect. We discuss the impact of this on
our models below. All noncategorical predictors were centered by calculating the mean for
the variable and subtracting it from each value. In Appendix S2, we report on an extensive
analysis of the relationship between our predictors, looking for sources of multicollinearity,
and suggest that we can be confident in the analyses presented here.
Putting the control variables into our model allowed us to examine the effect of the fol-
lowing manipulated variables:
1. Age (2 or 3 years old)
2. Slot entropy (continuous)
3. Semantic density (continuous)
4. Final word typicality (typical or atypical)
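Of these, slot entropy is the Shannon entropy of the distribution of words attested in the final slot of a schema in the corpus. A minimal sketch with invented counts (the fillers and frequencies are illustrative, not drawn from the child-directed speech corpus):

```python
import math
from collections import Counter

def slot_entropy(final_word_counts):
    """Shannon entropy (in bits) of the final-slot filler distribution."""
    total = sum(final_word_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in final_word_counts.values())

# Invented counts of fillers seen after a three-word stem like "a piece of".
varied_slot = Counter({"toast": 5, "paper": 5, "cake": 5, "cheese": 5})
fixed_slot = Counter({"toast": 19, "paper": 1})

print(slot_entropy(varied_slot))  # high entropy: ending hard to predict
print(slot_entropy(fixed_slot))   # low entropy: ending near-certain
```

The prediction under test is that children repeat unfamiliar variants of a schema more accurately when this value is high, that is, when the input gives them weak expectations about how the sequence ends.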
The principal question of interest is whether these factors affect children’s ability to
repeat unfamiliar sequences. We therefore first fitted a model to the repetition data for the
novel sequences. We added each of these variables to the model in order to examine their
predictive value over and above our controls. We use likelihood ratio tests to compare
nested models and Akaike’s information criterion (AIC) values to compare nonnested mod-
els. We also report McFadden’s log-likelihood ratio index (LLRI; McFadden, 1974) as a
measure of the practical significance of the differences between models.2 First of all, age
was found to lead to a significant improvement in fit when added to a model with only our
controls as predictors (χ²(1) = 23.3, p < .0001, LLRI = 0.021). Further, adding slot entropy
again substantially improved the fit of the model (χ²(1) = 8.25, p < .005, LLRI = 0.008).
Adding semantic density to a model containing our controls and age did not lead to a
significant improvement in fit (χ²(1) = 1.15, p = .284, LLRI = 0.001), and the composite
model had a higher AIC than the model including the controls, age, and slot entropy,
indicating that slot entropy has greater predictive value than semantic density. However, a
model including the controls, age, slot entropy, and semantic density had a significantly
better fit than a model containing only the controls, age, and slot entropy (χ²(1) = 4.6,
p < .05, LLRI = 0.004), indicating that semantic density does have predictive value (once
slot entropy is accounted for) and accounts for additional variance over and above that
accounted for by slot entropy. The addition of the typicality of the test item as a predictor
offered no significant improvement in fit over a model that contained only the controls and
age (χ²(1) = 0.31, p = .578, LLRI < 0.001). Similarly, it did not improve fit for models that
additionally contained slot entropy (χ²(1) = 0.58, p = .455, LLRI < 0.001), semantic
density (χ²(1) = 0.25, p = .614, LLRI < 0.001), or both (χ²(1) = 0.54, p = .462, LLRI <
0.001), indicating that it had no predictive value. We similarly found that including the
human typicality ratings (in Appendix S3) as a continuous predictor gave no improvement
in fit when added to a model containing the controls plus age (χ²(1) = 0.17, p = .676,
LLRI < 0.001) or when we additionally added slot entropy (χ²(1) = 0.67, p = .41, LLRI <
0.001), semantic density (χ²(1) = 0.29, p = .59, LLRI < 0.001), or both (χ²(1) = 1.59,
p = .21, LLRI = 0.002).
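The model-comparison logic used throughout (a likelihood ratio test for nested models, with McFadden's LLRI as an effect-size measure) can be sketched as follows. The log-likelihood values are placeholders chosen only to land on the same scale as the statistics reported above; they are not the fitted values from our models.

```python
import math

def lr_test_1df(loglik_null, loglik_full):
    """Likelihood ratio test for nested models differing by one parameter."""
    chi2 = 2 * (loglik_full - loglik_null)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x/2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def mcfadden_llri(loglik_null, loglik_full):
    """McFadden's log-likelihood ratio index (a partial pseudo-R^2)."""
    return 1 - loglik_full / loglik_null

# Placeholder log-likelihoods: a controls-only model vs. one adding a predictor.
ll_null, ll_full = -550.0, -545.875
chi2, p = lr_test_1df(ll_null, ll_full)
print(round(chi2, 2), p < .005, round(mcfadden_llri(ll_null, ll_full), 4))
```

A predictor is retained when this test is significant; for the non-nested comparisons in the text, AIC values are compared instead.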
In Table 3, we report on the parameters of a model (model 1) that contained all controls
and experimental variables. This had a significantly better fit to the data than a baseline
model that included only the random effect of child, the control principal components, and
age as predictors (χ²(3) = 13.36, p = .004, LLRI = 0.013). For this model, the estimated
intercepts for the children varied with a standard deviation of 1.03. Age, slot entropy, and
semantic density were all significant (positive) predictors, whereas typicality (included here
as a categorical value; the same pattern was obtained when including the mean human
judgments) was not. These results reflect the fact that 2-year-olds were more likely to make
errors than 3-year-olds and that schemas with higher slot entropy and higher semantic
density were more likely to be correctly repeated.

Table 3
Fixed effects in model 1 fitted to data for unfamiliar sequences

                      B    HPD Lower  HPD Upper    SE      Z    p-Value
(Intercept)       -0.67      -1.28      -0.09    0.28  -2.37    .018
Frequency PC1     -0.11      -0.43       0.14    0.14  -0.81    .42
Frequency PC2      0.31       0.09       0.55    0.12   2.69    .007
Frequency PC3      0.99       0.45       1.52    0.26   3.74    <.001
Frequency PC4     -0.09      -0.30       0.13    0.10  -0.85    .397
Age                0.84       0.54       1.22    0.16   5.27    <.001
Slot entropy       1.00       0.41       1.57    0.29   3.44    <.001
Semantic density   0.23       0.02       0.45    0.11   2.17    .030
Typicality        -0.12      -0.44       0.19    0.16  -0.75    .455

Note. Concordance between the predicted probabilities and the observed responses,
C = 0.838. Somers' Dxy (rank correlation between predicted probabilities and observed
responses) = 0.676 (cf. Baayen, 2008).

They are thus consistent with the predictions
that ability to reproduce unseen forms will be greater when (a) children have less specific
expectations about what should occur in the final word position and (b) the items previously
attested in the final word position are more semantically homogeneous. In addition to our
estimated maximum-likelihood parameters, we also report on a Bayesian analysis (as rec-
ommended by Baayen et al., 2008) in which we approximate the full posterior distribution
using Gibbs sampling. All model parameters were sampled from normal distributions with
noninformative priors (see section 17.4 of Gelman & Hill, 2007, for BUGS code for a simi-
lar mixed-effects logistic regression model). We show the lower and upper bounds of the
95% highest posterior density (HPD) intervals for each model parameter. This interval covers
95% of the posterior probability and provides a measure of uncertainty. That this interval
does not cross 0 for age, slot entropy, or semantic density gives us further confidence that
they are useful positive predictors of repetition performance.
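The HPD interval itself is straightforward to obtain from posterior samples: to sampling accuracy, it is the shortest interval containing 95% of the draws. A stdlib-only sketch using synthetic draws rather than our actual Gibbs output:

```python
import math
import random

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior samples."""
    s = sorted(samples)
    k = math.ceil(mass * len(s))  # draws each candidate interval must contain
    # Slide a window of k consecutive sorted draws; keep the narrowest one.
    i = min(range(len(s) - k + 1), key=lambda j: s[j + k - 1] - s[j])
    return s[i], s[i + k - 1]

random.seed(1)
# Synthetic posterior for a positive slope, roughly on the scale of slot entropy.
draws = [random.gauss(1.0, 0.3) for _ in range(20000)]
lo, hi = hpd_interval(draws)
print(round(lo, 2), round(hi, 2))
```

An interval that excludes 0, as here, is what licenses treating a predictor as reliably positive.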
To test for possible interactions between experimental factors, we ran a more complex
variant of model 1 adding all two-way interactions between age, slot entropy, semantic den-
sity, and typicality, again fitting the model to the data for the low-frequency sequences. This
model was not a significant improvement on model 1 (χ²(6) = 8.44, p = .208, LLRI =
0.008) and did not reveal any significant interactions. Simpler variants of model 1, adding
only the interaction between either age and slot entropy or age and semantic density, also did
not give any significant improvement in fit over model 1 or reveal any significant interac-
tions, suggesting that children from both age groups were similarly affected by these factors.
Finally, we wanted to explore what impact the syntactic type of the frame might have.
We did this by adding syntactic class into our model as a random effect on the intercept,
using the classification found in Appendix B. Adding this to a baseline model including
only the control variables and age resulted in a significant improvement in fit (χ²(1) = 5.1,
p < .024, LLRI = 0.005). However, this model had a higher AIC value than model 1 (indicating
that model 1 offers a better fit to the data). Furthermore, a model with child and syntactic
type as random effects on the intercept plus control variables, age, slot entropy,
semantic density, and typicality as fixed effects gave a significant improvement in fit over a
model including child and syntactic class as random effects with only control variables and
age as fixed effects (χ²(3) = 8.1, p < .05, LLRI = 0.008). Revealingly, when we added
syntactic class as a fixed effect to model 1 there was no improvement in fit (χ²(1) = 1.54,
p = .22, LLRI = 0.001), suggesting that the variance accounted for by syntactic class is a
subset of that accounted for by our predictors. In summary, our predictors were seen to have
significant explanatory value over and above that provided by the pooling of variance by
syntactic class and the analysis offers strong support for the view that they apply across
phrases of different syntactic types.
Having considered how the properties of a four-word sequence affect the repetition of
unfamiliar sequences, an additional question of interest is whether slot entropy and semantic
density also affect the production of highly familiar word sequences. As it is very
difficult to predict whether high-frequency items would benefit or not from high entropy
(see Introduction), this analysis was more exploratory. We again investigated the value of
the various predictors via model comparison. Adding age to a model including only the
controls again resulted in a significant improvement in fit (χ²(1) = 16.59, p < .0001, LLRI =
0.029). Unlike for the unfamiliar sequences, adding slot entropy to the model including the
controls and age resulted in no improvement in fit (χ²(1) = 1.54, p = .214, LLRI = 0.003).
However, the addition of semantic density to the model did result in a significant
improvement in fit (χ²(1) = 9.28, p < .005, LLRI = 0.016). Unlike for the unfamiliar
sequences, this did not depend on the inclusion of slot entropy. A model containing slot
entropy and semantic density in addition to the controls plus age offered no improvement in
fit over one including the controls, age, and semantic density alone (χ²(1) = 0.19,
p = .660, LLRI < 0.001).
Table 4 reports on the parameters for a model (model 2) fitted to the data for the
high-frequency items with all the predictors included (except of course typicality, which
was not varied for high-frequency sequences). Age was again a significant positive
predictor, with 2-year-olds being more likely to make mistakes in repetition. Semantic
density was found to be a
significant positive predictor, meaning that children were more likely to successfully repro-
duce a high-frequency sequence if the words that are typically seen in the last position of
the schema are highly similar. Slot entropy was not found to be a significant predictor. A
model including two-way interactions between age, slot entropy, and semantic density was
not found to be an improvement over model 2 (χ²(3) = 3.88, p = .275, LLRI = 0.007), and
no interactions were found to be significant. The same applied for simpler models including
any combinations of two-way interactions together or in isolation. We again also performed
a Markov chain Monte Carlo analysis and report HPD intervals for the model's parameters.
Finally, we again wanted to explore what impact the syntactic type of the frame might
have. We did this once more by adding syntactic class into our model as a random effect on
the intercept, using the classification found in Appendix B. Adding this to a baseline model
including only the control variables and age did not result in a significant improvement in fit
(χ²(1) = 0.008, p = .931, LLRI < 0.001). Adding syntactic class to model 2 as a random
effect resulted in no change in fit. Furthermore, model 2 had a much smaller AIC score
(569.6) than a model containing only the controls, age, and syntactic class as a random
effect (576.2). Thus, unlike for the unfamiliar sequences, the syntactic class of the sequence
seemed to have no effect on the children’s ability to produce the sequence.
Table 4
Fixed effects in model 2 fitted to data for familiar sequences

                      B    HPD Lower  HPD Upper    SE      Z    p-Value
(Intercept)       -0.46      -0.85      -0.12    0.17  -2.74    .006
Frequency PC1     -0.47      -0.75      -0.21    0.13  -3.56    <.001
Frequency PC2      0.54       0.32       0.80    0.11   4.81    <.001
Frequency PC3     <-0.01     -0.48       0.47    0.24   0.05    .969
Age                0.72       0.37       1.15    0.17   4.30    <.001
Slot entropy       0.11      -0.38       0.64    0.25   0.45    .652
Semantic density   0.38       0.68       1.53    0.13   2.88    .004

Note. Concordance between the predicted probabilities and the observed responses,
C = 0.845; Somers' Dxy = 0.689.
4. Discussion
The current experiment set out to test whether the distributional properties of simple
four-word schemas (as estimated using a large corpus of child-directed speech) would affect
how accurately unfamiliar versions of the schemas will be repeated by young children. One
prediction was that the less certain a child is as to the way a sequence will end given the
statistics of maternal input (the greater the slot entropy), the more likely he or she will be to
form a basic generalization and hence the easier he or she would find it to produce an unfa-
miliar sequence. This indeed appears to be the case. Children in both age groups were better
able to reproduce unfamiliar sequences with higher slot entropy. The semantic properties of
a slot also affect ease of repetition of unfamiliar sequences. The more semantically similar
the items that are likely to have been previously heard in a slot, the easier it was for children
to repeat an unfamiliar variant of that schema. The patterns used in our experiment spanned
syntactic phrase types, and we found in our statistical analysis that slot entropy and semantic
density had predictive value over and above syntactic class, suggesting that they affect
learning across phrase types.
In contrast to our predictions, we observed no effect of the typicality of the final word in
the unfamiliar sequences (assessed using both a categorical distinction based on the Word-
Net hierarchy and human judgments) and no interaction between the semantic density of the
slots in our schemas and the typicality of our items (suggesting that producing a sequence
that ended in a word that did not fit the semantics of the slot was apparently no harder if that
slot was semantically very constrained). As an anonymous reviewer pointed out, this finding
can be seen as consistent with a construction-based approach. That is, while the properties
of the elements seen in a slot should affect the identification of a schema ⁄ construction at the
point of learning, once a construction has been created, an open slot in a good construction
should be able to take any word of the appropriate category. While such an explanation is
plausible, we hesitate to explain our findings in this way. We see any sharp distinction
between patterns in language that are constructions and those that are not that might be
implied in the usage-based literature as an idealization for descriptive convenience rather
than a strong claim about mental representation. We prefer to think of the learner as identi-
fying very many patterns in the input which continue to compete for utilization, with the
specific distributional properties of a pattern remaining an important part of the
representation, rather than being discarded once a decision has been made to put a given
schema "in the grammar." Additionally, we suspect that the lack of a typicality effect can be
explained by aspects of our study design, as we now discuss.
Our measure of typicality was based on evaluating the similarity of individual words not
seen in the schema to individual words seen in the schema without considering the impact
of context. It is possible, then, that our manipulation of typicality was not effective because,
in creating items whose meaning matched particular observed items, we introduced a degree
of unnaturalness to the "typical" stimuli that may have disguised any effects. In finding
matches, we were also forced to use low-frequency words in some cases, so it could be that
the children were not that familiar with the semantics and semantic relatedness of some of
our items (e.g., the similarity of "box" to "case" and "water" to "liquid" might not be
transparent for a 2-year-old). Alternatively, in judging typicality or atypicality with refer-
ence to the particular familiar item used, we may not have picked up on confounding simi-
larities and distances from other words that can appear in the schemas. In brief, much
further work is required to develop developmentally plausible measures of semantics before
we can draw any firm conclusions about typicality effects.
Further work on semantics would also allow us to clarify the beneficial effect of semantic
density, which is potentially controversial. In the terminology of the usage-based tradition,
slot formation can be seen as an instance of category formation on the basis of functionally
based distributional analysis (cf. Tomasello, 2003, p. 124). That is, children should have
expectations concerning what words or phrases they are going to see in a particular position
based on the functions of the words that have been seen there before. This led us to predict a
positive effect of semantic density (and typicality). On the other hand, many theorists have
proposed that semantic openness would benefit productivity (e.g., Bybee, 1995; Goldberg,
2006). Our current findings of an effect of semantic density but not of typicality or of an
interaction between the two do not sit easily with either account. Further investigation will
be required to pull this apart, but the current results certainly suggest that a degree of seman-
tic coherence aids repetition even in the absence of a semantic link between the target sen-
tence and the construction semantics.
While the main focus of our study was on generalization and hence on children’s repeti-
tion of the unfamiliar items, we also asked children to repeat a single instantiation of each
schema that occurred with some frequency in our corpus and hence with which we could
expect the children to be familiar. The purpose of this was to investigate whether schema
properties affect processing even in circumstances where a sequence could in principle be
retrieved directly from memory. There was no effect of slot entropy on the repetition of
familiar items, suggesting that the children are employing a different route to production for
such items. We did, however, observe an effect of semantic density on the repetition of
familiar items, which would suggest the opposite. Further work will be required to clarify
why we see an effect of one factor but not the other. As we noted in the introduction and
explain further in the method, it could simply be that the relationship between the frequency
of the familiar string and the entropy of the schema is not a straightforward one in our stim-
uli. Further testing with more items would of course help to clarify this, but, as we now dis-
cuss, expanding our list of items is not straightforward.
In the current study, we were able to identify items that were dispersed over the range of
slot entropy and semantic density values. However, doing so left us little freedom in choos-
ing items. The fact that many of the factors that are considered to contribute to children’s
language learning are difficult to isolate in this way is not only a practical problem. It also
shows how different factors overlap in the input (sometimes supporting one another and
sometimes conflicting with one another), and thus it emphasizes the gap between the kind of
idealized problems children face in artificial grammar learning experiments and those
children face in learning language. Bridging this gap will almost certainly require
conducting more experiments of the current variety. Doing so will allow us to investigate
phonological, syntactic, and semantic factors that we were not able to control in the current
study.
A final limitation of the current study stems from the nature of the repetition task. It is
usually assumed that when asked to repeat an utterance children analyze the utterance and
then generate it as they would in ordinary speech. The task thus draws on comprehension
and production skills in turn. Failure to repeat the utterance might be due to difficulty
understanding it, difficulty articulating it, or both. Complementary methodologies are
required to further clarify when in processing the effects we report take hold. Alternative
methods would also allow us to explore the task specificity of the current effects. For
example, it could be that the test situation leads the child to be more conservative or more
careful than the child would be in normal speech, which might explain, for example, our
failure to find an effect of typicality.
So what are the broader implications of the current study for language learning? In previ-
ous work (Bannard & Matthews, 2008) we have shown that sequences of sounds that are
heard with little variation in the input are likely (as predicted by the many findings in the
word segmentation literature) to be identified as units of language that are candidates for
words or holophrases, with direct reuse of such sequences from the input being preferred
where available and frequent. In the present paper, we have shown that if such sequences
occur with some points of variation then the possibility of forming productive morpho-syn-
tactic slots arises and becomes more likely if slot fillers form coherent categories. Unfamil-
iar sequences that match resulting, partially abstract schemas will be processed more
fluently (cf. Buchner, 1994; Pothos, 2007, for similar effects of fluency in artificial grammar
learning). This proposal is in line with a growing literature on ‘‘variation sets’’—successive
utterances in child-directed speech that have partial lexical overlap (Kuntay & Slobin, 1996;
Onnis, Waterfall, & Edelman, 2008). These studies suggest the effects observed in the cur-
rent study arise because many of the three-word stems will have occurred in variant forms
in quick succession in the input.
The processes of learning we have sketched here are arguably most consistent with con-
structivist approaches to language development (e.g., Edelman, 2007; Goldberg, 2006; Tom-
asello, 2003). On such accounts grammatical development occurs in a piecemeal fashion
with early knowledge consisting of sequences of words taken directly from the input with
limited generalization across forms. In the present study, we have provided evidence that
children’s ability to produce novel sequences of words can be predicted from their previous
experience with overlapping sequences, and that this holds for 3-year-olds as for 2-year-
olds. We note, however, that this does not rule out the likely possibility that children this
young might be quite adept at producing syntactic structures even in the absence of exposure
to many directly overlapping forms. Rather this finding demonstrates that children are sensi-
tive to statistical regularities in their language that are plausibly relevant to learning about
syntactic structure. We find this question of learnability more interesting than the question
of when precisely children show abstraction of a given syntactic structure (cf. Pulvermuller
& Knoblauch, 2009 for a recent attempt at a neurally plausible account of the acquisition of
a simple combinatorial grammar where abstraction and learnability sit happily together).
We should also note that even if highly abstract syntactic structures are in principle avail-
able to the child, it is not obvious that the child should prefer to store or use them. Indeed
we would suggest that lexically specific representations are unlikely to be just a ladder to
abstract syntax, to be kicked away once learning is complete. Rather, they might be
expected to form part of any rational agent’s model of the language he or she is trying to
learn. A rational learner will want to find the model that assigns a high probability to the
exact data observed and reduces the probability of other possible sets of data (see chapter 28
of Mackay, 2003 for a detailed Bayesian approach to model comparison of this kind).
Abstract models by their very nature are less tied to particular data and can be used to gener-
ate a larger set of possible language. All else being equal, we would expect a rational learner
to reuse the input as much as possible even once he or she has acquired additional compe-
tence. Although it is by no means clear whether a psychologically plausible model of lan-
guage learning will reveal children to be rational in this sense, this might explain why we
see these kinds of lexically specific representations relatively late in development even once
more abstract representations can be expected to have emerged.
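The rational learner argument can be made concrete with a toy comparison: score a lexically specific model and a more abstract one by the probability each assigns to the same observed data. (A full Bayesian treatment would integrate over parameters, as in MacKay's treatment of model comparison; all strings and probabilities below are invented for illustration.)

```python
# Toy data: ten observed tokens of a four-word pattern.
observed = ["a piece of toast"] * 8 + ["a piece of cake"] * 2

def data_probability(model, data):
    """Probability a model assigns to the exact sequence of observations."""
    p = 1.0
    for item in data:
        p *= model.get(item, 0.0)
    return p

# Lexically specific model: probability mass matched to the attested strings.
specific = {"a piece of toast": 0.8, "a piece of cake": 0.2}
# Abstract model: mass spread evenly over ten licensed completions.
fillers = ["toast", "cake", "paper", "cheese", "bread",
           "string", "wood", "brick", "chalk", "cloth"]
abstract = {f"a piece of {w}": 1 / len(fillers) for w in fillers}

print(data_probability(specific, observed) > data_probability(abstract, observed))
```

The specific model assigns far higher probability to the observed data even though the abstract one generates more strings, which is the sense in which a rational learner has reason to retain lexically specific representations alongside more abstract ones.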
So what linguistic theories might account for our data? The idea that speakers store and
use sequences of specific words has been acknowledged by all models of syntax and is not
exclusive to usage-based accounts. All theories, after all, need to account for the presence in
language of idiomatic phrases. Where theories differ is in how such phrases fit into their
account. Early generative accounts regarded idioms as simply an extension of a lexicon that
was very much separate from the core grammatical processes, and they were argued to
obtain meaning "in the manner of a lexical item rather than as a projection from the
meanings of its constituents in the manner of compositional complex constituents..." (Katz,
1973, p. 358). It has come to be acknowledged that the kind of phrases that reoccur with
frequency and that appear not to be the result of a fully abstract generative process is rather
larger than earlier theorists had supposed (Jackendoff, 1995). Furthermore, the distinction
between grammar and lexicon has come to be regarded as unsustainable in many contempo-
rary generative models where information about how words can combine is a part of lexical
entries, with composition occurring via uniform operations (e.g., Bresnan, 2001; Croft,
2001; Goldberg, 1995; Pollard & Sag, 1994; I. A. Sag, unpublished data; Steedman, 2000).
There has been a growing awareness that multiword sequences interact with syntactic and
semantic phenomena in a way that makes a dual-route model in which they are stored sepa-
rately untenable (e.g., Nunberg, Sag, & Wasow, 1994), and word sequences have come to
be acknowledged as integrated with core grammatical processes (e.g., Culicover & Jackend-
off, 2005; Jackendoff, 2002).
While our findings are incompatible with an account of syntactic competence that draws
a strict distinction between memory-based processing at the word level and procedural
processing for grammar (Ullman, 2001), they could, it seems, be accounted for by any model
of syntax in which sequence-specific processing is given a role. However, it is important to
note that accounting for the behavior observed here requires any such theory to be somewhat
liberal in deciding which sequences will be stored. The integration of sequence- or
construction-level representations and processes into theories of grammatical competence has
been motivated by the observation that there are sentences that cannot otherwise be
accounted for. Arguments for this have tended to be based on the syntactic or semantic
nature of the phrase and its incompatibility with general compositional or productive
processes. We see no reason
to believe that the patterns which we use in our study are syntactically or semantically
idiosyncratic. The explanation for the children’s having pattern-specific representations
seems rather to be purely a matter of their distribution. This fact is easiest to accommodate
within a usage-based approach, where linguistic knowledge is made up of pairings of function
with form at any point on the lexically specific/syntactically abstract continuum.
D. Matthews, C. Bannard / Cognitive Science 34 (2010) 483
If we agree that constructions have some psychological primacy across the life span, then
this study makes a contribution in suggesting what factors would lead to their identification.
However, regardless of how we want to characterize the end point of learning, the results
here favor the acceptance of a model of syntactic competence in which lexically specific
processing plays a substantial role at the ages of 2 and 3.
Notes
1. This prediction is complicated by the fact that familiar sequences may vary in the
degree to which they are an expected completion of a known pattern. For some items
lower entropy might be especially beneficial. This would be the case if one had a
strong expectation about what would come next and the high-frequency sequence
fulfilled that expectation. However, it is possible that, although highly frequent, some
items would not be the most expected for a child, and for these there may be a degree
to which higher entropy is better.
2. The log likelihood ratio index indicates the proportion of the variance explained by
the more complex model that is accounted for by the predictors of interest. It can be
interpreted as a partial pseudo-R2 value (see Veall & Zimmermann, 1996).
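One way to read this index concretely is as the share of the full model's log-likelihood improvement over an intercept-only model that comes from the predictors of interest. The sketch below is an illustration of that reading only; the function name and the log-likelihood values are hypothetical, not the authors' code or data.

```python
def ll_ratio_index(ll_null, ll_reduced, ll_full):
    # Improvement contributed by the predictors of interest (present in
    # the full model but absent from the reduced one), as a share of the
    # full model's total improvement over the intercept-only (null) model.
    return (ll_full - ll_reduced) / (ll_full - ll_null)

# Hypothetical log likelihoods for three nested models:
print(ll_ratio_index(ll_null=-500.0, ll_reduced=-460.0, ll_full=-440.0))  # prints 0.3333333333333333
```

On this reading the index, like McFadden's pseudo-R2, runs from 0 (the predictors of interest add nothing beyond the reduced model) to 1 (they account for all of the full model's improvement).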
Acknowledgments
The authors would like to thank Jess Butcher, Ellie O’Malley, Manuel Schrepfer, and
Elizabeth Wills for help in data collection and coding; Harald Baayen and Roger Mundry for
statistical advice; and Bruno Estigarribia, Adele Goldberg, and Julian Pine for helpful
comments on the manuscript. This research was supported by postdoctoral fellowships
awarded to both authors by the Max Planck Institute for Evolutionary Anthropology, Leipzig.
References
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge,
England: Cambridge University Press.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for
subjects and items. Journal of Memory and Language, 59(4), 390–412.
Bannard, C., & Matthews, D. E. (2008). Stored word sequences in language learning: The effect of familiarity on
children’s repetition of four-word combinations. Psychological Science, 19(3), 241–248.
Braine, M. (1976). Children’s first word combinations. Monographs of the Society for Research in Child
Development, 41(1), Serial No. 164.
Bresnan, J. (2001). Lexical-functional syntax. Malden, MA: Blackwell.
Buchner, A. (1994). Indirect effects of synthetic grammar learning in an identification task. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 20(3), 550–566.
Bybee, J. (1985). Morphology: A study of the relation between meaning and form. Amsterdam: John Benjamins.
Bybee, J. (1995). Regular morphology and the lexicon. Language and Cognitive Processes, 10(5), 425–455.
Croft, W. (2001). Radical construction grammar: Syntactic theory in typological perspective. Oxford, England:
Oxford University Press.
Culicover, P. W., & Jackendoff, R. (2005). Simpler syntax. Oxford, England: Oxford University Press.
Dixon, P. (2008). Models of accuracy in repeated-measures designs. Journal of Memory and Language, 59(4),
447–456.
Edelman, S. (2007). Behavioral and computational aspects of language and its acquisition. Physics of Life
Reviews, 4, 253–277.
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Freudenthal, D., Pine, J. M., Aguado-Orea, J., & Gobet, F. (2007). Modelling the developmental pattern
of finiteness marking in English, Dutch, German and Spanish using MOSAIC. Cognitive Science, 31,
311–341.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge,
England: Cambridge University Press.
Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure. Chicago:
University of Chicago Press.
Goldberg, A. E. (2006). Constructions at work: The nature of generalization in language. Oxford, England:
Oxford University Press.
Gomez, R. L. (2002). Variability and detection of invariant structure. Psychological Science, 13(5), 431–436.
Gomez, R. L., & Lakusta, L. (2004). A first step in form-based category abstraction by 12-month-old infants.
Developmental Science, 7(5), 567–580.
Gomez, R. L., & Maye, J. (2005). The developmental trajectory of nonadjacent dependency learning. Infancy,
7(2), 183–206.
Hale, J. (2006). Uncertainty about the rest of the sentence. Cognitive Science, 30(4), 609–642.
Harris, Z. (1964). Distributional structure. In J. Fodor & J. Katz (Eds.), The structure of language: Readings in
the philosophy of language (pp. 33–49). Englewood Cliffs, NJ: Prentice Hall.
Hirst, G., & St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of
malapropisms. In C. Fellbaum (Ed.), WordNet: An electronic lexical database (pp. 305–332). Cambridge,
MA: MIT Press.
Jackendoff, R. (1995). The boundaries of the lexicon. In M. Everaert, E. Van der Linden, A. Schenk, &
R. Schreuder (Eds.), Idioms: Structural and psychological perspectives (pp. 133–165). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Jackendoff, R. (2002). Foundations of language: Brain, meaning, grammar, evolution. Oxford, England: Oxford
University Press.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit
mixed models. Journal of Memory and Language, 59(4), 434–446.
Johnson, E. K., & Tyler, M. D. (in press). Testing the limits of artificial language learning. Developmental
Science.
Katz, J. (1973). Compositionality, idiomaticity and lexical substitution. In S. Anderson & P. Kiparsky (Eds.),
A Festschrift for Morris Halle (pp. 392–409). New York: Holt, Rinehart and Winston.
Keller, F. (2004). The Entropy Rate Principle as a predictor of processing effort: An evaluation against
eye-tracking data. Paper presented at Empirical Methods in Natural Language Processing (EMNLP),
Barcelona.
Kempe, V., Brooks, P. J., Mironova, N., Pershukova, A., & Fedorova, O. (2007). Playing with word endings:
Morphological variation in the learning of Russian noun inflections. British Journal of Developmental
Psychology, 25(1), 55–77.
Kuntay, A. C., & Slobin, D. (1996). Listening to a Turkish mother: Some puzzles for acquisition. In D. Slobin,
J. Gerhardt, A. Kyratzis, & T. Guo (Eds.), Social interaction, social context and language: Essays in honor of
Susan Ervin-Tripp (pp. 265–286). Hillsdale, NJ: Erlbaum.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of
acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211–240.
Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126–1177.
Lieven, E. V. M., Pine, J. M., & Baldwin, G. (1997). Lexically based learning and early grammatical
development. Journal of Child Language, 24, 187–219.
MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge, England:
Cambridge University Press.
Magnuson, J. S., Mirman, D., & Strauss, T. (2007). Why do neighbors speed visual word recognition but slow
spoken word recognition? 13th Annual Conference on Architectures and Mechanisms for Language Processing.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers
in econometrics (pp. 105–142). New York: Academic Press.
Monaghan, P., & Christiansen, M. H. (2008). Integration of multiple probabilistic cues in syntax acquisition. In
H. Behrens (Ed.), Corpora in language acquisition research (pp. 139–163). Amsterdam: John Benjamins.
Moscoso del Prado Martín, F., Kostić, A., & Baayen, H. (2004). Putting the bits together: An information
theoretic perspective on morphological processing. Cognition, 94(1), 1–18.
Nunberg, G., Sag, I. A., & Wasow, T. (1994). Idioms. Language, 70, 491–538.
Onnis, L., Waterfall, H. R., & Edelman, S. (2008). Learn locally, act globally: Learning language from variation
set cues. Cognition, 109(3), 423–430.
The Oxford English Dictionary. (1989). Available at http://www.oed.com.
Pelucchi, B., Hay, J. F., & Saffran, J. (2009). Statistical learning in a natural language by 8-month-old infants.
Child Development, 80(3), 674–685.
Pine, J. M., & Lieven, E. V. M. (1997). Slot and frame patterns and the development of the determiner category.
Applied Psycholinguistics, 18, 123–138.
Pollard, C., & Sag, I. A. (1994). Head-driven phrase structure grammar. Chicago: University of Chicago
Press.
Pothos, E. M. (2007). Theories of artificial grammar learning. Psychological Bulletin, 133(2), 227–244.
Potter, M. C., & Lombardi, L. (1990). Regeneration in the short-term recall of sentences. Journal of Memory and
Language, 29(6), 633–654.
Pulvermüller, F., & Knoblauch, A. (2009). Discrete combinatorial circuits emerging in neural networks: A
mechanism for rules of grammar in the human brain? Neural Networks, 22(2), 161–172.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science,
274(5294), 1926–1928.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana, IL: University of
Illinois Press.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237(4820),
1317–1323.
Steedman, M. (2000). The syntactic process. Cambridge, MA: MIT Press.
Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition. Cambridge,
MA: Harvard University Press.
Ullman, M. T. (2001). The declarative/procedural model of lexicon and grammar. Journal of Psycholinguistic
Research, 30(1), 37–69.
Valian, V., & Aubry, S. (2005). When opportunity knocks twice: Two-year-olds’ repetition of sentence subjects.
Journal of Child Language, 32(3), 617–641.
Veall, M. R., & Zimmermann, K. F. (1996). Pseudo-R2 measures for some common limited dependent variable
models. Journal of Economic Surveys, 10(3), 241–259.
Yamamoto, M., & Church, K. W. (2001). Using suffix arrays to compute term frequency and document
frequency for all substrings in a corpus. Computational Linguistics, 27(1), 1–30.
Appendix A
Log frequencies of stimulus sequences and their component words, bigrams, and trigrams in a 1.72 million-word child language corpus

Sequence                Freq.   W1      W2      W3      W4      B1      B2      B3      T1      T2
Back in the box         4.16    8.19    9.80    11.12   7.31    6.27    8.87    6.57    5.82    5.46
Back in the case        0.00    8.19    9.80    11.12   4.98    6.27    8.87    2.08    5.82    1.10
Back in the town        0.00    8.19    9.80    11.12   4.19    6.27    8.87    3.61    5.82    2.64
Out of the water        2.48    8.44    9.62    11.12   7.19    7.13    8.09    5.89    6.57    3.18
Out of the liquid       0.00    8.44    9.62    11.12   3.58    7.13    8.09    2.30    6.57    0.00
Out of the pudding      0.00    8.44    9.62    11.12   3.30    7.13    8.09    0.69    6.57    0.00
A piece of toast        3.61    10.74   6.94    9.62    6.50    6.03    6.59    4.70    5.89    4.34
A piece of meat         0.00    10.74   6.94    9.62    3.14    6.03    6.59    1.10    5.89    0.00
A piece of brick        0.00    10.74   6.94    9.62    4.11    6.03    6.59    0.69    5.89    0.00
It’s time for lunch     2.20    9.35    7.57    8.88    6.53    5.11    4.68    4.84    4.01    2.40
It’s time for soup      0.00    9.35    7.57    8.88    3.00    5.11    4.68    0.00    4.01    0.00
It’s time for drums     0.00    9.35    7.57    8.88    2.08    5.11    4.68    0.00    4.01    0.00
A bowl of cornflakes    2.20    10.74   6.28    9.62    6.19    4.54    4.36    3.93    3.81    2.56
A bowl of biscuits      0.00    10.74   6.28    9.62    5.97    4.54    4.36    3.26    3.81    0.00
A bowl of flowers       0.00    10.74   6.28    9.62    6.21    4.54    4.36    3.50    3.81    0.69
Have a nice day         2.64    9.48    10.74   8.54    7.32    7.92    7.06    4.50    4.76    4.23
Have a nice hour        0.00    9.48    10.74   8.54    4.36    7.92    7.06    0.00    4.76    0.00
Have a nice meal        0.00    9.48    10.74   8.54    5.12    7.92    7.06    2.08    4.76    1.95
You bumped your head    3.66    11.15   5.12    9.43    6.80    4.36    4.47    6.14    4.04    4.08
You bumped your leg     0.00    11.15   5.12    9.43    4.89    4.36    4.47    3.18    4.04    0.00
You bumped your toy     0.00    11.15   5.12    9.43    5.61    4.36    4.47    3.26    4.04    0.00
What a funny noise      2.77    9.70    10.74   6.62    6.88    6.16    5.60    4.80    3.33    4.62
What a funny sound      0.00    9.70    10.74   6.62    6.08    6.16    5.60    1.10    3.33    0.69
What a funny cup        0.00    9.70    10.74   6.62    6.47    6.16    5.60    0.00    3.33    0.00
Let’s have a look       5.56    7.80    9.48    10.74   9.08    5.90    7.92    6.80    5.80    6.74
Let’s have a see        0.00    7.80    9.48    10.74   8.86    5.90    7.92    1.61    5.80    0.69
Let’s have a think      0.00    7.80    9.48    10.74   9.21    5.90    7.92    1.39    5.80    1.10

Column key: Freq. = log frequency of the whole sequence; W1–W4 = log frequencies of the 1st–4th words; B1–B3 = log frequencies of the 1st–3rd bigrams; T1–T2 = log frequencies of the 1st and 2nd trigrams.

Note. 1 was added to all frequencies before taking the logarithms to accommodate frequencies of 0. For all
models reported in the body of the paper we also conducted alternative analyses in which PCs were built using
log frequencies in which we added values at intervals between 0.0000000001 and 1. The pattern of results (the
outcome of all model comparisons) was found to be the same regardless of the value added.
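The add-one transform described in the note can be sketched as follows. This is an illustration only, assuming natural logarithms; `log_freq` is a hypothetical helper, not code from the paper.

```python
import math

def log_freq(count, offset=1.0):
    # Log of (count + offset). With offset = 1, unattested sequences
    # (count 0) get a log frequency of exactly 0; the robustness check
    # described in the note re-ran the analyses with offsets between
    # 0.0000000001 and 1.
    return math.log(count + offset)

print(log_freq(0))           # prints 0.0
print(log_freq(0, 1e-10))    # a large negative value for a tiny offset
```

The choice of offset shifts the scores for zero-frequency items but, as the note reports, did not change the outcome of any model comparison.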
Appendix B
Syntactic types of stimuli sentences
Prepositional phrases
Back in the box | case | town
Out of the water | liquid | pudding
Noun phrases
A bowl of cornflakes | biscuits | flowers
A piece of toast | meat | brick
Sentences
You bumped your head | leg | toy
What a funny noise | sound | cup
Let’s have a look | see | think
Have a nice day | hour | meal
It’s time for lunch | soup | drums
Supporting Information
Additional Supporting Information may be found in the online version of this article:
Appendix S1: Principal components analysis for stimuli frequencies
Appendix S2: Checking our models for collinearity
Appendix S3: Obtaining human similarity judgments for evaluating sequence typicality
Please note: Wiley-Blackwell is not responsible for the content or functionality of any supporting materials
supplied by the authors. Any queries (other than missing material) should be directed to the corresponding
author for the article.