Children’s Production of Unfamiliar Word Sequences Is Predicted by Positional Variability and Latent Classes in a
Large Sample of Child-Directed Speech
Danielle Matthews,a Colin Bannardb
aDepartment of Psychology, University of Sheffield; bDepartment of Linguistics, University of Texas at Austin
Received 23 May 2009; received in revised form 18 November 2009; accepted 23 November 2009
Abstract
We explore whether children’s willingness to produce unfamiliar sequences of words reflects their
experience with similar lexical patterns. We asked children to repeat unfamiliar sequences that were
identical to familiar phrases (e.g., A piece of toast) but for one word (e.g., a novel instantiation of
A piece of X, like A piece of brick). We explore two predictions—motivated by findings in the statistical
learning literature—that children are likely to have detected an opportunity to substitute alternative
words into the final position of a four-word sequence if (a) it is difficult to predict the fourth
word given the first three words and (b) the words observed in the final position are distributionally
similar. Twenty-eight 2-year-olds and thirty-one 3-year-olds were significantly more likely to cor-
rectly repeat unfamiliar variants of patterns for which these properties held. The results illustrate
how children’s developing language is shaped by linguistic experience.
Keywords: Cognitive development; Language acquisition; Statistical learning; Syntax; Corpus
analysis; Information theory; Latent classes; Usage-based models of language
1. Introduction
Faced with a stream of speech sounds and gestures, most infants begin to identify the
units of their language and discover the potential for recombining them within the first
2 years. Quite how this is achieved is one of the most challenging questions in cognitive sci-
ence. In the last decade, a very large literature has explored a number of skills that might be
useful. It has been reported that children can use basic ‘‘statistical learning’’ mechanisms to
take such crucial developmental steps as segmenting the input into ‘‘word-like’’ units
Correspondence should be sent to Danielle Matthews, Department of Psychology, University of Sheffield,
Western Bank, Sheffield S10 2TP United Kingdom. E-mail: [email protected]
Cognitive Science 34 (2010) 465–488
Copyright © 2010 Cognitive Science Society, Inc. All rights reserved.
ISSN: 0364-0213 print / 1551-6709 online
DOI: 10.1111/j.1551-6709.2009.01091.x
(e.g., Saffran, Aslin, & Newport, 1996), assigning sounds to ‘‘categories’’ based on their
co-occurrence with other sounds (Gomez & Lakusta, 2004) and identifying nonadjacent
dependencies (Gomez, 2002; Gomez & Maye, 2005). This research has been conducted
using artificial stimuli—sequences of meaning-free sounds from which the infants are able
to extract language-like structure using simple pattern detection. The use of such artificial
stimuli is valuable in isolating specific input characteristics and learning mechanisms. How-
ever, it remains unclear whether these same mechanisms would be at work in a natural
learning context. Natural language is of course far noisier than artificial stimuli and rarely
displays patterns or statistical structure with the same clear consistency. Crucially, while
infants seem to be able to observe patterns in synthetic data from a very young age, it is not
clear that they are able to utilize these skills in communicative contexts until sometime later
in development. There is thus some work to be done to bridge the gap between these
extremely valuable findings and real language development (see Pelucchi, Hay, & Saffran,
2009, and Johnson & Tyler, in press, on word segmentation in natural language).
In this paper, we report on a study that examines children’s grammar learning by per-
forming a statistical analysis of a large sample of real input data and using this to make pre-
dictions about children’s ability to produce particular sequences of words in a sentence
repetition task. The sentence repetition task allows us to test young children, on the cusp of
multiword speech, with a procedure that has been tried and tested by many researchers from
differing theoretical backgrounds (e.g., Bannard & Matthews, 2008; Potter & Lombardi,
1990; Valian & Aubry, 2005). Using real English of course has some disadvantages, namely
that it can be challenging to find sufficient stimuli (where the properties of interest are
uncorrelated) while also controlling for other factors that would be presumed to affect pro-
duction (e.g., word frequency, phonological complexity). However, we think that it is a vital
complement to the artificial grammar learning work, and one of our objectives in this study
is to show that it is possible to control for many potential confounds via computational anal-
ysis of the input data and the use of appropriate methods for statistical analysis of the chil-
dren’s responses.
The aim of this study is to test whether the detailed statistics of the input are reflected in
children’s developing grammatical representations. We asked children to repeat unfamiliar
sequences of words that were identical to familiar phrases but for one word (e.g., a novel
instantiation of a frequent pattern like A piece of X, such as A piece of brick). These variants
were unattested in a large child language corpus and thus likely to be novel to most young
children or, at the least, unpracticed. We hypothesized that children’s ability to repeat such
unattested sequences would reflect their exposure to the relevant pattern in the given lexical
form. We thus rely on the assumption that children build lexically specific representations.
This assumption has been supported in a recent study (Bannard & Matthews, 2008) where
we found that 2- and 3-year-old children were significantly better at repeating the shared
first three words of frequently occurring multiword sequences than matched, infrequent
sequences (e.g., better at repeating ‘‘sit in your’’ when saying ‘‘sit in your chair’’ than when
saying ‘‘sit in your truck’’). It is worth noting that lexical patterns of the kind we are study-
ing here have been given a central role in so-called usage-based theories of development
(e.g., Tomasello, 2003; Goldberg, 2006), where they are sometimes referred to as
‘‘constructions.’’ Because of the long history of the term construction in the linguistic
literature and some minor differences in how the term is applied even within the usage-
based literature, we prefer to use the terms schema or pattern in this article, but we
nonetheless consider the phenomenon we are discussing as entirely consistent with such
an approach.
So how might the statistics of the input affect children’s ability to produce unfamiliar
sequences of words that are similar to well-known phrases? One recurrent idea in the
literature on the learning of linguistic patterns is that children will be affected by what
has been called type frequency. The idea here is that children will identify a pattern in
the input where some invariant structure is combined with a wide range of other mate-
rial. For example, Gomez (2002) found that the ability of 18-month-olds to detect a
nonadjacent dependency between two sounds was predicted by the extent to which the
intervening element was varied in the artificial language they were exposed to. This idea
has also been popular in the study of morphology and its development (e.g., Bybee,
1985; Kempe, Brooks, Mironova, Pershukova, & Fedorova, 2007). Similar mechanisms
have been proposed for the learning of basic lexical patterns of the kind we are discuss-
ing here (e.g., Braine, 1976; Edelman, 2007; Freudenthal, Pine, Aguado-Orea, & Gobet,
2007; Lieven, Pine, & Baldwin, 1997; Pine & Lieven, 1997). Tomasello (2003) has
argued that children form the most basic of productive constructions through a process
of schematization. This is achieved when children hear repeated uses of one form (e.g.,
‘‘Throw’’) along with varied use of another form (e.g., noun phrases referring to what-
ever is thrown: ‘‘Throw the ball,’’ ‘‘Throw teddy,’’ and ‘‘Throw your bottle’’) in similar
contexts. The outcome is a linguistic construction that contains a minimum of one lexi-
cal item and one ‘‘slot’’ (Throw X).
Type frequency can thus be used to quantify how appropriate it is to generalize over a set
of similar utterances. One problem with type frequency, however, is that it does not take
into account the frequency distribution of the words filling a given slot. For example, if a
child hears the sequence ‘‘Throw your bottle’’ 118 times and ‘‘Throw the ball’’ and
‘‘Throw teddy’’ only once each, then we might not expect the same degree of productivity
with a potential ‘‘Throw X’’ construction as if all three sequences had been heard 40 times
each (although the type frequency would have been three in both cases). In the former
‘‘unequal’’ case, the child will always expect to hear ‘‘your bottle’’ after ‘‘throw’’ and thus
might not detect any potential for productivity. In the latter ‘‘equal’’ case, the child will be
uncertain as to which of three possible options will occur and therefore might be more likely
to form a productive slot. The intuitive difference between these situations can be quantified
with a measure of the entropy (Shannon & Weaver, 1949) of the slot, an index of the uncer-
tainty about which of all the possible words that could fill a slot is most likely to occur (see
also Hale, 2006; Keller, 2004; Levy, 2008; Moscoso del Prado Martín, Kostić, & Baayen,
2004; Moscoso del Prado Martín, Kostić, & Filipović-Djurdjević, unpublished data). This
entropy, which we will refer to as slot entropy, can be calculated as follows, where X is a
slot, each x is a word that appears in that slot, and p(x) is the probability of seeing each x in
that position:

H(X) = -\sum_{x \in X} p(x) \log_2 p(x)

In the above example, then, the entropy in the unequal case is 0.14 and in the equal case it
is 1.58.
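The slot entropy calculation for the two hypothetical ''Throw X'' distributions discussed above can be sketched as follows (a minimal illustration using the counts given in the text; these are not corpus values):

```python
import math

def slot_entropy(counts):
    """Shannon entropy (in bits) over the words filling a slot,
    given raw frequency counts for each filler word."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# The "unequal" case: "your bottle" heard 118 times,
# "the ball" and "teddy" once each.
unequal = slot_entropy({"your bottle": 118, "the ball": 1, "teddy": 1})

# The "equal" case: all three fillers heard 40 times each.
equal = slot_entropy({"your bottle": 40, "the ball": 40, "teddy": 40})

print(round(unequal, 2))  # 0.14
print(round(equal, 2))    # 1.58
```

Note that type frequency is three in both cases; only the entropy distinguishes them.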
Following the same reasoning as for type frequency, children should be more competent at
producing an unfamiliar sequence when it is an instantiation of a pattern for which a con-
crete alternative is maximally unpredictable (a pattern with high slot entropy). For example,
given two highly frequent utterances, ‘‘Back in the box’’ and ‘‘Let’s have a look,’’ that dif-
fer in the slot entropy for the final word position (in the corpus we used, the slot entropy for
‘‘Back in the X’’ was 5.31, for ‘‘Let’s have a X’’ it was 1.24), children should be more
likely to accept unfamiliar versions of the sequence that has the greater slot entropy than the
sequence with lower slot entropy (e.g., the unfamiliar sequence ‘‘Back in the town’’ should
be easier to produce than the unfamiliar sequence ‘‘Let’s have a think’’). Thus, the degree
to which children will be willing to extract and utilize an invariant pattern will depend on
the entropy of its slot(s).
We predict, then, that a child will extract a productive pattern (identify a frame and a slot)
where there is high entropy. However, the problem is not as simple as determining where
there is and is not a slot. Children also face the problem of predicting what is allowed to
appear there—forming expectations about not only the exact words seen in a particular
position but also the kind of words to be seen. That is, children should have expectations
concerning whether a given word or phrase will be seen in a particular position based on its
similarity to the words that have been seen there before. Our target sequences were designed
to investigate the effect of latent classes—grouping of similar words—on children’s devel-
oping knowledge. The idea that speakers have knowledge about how words are similar to
other words is of course very widely accepted in linguistic theory—it is the basis for syntac-
tic categories. How exactly they determine this similarity is, however, not so clear. One way
in which words are similar to other words is in the similarity of the words or concepts to
which they are used to refer. However, although we know that human infants are remarkably
good at generalizing across stimuli that are similar (e.g., Shepard, 1987), gauging effects of
semantic similarity is notoriously difficult because of the lack of a widely accepted theory
of mental representation and semantic cognition.
Another way in which words display similarities and dissimilarities is in their distribution
relative to other words (Harris, 1964). Learners also seem to be able to exploit this informa-
tion. For example, it has been shown in an artificial grammar learning study (Gomez &
Lakusta, 2004) that children are able to infer similarity between words from the contexts in
which they occur (see also Monaghan & Christiansen, 2008 for an extensive investigation
of how children might cluster words together using a number of probabilistic phonological
cues).
In this study, we do not attempt to distinguish between these two sources of similarity.
We employ distributional information and operationalize similarity between words by cal-
culating the overlap in their contexts as they occur in a corpus of child-directed speech.
However, we cannot be sure whether this measure is the basis that children use to infer simi-
larity. It has long been acknowledged that distributional and semantic similarity are likely to
be highly intercorrelated, and that words that have similar meanings will occur in similar
contexts (see Landauer & Dumais, 1997 for a broad overview of the distributional approach
to meaning). In this experiment, we are concerned simply with whether the children exploit
the similarity in inferring lexical patterns from the input, and not with the origin of their
detection of that similarity.
Our second prediction, then, is that children will be more likely to detect the potential for
productivity in a four-word sequence and be better at repeating novel instantiations of it
when the relevant position has tended to be filled with (semantically or distributionally) sim-
ilar items. We measure the similarity of the items that have been seen to go into particular
slots by looking at how similar the contexts in which they appear are. For all words found in
our slots we look at the words that occur two words before and two words after the item in a
large corpus of child-directed speech. We record the number of times that each word in the
vocabulary occurs within this window. This then gives us a co-occurrence vector for each
word, with each entry in the vector representing a dimension in a multidimensional space
(where the dimensions are the vocabulary of the language). The similarity between any two
words is then taken to be the cosine of the angle between those two vectors (a value between
0 and 1 with higher values indicating greater similarity). In order to calculate the overall
cohesiveness of a slot (i.e., the homogeneity or the semantic density of the words previously
seen to fill it), we obtained the mean pairwise distance of each word that occurred in that
slot from each other word that occurred there. We call this measure slot semantic density
and calculate it for the final position slot, X, of each sequence containing N different words
as follows:

\mathrm{SemanticDensity}(X) = \frac{1}{N^2 - N} \sum_{x \in X} \sum_{\substack{y \in X \\ y \neq x}} \cos(x, y)
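The pipeline just described (±2-word co-occurrence vectors, cosine similarity, mean pairwise similarity over slot fillers) can be sketched as follows. The three-utterance mini-corpus is invented for illustration; the study used a 1.72-million-word corpus:

```python
import math
from collections import Counter

def cooccurrence_vectors(corpus, targets, window=2):
    """For each target word, count the words occurring within `window`
    positions of it; returns one Counter (sparse vector) per target."""
    vectors = {t: Counter() for t in targets}
    for sent in corpus:
        for i, w in enumerate(sent):
            if w in vectors:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vectors[w][sent[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors (0 to 1)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def semantic_density(vectors):
    """Mean pairwise cosine similarity over the words seen in a slot."""
    words = list(vectors)
    n = len(words)
    total = sum(cosine(vectors[x], vectors[y])
                for x in words for y in words if x != y)
    return total / (n * n - n)

# Invented child-directed mini-corpus (illustration only).
corpus = [s.split() for s in [
    "put it back in the box now",
    "put it back in the case now",
    "it is back in the cupboard again",
]]
vecs = cooccurrence_vectors(corpus, {"box", "case", "cupboard"})
print(semantic_density(vecs))
```

Here ''box'' and ''case'' share identical contexts (cosine 1.0), while ''cupboard'' overlaps only partially, so the density falls between those extremes.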
If children are sensitive to the semantic density of a slot, then they might find it easier to
produce unfamiliar versions of a four-word sequence if the final slot has both high entropy
and high semantic density. For example, given two highly frequent utterances with high slot
entropy, ‘‘Back in the box’’ and ‘‘A piece of toast,’’ that differ in the semantic density for
the final word position (for ‘‘Back in the X’’ the semantic density is 0.63, for ‘‘A piece of
X’’ the semantic density is 0.39), children might be more likely to accept unfamiliar ver-
sions of the sequence that has the greater semantic density than the sequence with lower
semantic density.
Of course, whether such an effect of semantic density holds may depend on the nature of
the final word in the unfamiliar sequence. Thus, variants of ‘‘Back in the box’’ might only
be easy to repeat if the final word is semantically similar to other words attested in that slot
(e.g., containers like ‘‘case’’ or ‘‘fridge’’). In order to test this we selected items that were
semantically similar (‘‘case’’) or dissimilar (‘‘town’’) to words seen in the relevant position
in the corpus (see the Method section for details). We refer to the former kind of word as
semantically ‘‘typical’’ and the latter kind as semantically ‘‘atypical.’’
We thus predicted that unfamiliar sequences would be easier to repeat if they were ver-
sions of a construction with high slot entropy and high semantic density and if the final word
were semantically typical for that slot. Our predictions for these unfamiliar sequences rest
on the expectation that the child should not have often uttered them before (if at all) and that
they should accordingly be processed as generalizations. To further investigate this pro-
posal, we also tested familiar sequences that could in principle be retrieved directly from
memory. Our predictions here were more speculative. We have previously found (Bannard
& Matthews, 2008) that children are better at repeating sequences of words that they have
frequently encountered before, and it is not clear how having formed a generalization over
similar sequences might affect their facility with such familiar instances. One might predict
that highly frequent stored sequences will be unaffected by the presence of related items.
On the other hand, the possibility of generalizations might actually inhibit the production of
familiar word sequences, so that high-frequency items that instantiate low-entropy patterns
might be expected to be more fluently produced than their high-entropy counterparts.1 The
effect of semantic density on well-integrated familiar sequences could also plausibly be ben-
eficial or detrimental, as having many semantically similar neighbors could presumably
either inhibit or enhance production of the sequence (cf. Magnuson, Mirman, & Strauss,
2007). Note that, as the final word of a familiar sequence is likely to be semantically typical,
we did not vary this factor for familiar sequences.
To summarize, in the current study we analyzed the properties (slot entropy and semantic
density) of four-word schemas that had a lexically specified three-word stem plus a final slot
(we henceforth refer to these as schemas) as observed in a large database of British English
child-directed speech. We tested how these properties affected children’s ability to repro-
duce unfamiliar variants of these schemas and checked whether these effects were mediated
by the semantics of the final word in the unfamiliar target. We also checked whether these
same properties would affect the repetition of familiar sequences (although these could not
be fully matched to the unfamiliar sequences for all control variables). We tested children’s
ability to comprehend and produce the 27 sequences given in Table 1 by playing them
recordings and asking them to repeat them.
2. Method
2.1. Participants
Fifty-nine normally developing, monolingual, British English-speaking children were
included in the study (32 boys). There were twenty-eight 2-year-olds (range 2.3–2.10, mean
age 2.7) and thirty-one 3-year-olds (range 3.1–3.7, mean age 3.4). A further 18 children
were tested but not included because of fussiness or inaudible responding. The children
were tested in a university laboratory in the United Kingdom or in a quiet room in their day
care center.
2.2. Materials and design
The stimuli for each child consisted of nine triplets of four-word sequences.
These sequences were selected using a child language corpus, the largest available to us,
containing the speech directed to one child, Brian, between the ages of 2 and 5 years
recorded in Manchester, UK (Max Planck Child Language Corpus: 1.72 million words of
maternal speech). We chose to look at four-word sequences because previous studies have
demonstrated that these are sufficiently long to elicit variance in participants’ performance
in a repetition task (Bannard & Matthews, 2008; Valian & Aubry, 2005). We extracted all
repeated sequences of words from the corpus using the method described in Yamamoto and
Church (2001) and discarded all sequences that formed a question (as children might be
tempted to answer a question rather than repeat it). Applying this filter meant that our most
frequent item was ‘‘I don’t know what’’ which occurred 260 times (a natural log frequency
of 5.56). Our log frequency range was then taken to be 0–5.56.
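As a rough sketch, the extraction and schema-identification steps can be approximated with plain n-gram counting (the suffix-array method of Yamamoto and Church (2001) is far more efficient on a corpus of this size, but for four-word sequences the output is equivalent; the mini-corpus below is invented for illustration):

```python
from collections import Counter, defaultdict

def repeated_fourgrams(utterances):
    """Count all four-word sequences in a corpus and keep those
    that are repeated (occur more than once)."""
    counts = Counter()
    for utt in utterances:
        words = utt.split()
        for i in range(len(words) - 3):
            counts[tuple(words[i:i + 4])] += 1
    return {ng: c for ng, c in counts.items() if c > 1}

def group_by_stem(fourgrams):
    """Group four-grams by their first three words (the schema stem),
    mapping each stem to the frequency distribution over fourth words."""
    schemas = defaultdict(Counter)
    for (w1, w2, w3, w4), c in fourgrams.items():
        schemas[(w1, w2, w3)][w4] += c
    return schemas

# Invented mini-corpus (illustration only).
utts = ["put it back in the box",
        "put it back in the box",
        "back in the case please",
        "now back in the cupboard"]
fourgrams = repeated_fourgrams(utts)
schemas = group_by_stem(fourgrams)
```

The fourth-word distribution held in each `schemas` entry is exactly what the slot entropy and semantic density measures are computed over.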
We next identified all sequences of four words that began with the same first three words
(had the same schema). We identified all schemas for which at least one instantiation was in
the top two-thirds of the log frequency range (so that we would have at least one familiar
example for later use). For each of these schemas, we calculated the slot entropy and slot
semantic density for the fourth word position, as outlined in the introduction. We then
ordered these schemas according to slot entropy and identified items that spanned the range
of observed values.

Table 1
Stimulus sequences and their properties

Sequence | Familiarity | Slot Entropy | Semantic Density | Typicality of Fourth Word
Out of the water | High | 6.17 | 0.58 | Typical
Out of the liquid | Low | 6.17 | 0.58 | Typical
Out of the pudding | Low | 6.17 | 0.58 | Atypical
Back in the box | High | 5.31 | 0.63 | Typical
Back in the case | Low | 5.31 | 0.64 | Typical
Back in the town | Low | 5.31 | 0.64 | Atypical
A piece of toast | High | 5.16 | 0.39 | Typical
A piece of meat | Low | 5.16 | 0.39 | Typical
A piece of brick | Low | 5.16 | 0.39 | Atypical
Have a nice day | High | 4.37 | 0.46 | Typical
Have a nice hour | Low | 4.37 | 0.46 | Typical
Have a nice meal | Low | 4.37 | 0.46 | Atypical
It’s time for lunch | High | 3.78 | 0.40 | Typical
It’s time for soup | Low | 3.78 | 0.40 | Typical
It’s time for drums | Low | 3.78 | 0.40 | Atypical
A bowl of cornflakes | High | 2.83 | 0.37 | Typical
A bowl of biscuits | Low | 2.83 | 0.37 | Typical
A bowl of flowers | Low | 2.83 | 0.37 | Atypical
What a funny noise | High | 2.11 | 0.46 | Typical
What a funny sound | Low | 2.11 | 0.46 | Typical
What a funny cup | Low | 2.11 | 0.46 | Atypical
You bumped your head | High | 2.10 | 0.60 | Typical
You bumped your leg | Low | 2.10 | 0.60 | Typical
You bumped your toy | Low | 2.10 | 0.60 | Atypical
Let’s have a look | High | 1.24 | 0.46 | Typical
Let’s have a see | Low | 1.23 | 0.46 | Typical
Let’s have a think | Low | 1.23 | 0.46 | Atypical

The second key factor that we wish to explore in this paper is the impact
of semantic density, and thus it was important that we cross this with slot entropy in our
stimuli. For this purpose we put the items into bands of high, medium, and low slot entropy
(for the purposes of item selection only; slot entropy was treated as a continuous variable in
all our analyses) and for each band we selected schemas that spanned the range of possible
semantic density values as much as possible. Our need to meet all of these criteria
meant that we had little freedom in selecting the stimuli. Thus, it was not possible to select
schemas of a particular syntactic type or types. The effect on learning that we hypothesize
the factors of slot entropy and semantic density to have might be expected to interact with
the child’s developing knowledge of syntactic types or categories (they might, for example,
expect differing degrees of semantic flexibility for a slot in a noun phrase than in a verb
phrase). Nonetheless, we would predict that their effect should be seen across syntactic
types. We therefore chose to select the items that maximized the spread of our key predictors,
leaving the impact of syntactic type to be considered in our statistical analysis. The distribution
of the items across our key predictor variables can be seen in Fig. 1. Our items reflect multi-
ple syntactic types. One might, for example, divide the stimulus set into prepositional phrases
(back in the X, out of the X), noun phrases (a bowl of X, a piece of X), and sentences (you
bumped your X, what a funny X, let’s have a X, have a nice X, it’s time for X). While cer-
tain syntactic types appear to cluster together here (e.g., the prepositional phrases), there is
no absolute correlation between syntactic type and our factors. We will later explore the
impact of this grouping (which is detailed again in Appendix B for convenience of refer-
ence) in our data analysis.
Fig. 1. Distribution of test items.
Having identified our schemas, we then obtained, for each schema, one familiar sequence
(seen in the corpus with reasonable frequency) and two unfamiliar sequences (plausible
sequences that were nonetheless unseen in our corpus). The familiar sequence was obtained from
the top two-thirds of the overall log frequency range of four word sequences. However, it is
important to note that it was not always the most frequent instantiation of the schema, and
that the schemas were rarely dominated by any one sequence (the highest frequency
instantiations of each of our schemas accounted for a mean of 36% of instances). On average,
our selected high-frequency items accounted for 31% of the instantiations of the schema.
In order to create two unfamiliar items for each schema, we used the WordNet lexical
database v2.1 (Fellbaum, 1998; WordNet is an IS-A hierarchy, in the sense that an apple IS A
fruit, created in the psychology department at Princeton University that represents semantic
relations between English words) to identify one word that was highly similar to the final
word of the selected familiar sequence and one that was semantically dissimilar from it
(all nouns cited in the appropriate sense in The Oxford English Dictionary, 1989). Within
WordNet, our unseen typical words were in all cases a maximum of five nodes away from
the seen words (the threshold on similar pairs proposed by Hirst & St-Onge, 1998). In two
cases, the unseen word was a direct hypernym of the seen word (water => liquid) or vice
versa (noise => sound). In another two cases, the two words were linked by a direct hyper-
nym of both words (box => container <= case; day => time unit <= hour), and in all other
cases except one (lunch => meal => nutriment <= dish <= soup) they were linked via a node
that was an immediate hypernym of one of the pair (e.g., toast => bread => baked goods =>
food <= meat). We refer to the former, similar items as ‘‘typical’’ and the latter as ‘‘atypi-
cal.’’ In order to verify the typicality or otherwise of these words for each given schema, we
obtained human judgments as to their similarity to the words seen in the schema over the
corpus (see Appendix S3 for details). For all but one of the schemas the typical word was
judged to be more similar (on average) to the items seen in the schema over the corpus than
was the atypical word. Pairs of typical and atypical words were matched for their length in
syllables and, as far as possible, their frequencies (see Appendix A).
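The node-distance criterion can be illustrated with a toy IS-A graph and a breadth-first search for path length. The hierarchy below is a hand-coded fragment mirroring the toast/meat example in the text, not WordNet itself; the study used the real WordNet v2.1 graph:

```python
from collections import deque

# Hand-coded IS-A fragment (child -> parent), for illustration only.
hypernym = {
    "toast": "bread",
    "bread": "baked_goods",
    "baked_goods": "food",
    "meat": "food",
}

def isa_distance(a, b, hypernym):
    """Number of edges between two nodes in the (undirected) IS-A graph,
    found by breadth-first search; None if unconnected."""
    adj = {}
    for child, parent in hypernym.items():
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

# toast => bread => baked_goods => food <= meat: four steps,
# within the five-node threshold used for "typical" items.
print(isa_distance("toast", "meat", hypernym))  # 4
```

A word exceeding the threshold from every attested filler would, on this scheme, count as semantically ''atypical'' for the slot.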
As mentioned above, for each of the nine schemas, we attempted to control for differ-
ences in the fourth word frequencies as far as possible (the first three words were identical).
However, it was not possible to match the frequency of the final word, bigram, or trigram of
the unfamiliar items with the familiar items. Similarly it was not possible to control the fre-
quency of component words, bigrams, or trigrams across different schemas. As we would
expect these component frequencies to affect children’s ability to repeat sequences, we fac-
tored their effect out by including them as predictors in all regression models. The 10 fre-
quency counts for each four-word sequence (i.e., the frequency of the four-word sequence
and its four component words, three component bigrams, and two component trigrams) are
given in Appendix A.
To allow us to evaluate the impact of all these separate frequencies without introducing
multicollinearity into our models, we reduced the counts to orthogonal dimensions using
principal components analysis. We did this separately for the familiar and unfamiliar
items as they were intended to be used in separate analyses. We retained all factors with
eigenvalues greater than 1, which left us with four components for the unfamiliar items
(accounting for 95% of the total variance), and three components for the familiar items
(accounting for 93% of total variance). A fuller description of this procedure and a discus-
sion of the loadings for the selected components can be found in Appendix S1.
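This reduction can be sketched with plain numpy: standardize the frequency counts, take the eigendecomposition of their correlation matrix, and retain the components whose eigenvalues exceed 1 (the Kaiser criterion implied by the text). The random counts below are stand-ins for the real frequency table in Appendix A:

```python
import numpy as np

def pca_kaiser(X):
    """Project the standardized columns of X onto the principal components
    of their correlation matrix whose eigenvalues exceed 1.
    Returns (component scores, explained-variance ratios)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]            # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                         # Kaiser criterion
    scores = Z @ eigvecs[:, keep]                # orthogonal components
    return scores, eigvals[keep] / eigvals.sum()

rng = np.random.default_rng(0)
# Stand-in for 18 items x 10 correlated log-frequency counts.
base = rng.normal(size=(18, 3))
X = base @ rng.normal(size=(3, 10)) + 0.3 * rng.normal(size=(18, 10))
scores, ratios = pca_kaiser(X)
print(scores.shape, ratios.sum())
```

The retained scores are mutually uncorrelated, so they can enter a regression together without multicollinearity.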
To summarize, this procedure gave us, for each schema, one familiar (high-frequency)
sequence and two unfamiliar (unseen) sequences. One of the unfamiliar sequences had a
final word that was semantically similar to the familiar item observed in this position in our
corpus (unfamiliar, typical) and the other had a dissimilar final word (unfamiliar, atypical).
The final 27 stimulus sequences and their properties are presented in Table 1.
All sequences were read by a female British English speaker with normal declarative
intonation and recorded in a soundproof booth onto a computer disk with a sampling fre-
quency of 44,100 Hz using SoundStudio v.3 (Freeverse, New York, NY, USA). To ensure
that the first three words of all matched sequences were identical, we took one sequence as a
base and created the matched pair by splicing in the final word using the open-source
Audacity software v.1.2.4. We used randomly selected familiar sequences, unfamiliar typical
sequences, and unfamiliar atypical sequences as bases for a third of the items each.
To ensure that sequences of the same schema type were not encountered in close succes-
sion, test items were presented in three blocks of nine items with each block containing one
of the variants of a schema in one of two fixed orders (one the reverse of the other), such
that each of the three sequences belonging to the same schema was always nine items apart.
All three blocks contained an equal number of familiar and unfamiliar sequences and typical
and atypical items. These blocks were presented in six orders, with order of presentation
counterbalanced for each age group.
2.3. Procedure
The experimenter, E, sat with the child at a table in front of a computer (the child either
sat alone or on a parent’s knee). E produced a picture of a tree with several stars in the
branches and explained they would cover each star with a parrot sticker. E explained that, to
get the stickers, they needed to listen to what the computer would say and then say the same
thing. Every time they did so, part of a cartoon parrot would appear on the computer. Once
they could see the whole parrot (which appeared every three trials), they would get a parrot
sticker. E proposed to have a go first. She then clicked on a mouse to play the first of six
example sequences and repeated the sequence. She repeated this for the next two example
sequences, at which point a full parrot was visible and so E awarded herself a sticker before
offering the child a turn. The final three example sequences were used for the child to prac-
tice the procedure. E helped the child or replayed the practice sound files once each if neces-
sary. Each time the child had attempted to repeat three sequences s/he was given a sticker.
E then played the test sequences in exactly the same manner except that no help was given,
no sound files were replayed, and E did not help the child repeat anything. If the child did
not spontaneously repeat a sequence after a reasonable delay, E prompted the child once
(saying "Can you say that?"). If the child did not then respond, or if anything other than this
prompt came between the stimulus sequence and the repetition, the response was excluded.
Responses were also excluded if the child did not hear the stimulus sequences (e.g., if the
474 D. Matthews, C. Bannard / Cognitive Science 34 (2010)
child spoke unexpectedly as the sound file played). In total, 148 of a possible 1,593
responses were excluded. The procedure continued until all 27 sentences were repeated.
Responses were recorded onto computer disk using Audacity recording software.
2.4. Transcription and error coding
Each word in each sequence was coded for the presence or absence of the errors in
Table 2. (The use of such criteria was found in previous studies to improve coder accuracy
in comparison to a procedure where coders directly coded the accuracy of each whole three-
word stem as correct or incorrect.) If the child did not make a single error on the first three
words of the sequence, this sequence was coded as correctly repeated; otherwise it was
incorrect. We did not consider errors made on the fourth word as our focus here was on the
child’s competence with the schema and we wished to minimize the impact of the phonetic
details of the novel item. If a child did not respond to an item, it was discarded along with
the other items in that schema. Two research assistants blind to the hypotheses of the experi-
ment transcribed and coded all the children’s responses from audio files. Agreement
between these coders was good (Agreement: 82%, Cohen’s kappa = 0.62). A third research
assistant, also blind to the hypotheses of the experiment, checked all cases in which the first
two coders did not yield identical coding for each word, listened to the relevant response,
and resolved the discrepancy.
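The agreement statistics just reported can be computed directly from the two coders' word-level codes. Below is a minimal sketch of Cohen's kappa in Python; the codings are invented for illustration and are not the study's transcription data.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two coders."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement expected from each coder's marginal code frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Invented word-level codes (1 = error present, 0 = absent) for two coders.
coder_a = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
coder_b = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1]
print(round(cohens_kappa(coder_a, coder_b), 2))
```

Raw agreement here is 80%, but kappa discounts the agreement expected by chance given how often each coder used each code, which is why the study reports both figures.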
3. Results
All of the children attempted to repeat the vast majority of items (1,445 observations in
total). The 2-year-olds repeated the first three words of 21% of the unfamiliar sequences and
30% of the familiar sequences correctly. The 3-year-olds repeated the first three words of
49% of the unfamiliar sequences and 54% of the familiar sequences correctly. As noted in
the method, this apparent frequency effect may stem from the frequency of the four-word
sequences or their component words, bigrams, or trigrams (because these counts are highly
correlated).

Table 2
Error codes used for children's responses

Code              Error
Repetition        Whole word or one syllable of the word is repeated.
Deletion          Whole word is missing.
Insertion         Insertion of a word or isolated phonetic material before the target word.
Substitution      Target word substituted for a different word.
Mispronunciation  Target word is missing a phoneme, has a phoneme inserted, or is a
                  morphological variant of the target word (e.g., "bump" instead of
                  "bumped" in "you bumped your head"). Missing phonemes that yielded
                  a pronunciation compatible with adult speech and regional dialect
                  (e.g., "back int box," which is acceptable in northern England) were
                  not scored as errors. The pronunciation of "the" as "de" was also
                  accepted.

We will not discuss frequency effects here but rather include in all models the
four frequency scores derived through principal components analysis (see Appendix S1).
Because of the need to factor out these confounds before the effect of our factors of interest
can be usefully observed, we do not present raw data here.
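The use of principal components as frequency controls can be illustrated in miniature. The sketch below rotates two correlated variables into uncorrelated components via a closed-form 2x2 PCA; the actual analysis (Appendix S1) used four correlated frequency counts, and the log frequencies here are invented.

```python
import math

def principal_components_2d(xs, ys):
    """PCA for two correlated variables: returns scores on the two components."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]  # centered variables
    cy = [y - my for y in ys]
    sxx = sum(a * a for a in cx) / n
    syy = sum(b * b for b in cy) / n
    sxy = sum(a * b for a, b in zip(cx, cy)) / n
    # Rotation angle that diagonalizes the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(cx, cy)]
    pc2 = [-s * a + c * b for a, b in zip(cx, cy)]
    return pc1, pc2

# Invented, highly correlated log frequencies of trigrams and four-word strings.
tri = [5.1, 4.2, 3.9, 6.0, 4.8, 3.5]
quad = [4.9, 4.0, 3.7, 5.8, 4.5, 3.4]
pc1, pc2 = principal_components_2d(tri, quad)
corr_pc = sum(a * b for a, b in zip(pc1, pc2))
print(round(corr_pc, 10))  # components are uncorrelated by construction
```

Entering such components, rather than the raw counts, as predictors avoids the multicollinearity that would otherwise make the individual frequency coefficients uninterpretable.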
To investigate the relationship between correct repetition of the first three words of a
sequence and the factors of current interest, we fitted mixed effects logistic regression mod-
els to the data using Laplace approximation (Baayen, 2008; Baayen, Davidson, & Bates,
2008; Dixon, 2008; Gelman & Hill, 2007; Jaeger, 2008). The outcome variable in all models
was whether the first three words of a sequence were correctly repeated (1) or not (0). Child
(N = 59) was added to all models as a random effect on the intercept in order to account for
individual differences. We also ran models with extra random effects for the nine schema
types and the 27 final words of each sequence, but the variance for these factors was always
extremely low—standard deviation always <0.001. We therefore did not include the schema
and item variables in our reported analyses. Including these random effects did not change
the statistical outcome of the results, and models with item and/or schema included as random
effects provided a substantially poorer fit to the data (a substantially higher AIC score)
than models including our selected fixed effect predictors. Taken together, these findings
indicate that item differences other than the manipulated or controlled variables had minimal
impact on the children's performance. Finally, for all models, we tried introducing the
syntactic type of the schema into the model as a random effect. We discuss the impact of this on
our models below. All noncategorical predictors were centered by calculating the mean for
the variable and subtracting it from each value. In Appendix S2, we report on an extensive
analysis of the relationship between our predictors, looking for sources of multicollinearity,
and suggest that we can be confident in the analyses presented here.
Putting the control variables into our model allowed us to examine the effect of the fol-
lowing manipulated variables:
1. Age (2 or 3 years old)
2. Slot entropy (continuous)
3. Semantic density (continuous)
4. Final word typicality (typical or atypical)
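Of these, slot entropy is the Shannon entropy of the distribution of words attested in the final slot of a schema in the corpus. A minimal sketch with invented counts (the fillers and frequencies are illustrative, not drawn from the child-directed speech corpus):

```python
import math
from collections import Counter

def slot_entropy(final_word_counts):
    """Shannon entropy (in bits) of the final-slot filler distribution."""
    total = sum(final_word_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in final_word_counts.values())

# Invented counts of fillers seen after a three-word stem like "a piece of".
varied_slot = Counter({"toast": 5, "paper": 5, "cake": 5, "cheese": 5})
fixed_slot = Counter({"toast": 19, "paper": 1})

print(slot_entropy(varied_slot))  # high entropy: ending hard to predict
print(slot_entropy(fixed_slot))   # low entropy: ending near-certain
```

The prediction under test is that children repeat unfamiliar variants of a schema more accurately when this value is high, that is, when the input gives them weak expectations about how the sequence ends.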
The principal question of interest is whether these factors affect children’s ability to
repeat unfamiliar sequences. We therefore first fitted a model to the repetition data for the
novel sequences. We added each of these variables to the model in order to examine their
predictive value over and above our controls. We use likelihood ratio tests to compare
nested models and Akaike’s information criterion (AIC) values to compare nonnested mod-
els. We also report McFadden’s log-likelihood ratio index (LLRI; McFadden, 1974) as a
measure of the practical significance of the differences between models.2 First of all, age
was found to lead to a significant improvement in fit when added to a model with only our
controls as predictors (χ²(1) = 23.3, p < .0001, LLRI = 0.021). Further, adding slot entropy
again substantially improved the fit of the model (χ²(1) = 8.25, p < .005, LLRI = 0.008).
Adding semantic density to a model containing our controls and age did not lead to a
significant improvement in fit (χ²(1) = 1.15, p = .284, LLRI = 0.001), and the composite
model had a higher AIC than the model including the controls, age, and slot entropy,
indicating that slot entropy has greater predictive value than semantic density. However, a
model including the controls, age, slot entropy, and semantic density had a significantly
better fit than a model containing only the controls, age, and slot entropy (χ²(1) = 4.6,
p < .05, LLRI = 0.004), indicating that semantic density does have predictive value (once
slot entropy is accounted for) and accounts for additional variance over and above that
accounted for by slot entropy. The addition of the typicality of the test item as a predictor
offered no significant improvement in fit over a model that contained only the controls and
age (χ²(1) = 0.31, p = .578, LLRI < 0.001). Similarly, it did not improve fit for models that
additionally contained slot entropy (χ²(1) = 0.58, p = .455, LLRI < 0.001), semantic
density (χ²(1) = 0.25, p = .614, LLRI < 0.001), or both (χ²(1) = 0.54, p = .462, LLRI <
0.001), indicating that it had no predictive value. We similarly found that including the
human typicality ratings (in Appendix S3) as a continuous predictor gave no improvement
in fit when added to a model containing the controls plus age (χ²(1) = 0.17, p = .676,
LLRI < 0.001) or when we additionally added slot entropy (χ²(1) = 0.67, p = .41, LLRI <
0.001), semantic density (χ²(1) = 0.29, p = .59, LLRI < 0.001), or both (χ²(1) = 1.59,
p = .21, LLRI = 0.002).
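The model-comparison logic used throughout (a likelihood ratio test for nested models, with McFadden's LLRI as an effect-size measure) can be sketched as follows. The log-likelihood values are placeholders chosen only to land on the same scale as the statistics reported above; they are not the fitted values from our models.

```python
import math

def lr_test_1df(loglik_null, loglik_full):
    """Likelihood ratio test for nested models differing by one parameter."""
    chi2 = 2 * (loglik_full - loglik_null)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x/2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def mcfadden_llri(loglik_null, loglik_full):
    """McFadden's log-likelihood ratio index (a partial pseudo-R^2)."""
    return 1 - loglik_full / loglik_null

# Placeholder log-likelihoods: a controls-only model vs. one adding a predictor.
ll_null, ll_full = -550.0, -545.875
chi2, p = lr_test_1df(ll_null, ll_full)
print(round(chi2, 2), p < .005, round(mcfadden_llri(ll_null, ll_full), 4))
```

A predictor is retained when this test is significant; for the non-nested comparisons in the text, AIC values are compared instead.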
In Table 3, we report on the parameters of a model (model 1) that contained all controls
and experimental variables. This had a significantly better fit to the data than a baseline
model that included only the random effect of child, the control principal components, and
age as predictors (χ²(3) = 13.36, p = .004, LLRI = 0.013). For this model, the estimated
intercepts for the children varied with a standard deviation of 1.03. Age, slot entropy, and
semantic density were all significant (positive) predictors, whereas typicality (included here
as a categorical value; the same pattern was obtained when including the mean human
judgments) was not. These results reflect the fact that 2-year-olds were more likely to make
errors than 3-year-olds and that schemas with higher slot entropy and higher semantic
density were more likely to be correctly repeated.

Table 3
Fixed effects in model 1 fitted to data for unfamiliar sequences

                      B    HPD Lower  HPD Upper    SE      Z    p-Value
(Intercept)       -0.67      -1.28      -0.09    0.28  -2.37    .018
Frequency PC1     -0.11      -0.43       0.14    0.14  -0.81    .42
Frequency PC2      0.31       0.09       0.55    0.12   2.69    .007
Frequency PC3      0.99       0.45       1.52    0.26   3.74    <.001
Frequency PC4     -0.09      -0.30       0.13    0.10  -0.85    .397
Age                0.84       0.54       1.22    0.16   5.27    <.001
Slot entropy       1.00       0.41       1.57    0.29   3.44    <.001
Semantic density   0.23       0.02       0.45    0.11   2.17    .030
Typicality        -0.12      -0.44       0.19    0.16  -0.75    .455

Note. Concordance between the predicted probabilities and the observed responses,
C = 0.838. Somers' Dxy (rank correlation between predicted probabilities and observed
responses) = 0.676 (cf. Baayen, 2008).

They are thus consistent with the predictions
that ability to reproduce unseen forms will be greater when (a) children have less specific
expectations about what should occur in the final word position and (b) the items previously
attested in the final word position are more semantically homogeneous. In addition to our
estimated maximum-likelihood parameters, we also report on a Bayesian analysis (as rec-
ommended by Baayen et al., 2008) in which we approximate the full posterior distribution
using Gibbs sampling. All model parameters were sampled from normal distributions with
noninformative priors (see section 17.4 of Gelman & Hill, 2007, for BUGS code for a simi-
lar mixed-effects logistic regression model). We show the lower and upper bounds of the
95% highest posterior density (HPD) intervals for each model parameter. This interval covers
95% of the posterior probability and provides a measure of uncertainty. That this interval
does not cross 0 for age, slot entropy, or semantic density gives us further confidence that
they are useful positive predictors of repetition performance.
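The HPD interval itself is straightforward to obtain from posterior samples: to sampling accuracy, it is the shortest interval containing 95% of the draws. A stdlib-only sketch using synthetic draws rather than our actual Gibbs output:

```python
import math
import random

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior samples."""
    s = sorted(samples)
    k = math.ceil(mass * len(s))  # draws each candidate interval must contain
    # Slide a window of k consecutive sorted draws; keep the narrowest one.
    i = min(range(len(s) - k + 1), key=lambda j: s[j + k - 1] - s[j])
    return s[i], s[i + k - 1]

random.seed(1)
# Synthetic posterior for a positive slope, roughly on the scale of slot entropy.
draws = [random.gauss(1.0, 0.3) for _ in range(20000)]
lo, hi = hpd_interval(draws)
print(round(lo, 2), round(hi, 2))
```

An interval that excludes 0, as here, is what licenses treating a predictor as reliably positive.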
To test for possible interactions between experimental factors, we ran a more complex
variant of model 1 adding all two-way interactions between age, slot entropy, semantic den-
sity, and typicality, again fitting the model to the data for the low-frequency sequences. This
model was not a significant improvement on model 1 (χ²(6) = 8.44, p = .208, LLRI =
0.008) and did not reveal any significant interactions. Simpler variants of model 1, adding
only the interaction between either age and slot entropy or age and semantic density, also did
not give any significant improvement in fit over model 1 or reveal any significant interac-
tions, suggesting that children from both age groups were similarly affected by these factors.
Finally, we wanted to explore what impact the syntactic type of the frame might have.
We did this by adding syntactic class into our model as a random effect on the intercept,
using the classification found in Appendix B. Adding this to a baseline model including
only the control variables and age resulted in a significant improvement in fit (χ²(1) = 5.1,
p < .024, LLRI = 0.005). However, this model had a higher AIC value than model 1 (indicating
that model 1 offers a better fit to the data). Furthermore, a model with child and syntactic
type as random effects on the intercept plus control variables, age, slot entropy,
semantic density, and typicality as fixed effects gave a significant improvement in fit over a
model including child and syntactic class as random effects with only control variables and
age as fixed effects (χ²(3) = 8.1, p < .05, LLRI = 0.008). Revealingly, when we added
syntactic class as a fixed effect to model 1 there was no improvement in fit (χ²(1) = 1.54,
p = .22, LLRI = 0.001), suggesting that the variance accounted for by syntactic class is a
subset of that accounted for by our predictors. In summary, our predictors were seen to have
significant explanatory value over and above that provided by the pooling of variance by
syntactic class and the analysis offers strong support for the view that they apply across
phrases of different syntactic types.
Having considered how the properties of a four-word sequence affect the repetition of
unfamiliar sequences, an additional question of interest is whether slot entropy and semantic
density also affect the production of highly familiar word sequences. As it is very
difficult to predict whether high-frequency items would benefit or not from high entropy
(see Introduction), this analysis was more exploratory. We again investigated the value of
the various predictors via model comparison. Adding age to a model including only the
controls again resulted in a significant improvement in fit (χ²(1) = 16.59, p < .0001, LLRI =
0.029). Unlike for the unfamiliar sequences, adding slot entropy to the model including the
controls and age resulted in no improvement in fit (χ²(1) = 1.54, p = .214, LLRI = 0.003).
However, the addition of semantic density to the model did result in a significant
improvement in fit (χ²(1) = 9.28, p < .005, LLRI = 0.016). Unlike for the unfamiliar
sequences, this did not depend on the inclusion of slot entropy. A model containing slot
entropy and semantic density in addition to the controls plus age offered no improvement in
fit over one including the controls, age, and semantic density alone (χ²(1) = 0.19,
p = .660, LLRI < 0.001).
Table 4 reports on the parameters for a model (model 2) fitted to the data for the
high-frequency items with all the predictors included (except of course typicality, which
was not varied for high-frequency sequences). Age was again a significant positive
predictor, with 2-year-olds being more likely to make mistakes in repetition. Semantic
density was found to be a
significant positive predictor, meaning that children were more likely to successfully repro-
duce a high-frequency sequence if the words that are typically seen in the last position of
the schema are highly similar. Slot entropy was not found to be a significant predictor. A
model including two-way interactions between age, slot entropy, and semantic density was
not found to be an improvement over model 2 (χ²(3) = 3.88, p = .275, LLRI = 0.007), and
no interactions were found to be significant. The same applied for simpler models including
any combinations of two-way interactions together or in isolation. We again also performed
a Markov chain Monte Carlo analysis and report HPD intervals for the model's parameters.
Finally, we again wanted to explore what impact the syntactic type of the frame might
have. We did this once more by adding syntactic class into our model as a random effect on
the intercept, using the classification found in Appendix B. Adding this to a baseline model
including only the control variables and age did not result in a significant improvement in fit
(χ²(1) = 0.008, p = .931, LLRI < 0.001). Adding syntactic class to model 2 as a random
effect resulted in no change in fit. Furthermore, model 2 had a much smaller AIC score
(569.6) than a model containing only the controls, age, and syntactic class as a random
effect (576.2). Thus, unlike for the unfamiliar sequences, the syntactic class of the sequence
seemed to have no effect on the children’s ability to produce the sequence.
Table 4
Fixed effects in model 2 fitted to data for familiar sequences

                      B    HPD Lower  HPD Upper    SE      Z    p-Value
(Intercept)       -0.46      -0.85      -0.12    0.17  -2.74    .006
Frequency PC1     -0.47      -0.75      -0.21    0.13  -3.56    <.001
Frequency PC2      0.54       0.32       0.80    0.11   4.81    <.001
Frequency PC3     <-0.01     -0.48       0.47    0.24   0.05    .969
Age                0.72       0.37       1.15    0.17   4.30    <.001
Slot entropy       0.11      -0.38       0.64    0.25   0.45    .652
Semantic density   0.38       0.68       1.53    0.13   2.88    .004

Note. Concordance between the predicted probabilities and the observed responses,
C = 0.845; Somers' Dxy = 0.689.
4. Discussion
The current experiment set out to test whether the distributional properties of simple
four-word schemas (as estimated using a large corpus of child-directed speech) would affect
how accurately unfamiliar versions of the schemas will be repeated by young children. One
prediction was that the less certain a child is as to the way a sequence will end given the
statistics of maternal input (the greater the slot entropy), the more likely he or she will be to
form a basic generalization and hence the easier he or she would find it to produce an unfa-
miliar sequence. This indeed appears to be the case. Children in both age groups were better
able to reproduce unfamiliar sequences with higher slot entropy. The semantic properties of
a slot also affect ease of repetition of unfamiliar sequences. The more semantically similar
the items that are likely to have been previously heard in a slot, the easier it was for children
to repeat an unfamiliar variant of that schema. The patterns used in our experiment spanned
syntactic phrase types, and we found in our statistical analysis that slot entropy and semantic
density had predictive value over and above syntactic class, suggesting that they affect
learning across phrase types.
In contrast to our predictions, we observed no effect of the typicality of the final word in
the unfamiliar sequences (assessed using both a categorical distinction based on the Word-
Net hierarchy and human judgments) and no interaction between the semantic density of the
slots in our schemas and the typicality of our items (suggesting that producing a sequence
that ended in a word that did not fit the semantics of the slot was apparently no harder if that
slot was semantically very constrained). As an anonymous reviewer pointed out, this finding
can be seen as consistent with a construction-based approach. That is, while the properties
of the elements seen in a slot should affect the identification of a schema ⁄ construction at the
point of learning, once a construction has been created, an open slot in a good construction
should be able to take any word of the appropriate category. While such an explanation is
plausible, we hesitate to explain our findings in this way. We see any sharp distinction
between patterns in language that are constructions and those that are not that might be
implied in the usage-based literature as an idealization for descriptive convenience rather
than a strong claim about mental representation. We prefer to think of the learner as identi-
fying very many patterns in the input which continue to compete for utilization, with the
specific distributional properties of a pattern remaining an important part of the
representation, rather than being discarded once a decision has been made to put a given
schema "in the grammar." Additionally, we suspect that the lack of a typicality effect can be
explained by aspects of our study design, as we now discuss.
Our measure of typicality was based on evaluating the similarity of individual words not
seen in the schema to individual words seen in the schema without considering the impact
of context. It is possible, then, that our manipulation of typicality was not effective because,
in creating items whose meaning matched particular observed items, we introduced a degree
of unnaturalness to the "typical" stimuli that may have disguised any effects. In finding
matches, we were also forced to use low-frequency words in some cases, so it could be that
the children were not that familiar with the semantics and semantic relatedness of some of
our items (e.g., the similarity of "box" to "case" and "water" to "liquid" might not be
transparent for a 2-year-old). Alternatively, in judging typicality or atypicality with refer-
ence to the particular familiar item used, we may not have picked up on confounding simi-
larities and distances from other words that can appear in the schemas. In brief, much
further work is required to develop developmentally plausible measures of semantics before
we can draw any firm conclusions about typicality effects.
Further work on semantics would also allow us to clarify the beneficial effect of semantic
density, which is potentially controversial. In the terminology of the usage-based tradition,
slot formation can be seen as an instance of category formation on the basis of functionally
based distributional analysis (cf. Tomasello, 2003, p. 124). That is, children should have
expectations concerning what words or phrases they are going to see in a particular position
based on the functions of the words that have been seen there before. This led us to predict a
positive effect of semantic density (and typicality). On the other hand, many theorists have
proposed that semantic openness would benefit productivity (e.g., Bybee, 1995; Goldberg,
2006). Our current findings of an effect of semantic density but not of typicality or of an
interaction between the two do not sit easily with either account. Further investigation will
be required to pull this apart, but the current results certainly suggest that a degree of seman-
tic coherence aids repetition even in the absence of a semantic link between the target sen-
tence and the construction semantics.
While the main focus of our study was on generalization and hence on children’s repeti-
tion of the unfamiliar items, we also asked children to repeat a single instantiation of each
schema that occurred with some frequency in our corpus and hence with which we could
expect the children to be familiar. The purpose of this was to investigate whether schema
properties affect processing even in circumstances where a sequence could in principle be
retrieved directly from memory. There was no effect of slot entropy on the repetition of
familiar items, suggesting that the children are employing a different route to production for
such items. We did, however, observe an effect of semantic density on the repetition of
familiar items, which would suggest the opposite. Further work will be required to clarify
why we see an effect of one factor but not the other. As we noted in the introduction and
explain further in the method, it could simply be that the relationship between the frequency
of the familiar string and the entropy of the schema is not a straightforward one in our stim-
uli. Further testing with more items would of course help to clarify this, but, as we now dis-
cuss, expanding our list of items is not straightforward.
In the current study, we were able to identify items that were dispersed over the range of
slot entropy and semantic density values. However, doing so left us little freedom in choos-
ing items. The fact that many of the factors that are considered to contribute to children’s
language learning are difficult to isolate in this way is not only a practical problem. It also
shows how different factors overlap in the input (sometimes supporting one another and
sometimes conflicting with one another), and thus it emphasizes the gap between the kind of
idealized problems children face in artificial grammar learning experiments and those
children face in learning language. Bridging this gap will almost certainly require
conducting more experiments of the current variety. Doing so will allow us to investigate
phonological, syntactic, and semantic factors that we were not able to control in the current
study.
A final limitation of the current study stems from the nature of the repetition task. It is
usually assumed that when asked to repeat an utterance children analyze the utterance and
then generate it as they would in ordinary speech. The task thus draws on comprehension
and production skills in turn. Failure to repeat the utterance might be due to difficulty
understanding it, difficulty articulating it, or both. Complementary methodologies are
required to further clarify when in processing the effects we report take hold. Alternative
methods would also allow us to explore the task specificity of the current effects. For
example, it could be that the test situation leads the child to be more conservative or more
careful than the child would be in normal speech, which might explain, for example, our
failure to find an effect of typicality.
So what are the broader implications of the current study for language learning? In previ-
ous work (Bannard & Matthews, 2008) we have shown that sequences of sounds that are
heard with little variation in the input are likely (as predicted by the many findings in the
word segmentation literature) to be identified as units of language that are candidates for
words or holophrases, with direct reuse of such sequences from the input being preferred
where available and frequent. In the present paper, we have shown that if such sequences
occur with some points of variation then the possibility of forming productive morpho-syn-
tactic slots arises and becomes more likely if slot fillers form coherent categories. Unfamil-
iar sequences that match resulting, partially abstract schemas will be processed more
fluently (cf. Buchner, 1994; Pothos, 2007, for similar effects of fluency in artificial grammar
learning). This proposal is in line with a growing literature on ‘‘variation sets’’—successive
utterances in child-directed speech that have partial lexical overlap (Kuntay & Slobin, 1996;
Onnis, Waterfall, & Edelman, 2008). These studies suggest the effects observed in the cur-
rent study arise because many of the three-word stems will have occurred in variant forms
in quick succession in the input.
The processes of learning we have sketched here are arguably most consistent with con-
structivist approaches to language development (e.g., Edelman, 2007; Goldberg, 2006; Tom-
asello, 2003). On such accounts grammatical development occurs in a piecemeal fashion
with early knowledge consisting of sequences of words taken directly from the input with
limited generalization across forms. In the present study, we have provided evidence that
children’s ability to produce novel sequences of words can be predicted from their previous
experience with overlapping sequences, and that this holds for 3-year-olds as for 2-year-
olds. We note, however, that this does not rule out the likely possibility that children this
young might be quite adept at producing syntactic structures even in the absence of exposure
to many directly overlapping forms. Rather this finding demonstrates that children are sensi-
tive to statistical regularities in their language that are plausibly relevant to learning about
syntactic structure. We find this question of learnability more interesting than the question
of when precisely children show abstraction of a given syntactic structure (cf. Pulvermuller
& Knoblauch, 2009 for a recent attempt at a neurally plausible account of the acquisition of
a simple combinatorial grammar where abstraction and learnability sit happily together).
We should also note that even if highly abstract syntactic structures are in principle avail-
able to the child, it is not obvious that the child should prefer to store or use them. Indeed
we would suggest that lexically specific representations are unlikely to be just a ladder to
abstract syntax, to be kicked away once learning is complete. Rather, they might be
expected to form part of any rational agent’s model of the language he or she is trying to
learn. A rational learner will want to find the model that assigns a high probability to the
exact data observed and reduces the probability of other possible sets of data (see chapter 28
of Mackay, 2003 for a detailed Bayesian approach to model comparison of this kind).
Abstract models by their very nature are less tied to particular data and can be used to gener-
ate a larger set of possible language. All else being equal, we would expect a rational learner
to reuse the input as much as possible even once he or she has acquired additional compe-
tence. Although it is by no means clear whether a psychologically plausible model of lan-
guage learning will reveal children to be rational in this sense, this might explain why we
see these kinds of lexically specific representations relatively late in development even once
more abstract representations can be expected to have emerged.
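The rational learner argument can be made concrete with a toy comparison: score a lexically specific model and a more abstract one by the probability each assigns to the same observed data. (A full Bayesian treatment would integrate over parameters, as in MacKay's treatment of model comparison; all strings and probabilities below are invented for illustration.)

```python
# Toy data: ten observed tokens of a four-word pattern.
observed = ["a piece of toast"] * 8 + ["a piece of cake"] * 2

def data_probability(model, data):
    """Probability a model assigns to the exact sequence of observations."""
    p = 1.0
    for item in data:
        p *= model.get(item, 0.0)
    return p

# Lexically specific model: probability mass matched to the attested strings.
specific = {"a piece of toast": 0.8, "a piece of cake": 0.2}
# Abstract model: mass spread evenly over ten licensed completions.
fillers = ["toast", "cake", "paper", "cheese", "bread",
           "string", "wood", "brick", "chalk", "cloth"]
abstract = {f"a piece of {w}": 1 / len(fillers) for w in fillers}

print(data_probability(specific, observed) > data_probability(abstract, observed))
```

The specific model assigns far higher probability to the observed data even though the abstract one generates more strings, which is the sense in which a rational learner has reason to retain lexically specific representations alongside more abstract ones.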
So what linguistic theories might account for our data? The idea that speakers store and
use sequences of specific words has been acknowledged by all models of syntax and is not
exclusive to usage-based accounts. All theories, after all, need to account for the presence in
language of idiomatic phrases. Where theories differ is in how such phrases fit into their
account. Early generative accounts regarded idioms as simply an extension of a lexicon that
was very much separate from the core grammatical processes, and they were argued to
obtain meaning "in the manner of a lexical item rather than as a projection from the
meanings of its constituents in the manner of compositional complex constituents..." (Katz,
1973, p. 358). It has come to be acknowledged that the kind of phrases that reoccur with
frequency and that appear not to be the result of a fully abstract generative process is rather
larger than earlier theorists had supposed (Jackendoff, 1995). Furthermore, the distinction
between grammar and lexicon has come to be regarded as unsustainable in many contempo-
rary generative models where information about how words can combine is a part of lexical
entries, with composition occurring via uniform operations (e.g., Bresnan, 2001; Croft,
2001; Goldberg, 1995; Pollard & Sag, 1994; I. A. Sag, unpublished data; Steedman, 2000).
There has been a growing awareness that multiword sequences interact with syntactic and
semantic phenomena in a way that makes a dual-route model in which they are stored sepa-
rately untenable (e.g., Nunberg, Sag, & Wasow, 1994), and word sequences have come to
be acknowledged as integrated with core grammatical processes (e.g., Culicover & Jackend-
off, 2005; Jackendoff, 2002).
While our findings are incompatible with an account of syntactic competence that draws
a strict distinction between memory-based processing at the word level and procedural
processing for grammar (Ullman, 2001), they could, it seems, be accounted for by any model
of syntax in which sequence-specific processing is given a role. However, it is important to
note that accounting for the behavior observed here requires any such theory to be somewhat
liberal in deciding which sequences will be stored. The integration of sequence- or
construction-level representations and processes into theories of grammatical competence has
been motivated by the observation that there are sentences that cannot otherwise be
accounted for. Arguments for this have tended to be based on the syntactic or semantic
nature of the phrase and its incompatibility with general compositional or productive
processes. We see no reason
to believe that the patterns which we use in our study are syntactically or semantically
idiosyncratic. The explanation for the children’s having pattern-specific representations
seems rather to be purely a matter of their distribution. This fact is easiest to accommodate
within a usage-based approach, where linguistic knowledge is made up of pairings of function
with form at any point on the lexically specific/syntactically abstract continuum.
D. Matthews, C. Bannard / Cognitive Science 34 (2010) 483
If we agree that constructions have some psychological primacy across the life span, then
this study makes a contribution in suggesting what factors would lead to their identification.
However, regardless of how we want to characterize the end point of learning, the results
here favor the acceptance of a model of syntactic competence in which lexically specific
processing plays a substantial role at the ages of 2 and 3.
Notes
1. This prediction is complicated by the fact that familiar sequences may vary in the
degree to which they are an expected completion of a known pattern. For some items
lower entropy might be especially beneficial. This would be the case if one had a
strong expectation about what would come next and the high-frequency sequence
fulfilled that expectation. However, it is possible that, although highly frequent, some
items would not be the most expected for a child, and for these there may be a degree
to which higher entropy is better.
2. The log likelihood ratio index indicates the proportion of the variance explained by
the more complex model that is accounted for by the predictors of interest. It can be
interpreted as a partial pseudo-R2 value (see Veall & Zimmermann, 1996).
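One way to read this index concretely is as the share of the full model's log-likelihood improvement over an intercept-only model that comes from the predictors of interest. The sketch below is an illustration of that reading only; the function name and the log-likelihood values are hypothetical, not the authors' code or data.

```python
def ll_ratio_index(ll_null, ll_reduced, ll_full):
    # Improvement contributed by the predictors of interest (present in
    # the full model but absent from the reduced one), as a share of the
    # full model's total improvement over the intercept-only (null) model.
    return (ll_full - ll_reduced) / (ll_full - ll_null)

# Hypothetical log likelihoods for three nested models:
print(ll_ratio_index(ll_null=-500.0, ll_reduced=-460.0, ll_full=-440.0))  # prints 0.3333333333333333
```

On this reading the index, like McFadden's pseudo-R2, runs from 0 (the predictors of interest add nothing beyond the reduced model) to 1 (they account for all of the full model's improvement).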
Acknowledgments
The authors would like to thank Jess Butcher, Ellie O’Malley, Manuel Schrepfer, and
Elizabeth Wills for help in data collection and coding; Harald Baayen and Roger Mundry for
statistical advice; and Bruno Estigarribia, Adele Goldberg, and Julian Pine for helpful
comments on the manuscript. This research was supported by postdoctoral fellowships
awarded to both authors by the Max Planck Institute for Evolutionary Anthropology, Leipzig.
References
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge,
England: Cambridge University Press.
Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for
subjects and items. Journal of Memory and Language, 59(4), 390–412.
Bannard, C., & Matthews, D. E. (2008). Stored word sequences in language learning: The effect of familiarity on
children’s repetition of four-word combinations. Psychological Science, 19(3), 241–248.
Braine, M. (1976). Children’s first word combinations. Monographs of the Society for Research in Child
Development, 41(1), Serial No. 164.
Bresnan, J. (2001). Lexical-functional syntax. Malden, MA: Blackwell.
Buchner, A. (1994). Indirect effects of synthetic grammar learning in an identification task. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 20(3), 550–566.
Bybee, J. (1985). Morphology: A study of the relation between meaning and form. Amsterdam: John Benjamins.
Bybee, J. (1995). Regular morphology and the lexicon. Language and Cognitive Processes, 10(5), 425–455.
Croft, W. (2001). Radical construction grammar: Syntactic theory in typological perspective. Oxford, England:
Oxford University Press.
Culicover, P. W., & Jackendoff, R. (2005). Simpler syntax. Oxford, England: Oxford University Press.
Dixon, P. (2008). Models of accuracy in repeated-measures designs. Journal of Memory and Language, 59(4),
447–456.
Edelman, S. (2007). Behavioral and computational aspects of language and its acquisition. Physics of Life
Reviews, 4, 253–277.
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Freudenthal, D., Pine, J. M., Aguado-Orea, J., & Gobet, F. (2007). Modelling the developmental pattern
of finiteness marking in English, Dutch, German and Spanish using MOSAIC. Cognitive Science, 31,
311–341.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge,
England: Cambridge University Press.
Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure. Chicago:
University of Chicago Press.
Goldberg, A. E. (2006). Constructions at work: The nature of generalization in language. Oxford, England:
Oxford University Press.
Gomez, R. L. (2002). Variability and detection of invariant structure. Psychological Science, 13(5), 431–436.
Gomez, R. L., & Lakusta, L. (2004). A first step in form-based category abstraction by 12-month-old infants.
Developmental Science, 7(5), 567–580.
Gomez, R. L., & Maye, J. (2005). The developmental trajectory of nonadjacent dependency learning. Infancy,
7(2), 183–206.
Hale, J. (2006). Uncertainty about the rest of the sentence. Cognitive Science, 30(4), 609–642.
Harris, Z. (1964). Distributional structure. In J. Fodor & J. Katz (Eds.), The structure of language: Readings in
the philosophy of language (pp. 33–49). Englewood Cliffs, NJ: Prentice Hall.
Hirst, G., & St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of
malapropisms. In C. Fellbaum (Ed.), WordNet: An electronic lexical database (pp. 305–332). Cambridge,
MA: MIT Press.
Jackendoff, R. (1995). The boundaries of the lexicon. In M. Everaert, E. Van der Linden, A. Schenk, &
R. Schreuder (Eds.), Idioms: Structural and psychological perspectives (pp. 133–165). Hillsdale, NJ:
Lawrence Erlbaum Associates.
Jackendoff, R. (2002). Foundations of language: Brain, meaning, grammar, evolution. Oxford, England: Oxford
University Press.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit
mixed models. Journal of Memory and Language, 59(4), 434–446.
Johnson, E. K., & Tyler, M. D. (in press). Testing the limits of artificial language learning. Developmental
Science.
Katz, J. (1973). Compositionality, idiomaticity and lexical substitution. In S. Anderson & P. Kiparsky (Eds.),
A Festschrift for Morris Halle (pp. 392–409). New York: Holt, Rinehart and Winston.
Keller, F. (2004). The Entropy Rate Principle as a predictor of processing effort: An evaluation against
eye-tracking data. Paper presented at Empirical Methods in Natural Language Processing (EMNLP),
Barcelona.
Kempe, V., Brooks, P. J., Mironova, N., Pershukova, A., & Fedorova, O. (2007). Playing with word endings:
Morphological variation in the learning of Russian noun inflections. British Journal of Developmental
Psychology, 25(1), 55–77.
Kuntay, A. C., & Slobin, D. (1996). Listening to a Turkish mother: Some puzzles for acquisition. In D. Slobin,
J. Gerhardt, A. Kyratzis, & T. Guo (Eds.), Social interaction, social context and language: Essays in honor of
Susan Ervin-Tripp (pp. 265–286). Hillsdale, NJ: Erlbaum.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of
acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211–240.
Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126–1177.
Lieven, E. V. M., Pine, J. M., & Baldwin, G. (1997). Lexically based learning and early grammatical
development. Journal of Child Language, 24, 187–219.
MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge, England:
Cambridge University Press.
Magnuson, J. S., Mirman, D., & Strauss, T. (2007). Why do neighbors speed visual word recognition but slow
spoken word recognition? 13th Annual Conference on Architectures and Mechanisms for Language Processing.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers
in econometrics (pp. 105–142). New York: Academic Press.
Monaghan, P., & Christiansen, M. H. (2008). Integration of multiple probabilistic cues in syntax acquisition. In
H. Behrens (Ed.), Corpora in language acquisition research (pp. 139–163). Amsterdam: John Benjamins.
Moscoso del Prado Martín, F., Kostić, A., & Baayen, H. (2004). Putting the bits together: An information
theoretic perspective on morphological processing. Cognition, 94(1), 1–18.
Nunberg, G., Sag, I. A., & Wasow, T. (1994). Idioms. Language, 70, 491–538.
Onnis, L., Waterfall, H. R., & Edelman, S. (2008). Learn locally, act globally: Learning language from variation
set cues. Cognition, 109(3), 423–430.
The Oxford English Dictionary. (1989). Available at http://www.oed.com.
Pelucchi, B., Hay, J. F., & Saffran, J. (2009). Statistical learning in a natural language by 8-month-old infants.
Child Development, 80(3), 674–685.
Pine, J. M., & Lieven, E. V. M. (1997). Slot and frame patterns and the development of the determiner category.
Applied Psycholinguistics, 18, 123–138.
Pollard, C., & Sag, I. A. (1994). Head-driven phrase structure grammar. Chicago: University of Chicago
Press.
Pothos, E. M. (2007). Theories of artificial grammar learning. Psychological Bulletin, 133(2), 227–244.
Potter, M. C., & Lombardi, L. (1990). Regeneration in the short-term recall of sentences. Journal of Memory and
Language, 29(6), 633–654.
Pulvermüller, F., & Knoblauch, A. (2009). Discrete combinatorial circuits emerging in neural networks: A
mechanism for rules of grammar in the human brain? Neural Networks, 22(2), 161–172.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science,
274(5294), 1926–1928.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana, IL: University of
Illinois Press.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237(4820),
1317–1323.
Steedman, M. (2000). The syntactic process. Cambridge, MA: MIT Press.
Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition. Cambridge,
MA: Harvard University Press.
Ullman, M. T. (2001). The declarative/procedural model of lexicon and grammar. Journal of Psycholinguistic
Research, 30(1), 37–69.
Valian, V., & Aubry, S. (2005). When opportunity knocks twice: Two-year-olds’ repetition of sentence subjects.
Journal of Child Language, 32(3), 617–641.
Veall, M. R., & Zimmermann, K. F. (1996). Pseudo-R2 measures for some common limited dependent variable
models. Journal of Economic Surveys, 10(3), 241–259.
Yamamoto, M., & Church, K. W. (2001). Using suffix arrays to compute term frequency and document
frequency for all substrings in a corpus. Computational Linguistics, 27(1), 1–30.
Appendix A
Log frequencies of stimulus sequences and their component words, bigrams, and trigrams in a 1.72 million-word child language corpus

Sequence                Freq.   W1      W2      W3      W4      B1      B2      B3      T1      T2
Back in the box         4.16    8.19    9.80    11.12   7.31    6.27    8.87    6.57    5.82    5.46
Back in the case        0.00    8.19    9.80    11.12   4.98    6.27    8.87    2.08    5.82    1.10
Back in the town        0.00    8.19    9.80    11.12   4.19    6.27    8.87    3.61    5.82    2.64
Out of the water        2.48    8.44    9.62    11.12   7.19    7.13    8.09    5.89    6.57    3.18
Out of the liquid       0.00    8.44    9.62    11.12   3.58    7.13    8.09    2.30    6.57    0.00
Out of the pudding      0.00    8.44    9.62    11.12   3.30    7.13    8.09    0.69    6.57    0.00
A piece of toast        3.61    10.74   6.94    9.62    6.50    6.03    6.59    4.70    5.89    4.34
A piece of meat         0.00    10.74   6.94    9.62    3.14    6.03    6.59    1.10    5.89    0.00
A piece of brick        0.00    10.74   6.94    9.62    4.11    6.03    6.59    0.69    5.89    0.00
It’s time for lunch     2.20    9.35    7.57    8.88    6.53    5.11    4.68    4.84    4.01    2.40
It’s time for soup      0.00    9.35    7.57    8.88    3.00    5.11    4.68    0.00    4.01    0.00
It’s time for drums     0.00    9.35    7.57    8.88    2.08    5.11    4.68    0.00    4.01    0.00
A bowl of cornflakes    2.20    10.74   6.28    9.62    6.19    4.54    4.36    3.93    3.81    2.56
A bowl of biscuits      0.00    10.74   6.28    9.62    5.97    4.54    4.36    3.26    3.81    0.00
A bowl of flowers       0.00    10.74   6.28    9.62    6.21    4.54    4.36    3.50    3.81    0.69
Have a nice day         2.64    9.48    10.74   8.54    7.32    7.92    7.06    4.50    4.76    4.23
Have a nice hour        0.00    9.48    10.74   8.54    4.36    7.92    7.06    0.00    4.76    0.00
Have a nice meal        0.00    9.48    10.74   8.54    5.12    7.92    7.06    2.08    4.76    1.95
You bumped your head    3.66    11.15   5.12    9.43    6.80    4.36    4.47    6.14    4.04    4.08
You bumped your leg     0.00    11.15   5.12    9.43    4.89    4.36    4.47    3.18    4.04    0.00
You bumped your toy     0.00    11.15   5.12    9.43    5.61    4.36    4.47    3.26    4.04    0.00
What a funny noise      2.77    9.70    10.74   6.62    6.88    6.16    5.60    4.80    3.33    4.62
What a funny sound      0.00    9.70    10.74   6.62    6.08    6.16    5.60    1.10    3.33    0.69
What a funny cup        0.00    9.70    10.74   6.62    6.47    6.16    5.60    0.00    3.33    0.00
Let’s have a look       5.56    7.80    9.48    10.74   9.08    5.90    7.92    6.80    5.80    6.74
Let’s have a see        0.00    7.80    9.48    10.74   8.86    5.90    7.92    1.61    5.80    0.69
Let’s have a think      0.00    7.80    9.48    10.74   9.21    5.90    7.92    1.39    5.80    1.10

Column key: Freq. = log frequency of the whole sequence; W1–W4 = log frequencies of the 1st–4th words; B1–B3 = log frequencies of the 1st–3rd bigrams; T1–T2 = log frequencies of the 1st and 2nd trigrams.

Note. 1 was added to all frequencies before taking the logarithms to accommodate frequencies of 0. For all
models reported in the body of the paper we also conducted alternative analyses in which PCs were built using
log frequencies in which we added values at intervals between 0.0000000001 and 1. The pattern of results (the
outcome of all model comparisons) was found to be the same regardless of the value added.
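The add-one transform described in the note can be sketched as follows. This is an illustration only, assuming natural logarithms; `log_freq` is a hypothetical helper, not code from the paper.

```python
import math

def log_freq(count, offset=1.0):
    # Log of (count + offset). With offset = 1, unattested sequences
    # (count 0) get a log frequency of exactly 0; the robustness check
    # described in the note re-ran the analyses with offsets between
    # 0.0000000001 and 1.
    return math.log(count + offset)

print(log_freq(0))           # prints 0.0
print(log_freq(0, 1e-10))    # a large negative value for a tiny offset
```

The choice of offset shifts the scores for zero-frequency items but, as the note reports, did not change the outcome of any model comparison.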
Appendix B
Syntactic types of stimuli sentences
Prepositional phrases
Back in the box | case | town
Out of the water | liquid | pudding
Noun phrases
A bowl of cornflakes | biscuits | flowers
A piece of toast | meat | brick
Sentences
You bumped your head | leg | toy
What a funny noise | sound | cup
Let’s have a look | see | think
Have a nice day | hour | meal
It’s time for lunch | soup | drums
Supporting Information
Additional Supporting Information may be found in the online version of this article:
Appendix S1: Principal components analysis for stimuli frequencies
Appendix S2: Checking our models for collinearity
Appendix S3: Obtaining human similarity judgments for evaluating sequence typicality
Please note: Wiley-Blackwell is not responsible for the content or functionality of any supporting materials
supplied by the authors. Any queries (other than missing material) should be directed to the corresponding
author for the article.