

Phil. Trans. R. Soc. B (2012) 367, 1956-1970
doi: 10.1098/rstb.2012.0077

Formal language theory: refining the Chomsky hierarchy

Gerhard Jäger¹,* and James Rogers²

¹Department of Linguistics, University of Tuebingen, Wilhelmstrasse 19, Tuebingen 72074, Germany
²Department of Computer Science, Earlham College, Richmond, IN, USA

*Author for correspondence ([email protected]).

One contribution of 13 to a Theme Issue 'Pattern perception and computational complexity'.

The first part of this article gives a brief overview of the four levels of the Chomsky hierarchy, with a special emphasis on context-free and regular languages. It then recapitulates the arguments why neither regular nor context-free grammar is sufficiently expressive to capture all phenomena in natural language syntax. In the second part, two refinements of the Chomsky hierarchy are reviewed, which are both relevant to extant research in cognitive science: the mildly context-sensitive languages (which are located between context-free and context-sensitive languages), and the sub-regular hierarchy (which distinguishes several levels of complexity within the class of regular languages).

Keywords: formal language theory; complexity; artificial grammar learning

1. INTRODUCTION

The field of formal language theory (FLT), initiated by Noam Chomsky in the 1950s, building on earlier work by Axel Thue, Alan Turing and Emil Post, provides a measuring stick for linguistic theories that sets a minimal limit of descriptive adequacy. Chomsky suggested a series of massive simplifications and abstractions to the empirical domain of natural language. (In particular, this approach ignores meaning entirely. Also, all issues regarding the usage of expressions, such as their frequency, context dependence and processing complexity, are left out of consideration. Finally, it is assumed that patterns that are productive for short strings apply to strings of arbitrary length in an unrestricted way.) The immense success of this framework, influencing not only linguistics to this day but also theoretical computer science and, more recently, molecular biology, suggests that these abstractions were well chosen, preserving the essential aspects of the structure of natural languages.¹

An expression in the sense of FLT is simply a finite string of symbols, and a (formal) language is a set of such strings. The theory explores the mathematical and computational properties of such sets. To begin with, formal languages are organized into a nested hierarchy of increasing complexity.

In its classical formulation [3], this so-called Chomsky hierarchy has four levels of increasing complexity: regular, context-free, context-sensitive and computably enumerable languages. Subsequent work in formal linguistics showed that this fourfold distinction is too coarse-grained to pin down the level of complexity of natural languages along this domain.


Therefore, several refinements have been proposed. Of particular importance here are the levels that extend the class of context-free languages (CFLs), the so-called mildly context-sensitive languages, and those that further delimit the regular languages, the sub-regular hierarchy.

In this article, we will briefly recapitulate the characteristic properties of the four classical levels of the Chomsky hierarchy and their (ir)relevance to the analysis of natural languages. We will do this in a semi-formal style that does not assume any specific knowledge of discrete mathematics beyond elementary set theory. On this basis, we will explain the motivation and characteristics of the mildly context-sensitive and the sub-regular hierarchies. In this way, we hope to give researchers working in artificial grammar learning (AGL) an iron ration of FLT that helps them to relate experimental work to formal notions of complexity.

2. THE CHOMSKY HIERARCHY

A formal language in the sense of FLT is a set of sequences, or strings, over some finite vocabulary Σ. When applied to natural languages, the vocabulary is usually identified with words, morphemes or sounds.²

FLT is a collection of mathematical and algorithmic tools about how to define formal languages with finite means, and how to process them computationally. It is important to bear in mind that FLT is neither concerned with the meanings of strings, nor with quantitative/statistical aspects such as the frequency or probability of strings. This in no way suggests that these aspects are not important for the analysis of sets of strings in the real world; this is just not what FLT traditionally is about (even though it is of course possible to extend FLT accordingly; see §7).

To be more specific, FLT deals with formal languages (sets of strings) that can be defined by finite means, even if the language itself is infinite.



The standard way to give such a finite description is with a grammar. Four things must be specified to define a grammar: a finite vocabulary of symbols (referred to as terminals) that appear in the strings of the language; a second finite vocabulary of extra symbols called non-terminals; a special designated non-terminal called the start symbol; and a finite set of rules.

From now on, we will assume that when we refer to a grammar G we refer to a quadruple ⟨Σ, NT, S, R⟩, where Σ is the set of terminals, NT is the set of non-terminals, S is the start symbol and R is the set of rules. Rules have the form α → β, understood as 'α may be replaced by β', where α and β are strings of symbols from Σ and/or NT. Application of the rule α → β to a string means finding a substring in it that is identical to α and replacing that substring by β, keeping the rest the same. Thus, applying α → β to xαy produces xβy.

G will be said to generate a string w consisting of symbols from Σ if and only if it is possible to start with S and produce w through some finite sequence of rule applications. The sequence of modified strings that proceeds from S to w is called a derivation of w. The set of all strings that G can generate is called the language of G, and is notated L(G).
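To make the definitions concrete, here is a minimal Python sketch (our illustration, not part of the original article) of rule application and a derivation, using a grammar with the rules S → aSb and S → ab, whose language is a^n b^n:

    def apply_rule(string, position, alpha, beta):
        # Application of alpha -> beta at a position: xay becomes xby.
        assert string[position:position + len(alpha)] == alpha
        return string[:position] + beta + string[position + len(alpha):]

    # A derivation of "aabb" from the start symbol S in {S -> aSb, S -> ab}:
    step0 = "S"
    step1 = apply_rule(step0, 0, "S", "aSb")  # "aSb"
    step2 = apply_rule(step1, 1, "S", "ab")   # "aabb": only terminals remain
    print(step0, step1, step2)

The sequence S, aSb, aabb is exactly a derivation in the sense just defined, so aabb is in L(G) for this grammar.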

The question of whether a given string w is generated by a given grammar G is called the membership problem. It is decidable if there is a Turing machine (or an equivalent device, i.e. a computer program running on a machine with unlimited memory and time resources) that answers this question with 'yes' or 'no' in finite time. A grammar G is called decidable if the membership problem is decidable for every string of terminals of that grammar. In a slight abuse of terminology, a language is called decidable if it has a decidable grammar. A class of grammars/languages is called decidable if and only if all its members are decidable.

(a) Computably enumerable languages

The class of all languages that can be defined by some formal grammar is called computably enumerable. It can be shown that any kind of formal, algorithmic procedure that can be precisely defined can also be expressed by some grammar, be it the rules of chess, the derivations of logic or the memory manipulations of a computer program. In fact, any language that can be defined by a Turing machine (or an equivalent device) is computably enumerable, and vice versa.

All computably enumerable languages are semi-decidable. This means that there is a Turing machine that takes a string w as input and outputs the answer 'yes' if and only if w is generated by G. If w is not generated by G, then the machine either outputs a different answer or it runs forever.

Examples of languages with this property are the set of computer programs that halt after a finite number of steps (simply compile the program into a Turing machine and let it run, and then output 'yes' if the program terminates), or the set of provable statements of first-order logic. (A Turing machine can systematically list all proofs of theorems one after the other; if the last line of the proof equals the string in question, output 'yes'; otherwise, move on to the next proof.)

(b) Context-sensitive languages

Context-sensitive grammars³ are those grammars where the left-hand side of each rule (α) is never longer than the right-hand side (β). Context-sensitive languages are then the languages that can be defined by some context-sensitive grammar. The definition of this class of grammars immediately ensures a decision procedure for the membership problem. Starting from the string in question w, there are finitely many ways in which rules can be applied backward to it. None of the resulting strings is longer than w. Repeating this procedure either leads to shorter strings or to a loop that need not be further considered. In this way, it can be decided in finite time whether w is derivable from S.
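The decision procedure just sketched is easy to make concrete. The following Python sketch (our illustration; a naive search, not an efficient parser) applies rules backward from w: because context-sensitive rules never shrink a string, the search stays within strings no longer than w and therefore terminates:

    from collections import deque

    def decide_membership(rules, w, start="S"):
        # rules are (alpha, beta) pairs with len(alpha) <= len(beta);
        # apply them backward (replace beta by alpha) until the start
        # symbol is reached or no new strings can be produced.
        seen, queue = {w}, deque([w])
        while queue:
            current = queue.popleft()
            if current == start:
                return True
            for alpha, beta in rules:
                i = current.find(beta)
                while i != -1:
                    predecessor = current[:i] + alpha + current[i + len(beta):]
                    if predecessor not in seen:   # loops are cut off here
                        seen.add(predecessor)
                        queue.append(predecessor)
                    i = current.find(beta, i + 1)
        return False

    rules = [("S", "aSb"), ("S", "ab")]        # a (context-free, hence
    print(decide_membership(rules, "aaabbb"))  # context-sensitive) grammar: True
    print(decide_membership(rules, "aabbb"))   # False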

Even though the question of whether or not a given string w is generated by a given context-sensitive grammar G is in principle decidable, computing this answer may be algorithmically so complex that it is, for practical purposes, intractable.⁴

It should be noted that there are decidable languages that are not context-sensitive (even though they do not have any practical relevance in connection with natural languages).

Examples of context-sensitive languages (that are not context-free) are as follows (we follow the common notation where x^i denotes a consecutive string of symbols that contains exactly i repetitions of the string x):

- the set of all prime numbers (where each number n is represented by a string of length n);
- the set of all square numbers;
- the copy language, i.e. the set of all strings over Σ that consist of two identical halves;
- a^n b^m c^n d^m;
- a^n b^n c^n; and
- a^n b^n c^n d^n e^n f^n.

(c) Context-free languages

In a context-free grammar, all rules take the form

A → β,

where A is a single non-terminal symbol and β is a string of symbols.⁵ CFLs are those languages that can be defined by a context-free grammar.

Here, the non-terminals can be interpreted as names of syntactic categories, and the arrow → can be interpreted as 'consists of'. Therefore, the derivation of a string x in such a grammar implicitly imposes a hierarchical structure on x into ever larger sub-phrases. For this reason, context-free grammars/languages are sometimes referred to as phrase structure grammars/languages, and it is assumed that such languages have an intrinsic hierarchical structure.

As hierarchical structure is inherent in many sequential phenomena in biology and culture, from problem solving to musical structure, context-free grammars are a very versatile analytical tool.

It is important to keep in mind, though, that a CFL (i.e. a set of strings) does not automatically come equipped with an intrinsic hierarchical structure. There may be several grammars for the same language that impose entirely different phrase structures.


Figure 1. Phrase structure tree.

Figure 2. Different phrase structure tree for the same string as in figure 1.


This point can be illustrated with the language (ab)^n (cd)^n. A simple grammar for it has only two rules:

- S → abScd; and
- S → abcd.

The derivation of the string abababcdcdcd can succinctly be represented by the phrase structure tree given in figure 1. In such a tree diagram, each local tree (i.e. each node together with the nodes below it that are connected to it by a direct line) represents one rule application, with the node on top being the left-hand side and the nodes at the bottom the right-hand side. The sequence that is derived can be read off the leaves (the nodes from which no line extends downwards) of the tree.

The same language can also be described by a somewhat more complex grammar, using the rules:

- S → aTd;
- T → bSc; and
- T → bc.

According to this grammar, the phrase structure tree for abababcdcdcd comes out as given in figure 2.

So both grammars impose a hierarchical structure on the string in question, but these structures differ considerably. It is thus important to keep in mind that phrase structures are tied to particular grammars and need not be intrinsic to the language as such.

Natural languages often provide clues about the hierarchical structure of their sentences beyond the plain linear structure. (Intonation, semantic coherence, morphological agreement and relative syntactic independence are frequently used criteria for a sub-string to be considered a coherent hierarchical unit.) Therefore, most linguists require that a grammar not just generate the correct set of strings to be adequate; it must also assign a plausible phrase structure.

The membership problem for CFLs is solvable in cubic time, i.e. the maximum time that is needed to decide whether a given string x belongs to L(G) for some context-free grammar G grows with the third power of the length of x. This means that there are efficient algorithms to solve this problem. Examples of (non-regular) CFLs are given in the left column of table 1. Where appropriate, minimally differing examples of non-context-free languages (which are all context-sensitive) are given in the right column for contrast.
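The article does not name a particular algorithm; one standard way to meet the cubic-time bound is the CYK algorithm, sketched below (our illustration) for grammars in Chomsky normal form:

    def cyk(word, start, unary, binary):
        # table[i][l] holds the non-terminals deriving word[i:i+l];
        # filling it takes O(n^3) steps for a fixed grammar.
        n = len(word)
        table = [[set() for _ in range(n + 1)] for _ in range(n)]
        for i, ch in enumerate(word):
            table[i][1] = {A for A, a in unary if a == ch}
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                for split in range(1, length):
                    for A, B, C in binary:
                        if B in table[i][split] and C in table[i + split][length - split]:
                            table[i][length].add(A)
        return start in table[0][n]

    # a^n b^n in Chomsky normal form: S -> AX | AB, X -> SB, A -> a, B -> b
    unary = [("A", "a"), ("B", "b")]
    binary = [("S", "A", "X"), ("S", "A", "B"), ("X", "S", "B")]
    print(cyk("aaabbb", "S", unary, binary))  # True
    print(cyk("aaabb", "S", unary, binary))   # False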

(d) Regular languages

Regular languages are those languages that are defined by regular grammars. In such a grammar, all rules take one of the following two forms:

A → a;
A → aB.

Here, A and B denote non-terminal symbols and a a terminal symbol.⁶

As regular grammars are also context-free, the non-terminals can be seen as category symbols and the arrow as 'consists of'. According to another natural interpretation, non-terminals are the names of the states of an automaton. The arrow → symbolizes possible state transitions, and the terminal on the right-hand side is a symbol that is emitted as a side effect of this transition. The start symbol S designates the initial state, and rules without a non-terminal on the right-hand side represent transitions into the final state. As there are finitely many non-terminals, a regular grammar thus describes a finite state automaton (FSA). In fact, it can be shown that each FSA can be transformed into one that is described by a regular grammar without altering the language that is being described. Therefore, regular grammars/languages are frequently referred to as finite state grammars/languages.

The membership problem for regular languages can be solved in linear time, i.e. the recognition time grows at most proportionally to the length of the string in question. Regular languages can thus be processed computationally in a very efficient way.
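The linear-time bound is immediate from the automaton perspective: recognition is one table lookup per input symbol. A minimal sketch (our illustration), using the regular language a^n b^m from table 2:

    def accepts(transitions, start, finals, string):
        # One state transition per symbol: linear in the string length.
        state = start
        for symbol in string:
            state = transitions.get((state, symbol))
            if state is None:           # no transition defined: reject
                return False
        return state in finals

    # An FSA for a^n b^m (any number of a's followed by any number of b's):
    transitions = {("A", "a"): "A", ("A", "b"): "B", ("B", "b"): "B"}
    print(accepts(transitions, "A", {"A", "B"}, "aaabb"))  # True
    print(accepts(transitions, "A", {"A", "B"}, "abab"))   # False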

Table 2 gives some examples of regular languages in the left column. They are contrasted to similar non-regular (context-free) languages in the right column (figure 3).


Table 1. Context-free and non-context-free languages. (Paired entries contrast minimally differing examples.)

context-free languages:
- the mirror language (i.e. the set of strings xy over a given Σ such that y is the mirror image of x)
- the palindrome language (i.e. the set of strings x that are identical to their mirror image)
- a^n b^n
- a^n b^m c^m d^n
- well-formed programs of Python (or any other high-level programming language)
- the Dyck language (the set of well-nested parentheses)
- well-formed arithmetic expressions

non-context-free languages:
- the copy language (i.e. the set of strings xx over a given Σ such that x is an arbitrary string of symbols from Σ), contrasting with the mirror language
- a^n b^n c^n, contrasting with a^n b^n
- a^n b^m c^n d^m, contrasting with a^n b^m c^m d^n

Table 2. Regular and non-regular languages.

regular languages:
- a^n b^m
- the set of strings x such that the number of a's in x is a multiple of 4
- the set of natural numbers that leave a remainder of 3 when divided by 5

non-regular languages:
- a^n b^n, contrasting with a^n b^m
- the set of strings x such that the number of a's and the number of b's in x are equal, contrasting with the multiple-of-4 language

Figure 3. The Chomsky hierarchy: the regular languages (e.g. a^n b^m) are properly included in the context-free languages (e.g. a^n b^n), which are properly included in the context-sensitive languages (e.g. a^(2^n)), which are properly included in the type-0 languages (e.g. {a^n : n is the Gödel number of a Peano theorem}).


As the examples illustrate, regular grammars are able to count up to a certain number. This number may be arbitrarily large, but for each regular grammar there is an upper limit for counting. No regular grammar is able to count two sets of symbols and compare their size if this size is potentially unlimited. As a consequence, a^n b^n is not regular.

The full proof of this fact goes beyond the scope of this overview article, and the interested reader is referred to the literature cited. The crucial insight underlying this proof is quite intuitive, though, and we will provide a brief sketch.

For each regular grammar G, it is possible to construct an algorithm (an FSA) that reads a string from left to right, and then outputs 'yes' if the string belongs to L(G), and 'no' otherwise. At each point in time, this algorithm is in one of k + 1 different states, where k is the number of non-terminals in G. Suppose, for a reductio ad absurdum, that L = a^n b^n is a regular language, and let G* be a regular grammar that recognizes L and that has k* non-terminals. Then the corresponding recognition algorithm has k* + 1 different states. Now let i be some number larger than k* + 1. According to the assumption, a^i b^i belongs to L(G*). When the recognition algorithm reads in the sequence of a's at the beginning of the string, it will visit the same state for the second time after at most k* + 1 steps. So a sequence of i consecutive a's will be indistinguishable for the algorithm from a sequence of i − k′ consecutive a's, for some positive k′ ≤ k* + 1. Hence, if the algorithm accepts the string a^i b^i, it will also accept the string a^(i−k′) b^i. As this string does not belong to a^n b^n, the algorithm does not accept all and only the elements of a^n b^n, contra assumption. Therefore a^n b^n cannot be a regular language.

As mentioned above, each regular language corresponds to some FSA, i.e. an algorithm that consumes one symbol at a time and changes its state according to the symbol consumed. As the name suggests, such an automaton has finitely many states. Conversely, each FSA can be transformed into a regular grammar G such that the automaton accepts all and only the strings in L(G).

The other levels of the Chomsky hierarchy likewise each correspond to a specific class of automata. Context-free grammars correspond to FSAs that are additionally equipped with a pushdown stack. When reading an input symbol, such a machine can, next to changing its state, add an item on top of the stack or remove an item from the top of the stack.
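As a concrete illustration (ours), a finite-state control plus a single pushdown stack suffices for the Dyck language of well-nested parentheses from table 1, which is context-free but not regular:

    def dyck(string):
        # Push on opening brackets; pop and check on closing ones.
        matching = {")": "(", "]": "["}
        stack = []
        for ch in string:
            if ch in "([":
                stack.append(ch)
            elif ch in matching:
                if not stack or stack.pop() != matching[ch]:
                    return False
            else:
                return False
        return not stack  # accept only if every bracket was matched

    print(dyck("(()[])"))  # True
    print(dyck("(]("))     # False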

Context-sensitive grammars correspond to linearly bounded automata. These are essentially Turing machines, i.e. FSAs with a memory tape that can perform arbitrary operations (writing and erasing symbols on the tape and moving the tape in either direction) during state transitions. The length of the available tape is not infinite, though, but bounded by a number that is a linear function of the length of the input string.

Finally, type-0 grammars correspond to unrestricted Turing machines.

3. WHERE ARE NATURAL LANGUAGES LOCATED?

The issue of where natural languages are located within this hierarchy has been an open problem for decades. Chomsky [4] pointed out already in the 1950s that English is not a regular language, and this argument probably carries over to all other natural languages. The crucial insight here is that English has centre-embedding constructions.


Neither did John claim that he neither smokes while . . . nor snores, nor did anybody believe it.

Figure 4. Nested dependencies in English.

dass mer d'chind em Hans es Huus lönd hälfe aanstriiche
THAT WE THE CHILDREN-ACC HANS-DAT THE HOUSE-ACC LET HELP PAINT
'that we let the children help Hans paint the house'

Figure 5. Cross-serial dependencies in Swiss German.


These are constructions involving two dependent elements a and b that are not adjacent, and that may contain another instance of the same construction between the two parts. An example is a neither-nor construction, as illustrated in figure 4. The pairwise dependencies between neither and nor are nested. As far as the grammar of English goes, there is no fixed upper bound on the number of levels of embedding.⁷ Consequently, English grammar allows for a potentially unlimited number of nested dependencies of unlimited size. Regular grammars are unable to recognize this kind of unlimited dependency because this involves counting and comparing. As mentioned at the end of the previous section, regular languages cannot do this.

The issue of whether all natural languages are context-free proved to be more tricky.⁸ It was finally settled only in the mid-1980s, independently by the scholars Riny Huybregts [5], Stuart Shieber [6] and Christopher Culy [7]. Huybregts and Shieber use essentially the same argument. They notice that the dependencies between verbs and their objects in Swiss German are unbounded in length. However, they are not nested, but rather interleaved so that they cross each other. An example (taken from [6]) is given in figure 5.

Here, the first in a series of three article-noun phrases (d'chind, 'the children') is the object of the first verb, lönd ('let'), which requires its object to be in accusative case, and d'chind is in accusative; the second article-noun phrase (em Hans, 'Hans', carrying dative case) is the object of the second verb (hälfe, 'help', which requires its object to be in dative case); and the third article-noun phrase (es Huus, 'the house', accusative case) is the object of the third verb (aanstriiche, 'paint', which requires an accusative object). In English, as shown in the glosses, each verb is directly adjacent to its object, which could be captured even by a regular grammar. Swiss German, however, has crossing dependencies between objects and verbs, and the number of these interlocked dependencies is potentially unbounded. Context-free grammars can only handle an unbounded number of interlocked dependencies if they are nested. Therefore Swiss German cannot be context-free. Culy makes a case that the rules of word formation in the West African language Bambara conspire to create unbounded crossing dependencies as well, thus establishing the non-context-freeness of this language of well-formed words.

Simple toy languages displaying the same structural properties are the copy language (where each grammatical string has the form ww for some arbitrary string w, which creates dependencies between the corresponding symbols in the first and the second half of the string) and a^n b^m c^n d^m (where the dependencies between the a's and the c's include an unlimited number of open dependencies reaching from the b's to the d's). Therefore, both languages are not context-free.

4. MILDLY CONTEXT-SENSITIVE LANGUAGES

After this brief recapitulation of the classical Chomsky hierarchy, the rest of the paper will review two extensions that have proved useful in linguistics and cognitive science. The first one, dealt with in this section, considers levels between context-free and context-sensitive languages, the so-called mildly context-sensitive languages. The following section is devoted to the subregular hierarchy, a collection of language classes that are strictly included in the regular languages.

Since the 1980s, several attempts have been made to design grammar formalisms that are more suitable for doing linguistics than the rewrite grammars of the Chomsky hierarchy, while at the same time approximating the computational tractability of context-free grammars. The most prominent examples are Joshi's [8] tree adjoining grammar (TAG) and Mark Steedman's combinatory categorial grammar [9,10]. In 1991, Joshi et al. [11] proved that four of these formalisms (the two already mentioned, Gerald Gazdar's [12] linear indexed grammars and Carl Pollard's [13] head grammars) are equivalent, i.e. they describe the same class of languages. A series of related attempts to further extend the empirical coverage of such formalisms and to gain a deeper understanding of their mathematical properties converged on another class of mutually equivalent formalisms (including David Weir's [14] linear context-free rewrite systems and set-local multi-component TAGs, and Ed Stabler's [15] formalization of Noam Chomsky's [16] minimalist grammars (MGs)) that describe an even larger class of formal languages. As there are no common terms for these classes, we will refer to the smaller class as TAG languages and the larger one as MG languages.

The membership problem for TAG languages is O(n⁶), i.e. the time that the algorithm takes grows with the 6th power of the length of the string in question. Non-CFLs that belong to the TAG languages are, for instance:


Figure 6. The mildly context-sensitive sub-hierarchy: context-free ⊂ TAG ⊂ MG ⊂ context-sensitive, with example languages a^n b^n (context-free), a^n b^m c^n d^m (TAG but not context-free), a^n b^n c^n d^n e^n (MG but not TAG) and a^(2^n) (context-sensitive but not MG).


- a^n b^m c^n d^m;
- the copy language;
- a^n b^n c^n; and
- a^n b^n c^n d^n.

The descriptive power of TAG languages is sufficient to capture the kind of crossing dependencies that are observed in Swiss German and Bambara.⁹

MGs (and equivalent formalisms) are still more powerful than that. While TAG languages may only contain up to four different types of interlocked unlimited (crossing or nesting) dependencies, there is no such upper bound for MG languages. To be more precise, each MG language has a finite upper bound on the number of different types of dependencies, but within the class of MG languages this bound may be arbitrarily large. This leads to a higher computational complexity of the membership problem. It is still always polynomial, but the highest exponent of the term may be arbitrarily large.

Becker et al. [18] argue that this added complexity is actually needed to capture all aspects of word order variation in German.

Non-TAG languages that are included in the MG languages are, for instance:

- a_1^n . . . a_k^n for arbitrary k; and
- w^k for arbitrary k, i.e. the k-copy language for any k.

Joshi [8] described a list of properties that an extension of the CFLs should have if it is to be of practical use for linguistics:

- It contains all CFLs.
- It can describe a limited number of types of cross-serial dependencies.
- Its membership problem has polynomial complexity.
- All languages in it have the constant growth property.

With regard to the last property, let L be some formal language, and let l1, l2, l3, . . . be the strings in L, ordered according to length. L has the constant growth property if there is an upper limit for the difference in length between two consecutive elements of this list. The motivation for this postulate is that in each natural language, each sentence can be extended to another grammatical sentence by adding a single word (like an adjective or an adverb) or another short conjunct. Hence there cannot be arbitrarily large gaps in the list of possible lengths that the sentences of a language can have.

This postulate excludes many context-sensitive languages, such as the set of square numbers, the set of prime numbers or the set of powers of 2.

Joshi calls classes of languages with these properties mildly context-sensitive because they extend the CFLs, but only slightly, preserving many of the nice features of the CFLs. Both TAG languages and MG languages are mildly context-sensitive classes in this sense.

The refinement of the Chomsky hierarchy that emerges from this line of research is displayed in figure 6.

It should be noted that Michaelis & Kracht [19] present an argument that Old Georgian is not an MG language. This conclusion only follows, though, if a certain pattern of complex case marking in this language is applicable recursively without limit. Of course, this issue cannot be settled for a dead language, and so far the investigation of living languages with similar properties has remained inconclusive. Most experts therefore assume at this time that all natural languages are MG languages.

5. COGNITIVE COMPLEXITY

Classes of the Chomsky hierarchy provide a measure of the complexity of patterns based on the structure of the mechanisms (grammars, automata) that can distinguish them. But, as we observed in §2c, these mechanisms make judgements about strings in terms of specific analyses of their components. When dealing with an unknown mechanism, such as a cognitive mechanism of an experimental subject, we know nothing about the analyses it employs in making its judgements; we know only that it can or cannot make these judgements about strings correctly.

The question for AGL, then, is what characteristics of the formal mechanisms are shared by the physical mechanisms employed by an organism when it is learning a pattern. What valid inferences may one make about the nature of an unknown mechanism that can distinguish the same sorts of patterns?

Here, the grammar- and automata-theoretic characterizations of the Chomsky hierarchy are much less useful. As we saw in §2d and §4, mechanisms with widely differing natures often turn out to be equivalent in the sense that they are capable of describing exactly the same class of languages.¹⁰

Mechanisms that can recognize arbitrary CFLs are not limited to mechanisms that analyse the string in a way analogous to a context-free grammar. Dependency grammars [20], for example, analyse a string in terms of a binary relation over its elements, and there is a well-defined class of these grammars that can distinguish all and only the CFLs. In learning a CFL, it is not necessary to analyse it in terms of immediate constituency as it is formalized by context-free grammars.




Similarly, within each of the levels of the Chomsky hierarchy there are classes of languages that do not require the full power of the grammars associated with that level. The language a^n b^n, for example, while properly context-free, can be recognized by an FSA that is augmented with a simple counter. The languages a^n b^m c^m d^n and a^n b^n with explicit nested dependencies cannot. On the other hand, these can still be recognized by mechanisms that are simpler than those that are required to recognize CFLs in general.
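A minimal sketch (ours) of the first point: a finite control with a single counter recognizes a^n b^n, even though no FSA alone can:

    def anbn(string):
        # Two control states ("reading a's", "reading b's") plus one counter.
        count, phase = 0, "a"
        for ch in string:
            if ch == "a" and phase == "a":
                count += 1              # count the a's
            elif ch == "b" and count > 0:
                phase = "b"             # switch phase at the first b
                count -= 1              # match each b against a stored a
            else:
                return False
        return phase == "b" and count == 0

    print(anbn("aaabbb"))  # True
    print(anbn("aabbb"))   # False

The counter is the only unbounded memory here; as the text notes, languages with nested dependencies such as a^n b^m c^m d^n need more than a single count.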

So what can one say about a mechanism that can learn a properly context-free pattern? For one thing, it is not finite-state: that is, there is no bound, independent of the length of the string, on the quantity of information that it must infer in making a judgement about whether the string fits the pattern. Beyond that, there is very little if anything that we can determine about the nature of that information and how it is used simply from the evidence that an organism can learn the pattern.¹¹

The situation is not hopeless, however. No matter how complicated the information inferred by a mechanism in analysing a string, it must be based on recognizing simple patterns that occur in the string itself. One can, for example, identify the class of patterns that can be recognized simply on the basis of the adjacent pairs of symbols that occur in the string. Any mechanism that is based, in part, on that sort of information about the string will need at least to be able to distinguish patterns of this sort.

In §6, we introduce a hierarchy of language-theoretic complexity classes that are based on this sort of distinction: which relationships between the symbols in the string a mechanism must be sensitive to (attend to) in order to distinguish patterns in that class. Since they are based solely on the relationships that are explicit in the strings themselves, these classes are fundamental: every mechanism that can recognize a pattern that is properly in one of these classes must necessarily be sensitive to the kinds of relationships that characterize the class.

On the other hand, the fact that they are defined in terms of explicit relationships in the string itself also implies that they are all finite-state. But they stratify the finite-state languages in a way that provides a measure of complexity that is independent of the details that may vary between mechanisms that can recognize a given pattern, one that does not depend on a priori assumptions about the nature of the mechanism under study. Because this is a notion of complexity that is necessarily relevant to cognitive mechanisms, and because the relative complexity of patterns is invariant over the range of equivalent mechanisms (a property not shared by measures like, for example, minimum description length), it provides a useful notion of cognitive complexity.

This is a notion of complexity that is particularly useful for AGL: the patterns are relatively simple, and are therefore relatively practical to test, and they provide information about the capabilities of the organism that is relevant regardless of what additional capabilities it may have that enable it to learn more complicated patterns.

There are analogous hierarchies of classes that are based on relationships that are explicit in trees and in more complicated tree-like structures, which stratify the CFLs and a range of mildly context-sensitive languages [21]. These do, however, apply only to mechanisms that analyse strings in terms of tree-like partial orders.

In §6, we survey a hierarchy of complexity classes that is based on adjacency within a string, the so-called local hierarchy [22]. There is a parallel hierarchy that is based on precedence (over arbitrary distances) that distinguishes long-distance relationships within the string, including many that are relevant to a broad range of aspects of human language, including some, but clearly not all, long-distance relationships in syntax. More details of this hierarchy can be found in [23].

6. SUBREGULAR LANGUAGES

A subregular language is a set of strings that can be described without employing the full power of FSAs. Perhaps a better way of thinking about this is that the patterns that distinguish the strings that are included in the language from those that are not can be identified by mechanisms that are simpler than FSAs, often much simpler.

Many aspects of human language are manifestly subregular, including most local dependencies and many non-local dependencies as well. While these phenomena have usually been studied as regular languages, there are good reasons to ask just how much processing power is actually needed to recognize them. In comparative neurobiology, for example, there is no reason to expect non-human animals to share the full range of capabilities of the human language faculty. Even within human cognition, if one expects to find modularity in language processing, then one may well expect to find differences in the capabilities of the cognitive mechanisms responsible for processing the various modules. Similarly, in cognitive evolution one would not generally expect the antecedents of the human language faculty to share its full range of cognitive capabilities; we expect complicated structures to emerge, in one way or another, from simpler ones.

The hierarchy of language classes we are exploring here is characterized both by computational mechanisms (classes of automata and grammars) and by model-theoretic characterizations: characterizations in terms of logical definability by classes of logical expressions. The computational characterizations provide us with the means of designing experiments: developing practical sets of stimuli that allow us to probe the ability of subjects to distinguish strings in a language in a given class from strings that are not in that language, and which suffice to resolve the boundaries of that class. The model-theoretic characterizations, because of their abstract nature, allow us to draw conclusions that are valid for all mechanisms which are capable of making those distinctions. It is the interplay between these two ways of characterizing a class of languages that provides a sound foundation for designing AGL experiments and interpreting their results. Both types of characterization are essential to this enterprise.


Figure 7. Scanners have a sliding window of width k, a parameter, which moves across the string stepping one symbol at a time, picking out the k-factors of the string. For SL languages the string is accepted if and only if each of these k-factors is included in a look-up table.


(a) Strictly local languages

We begin our tour of these language classes at the lowest level of complexity that is not limited to languages of finitely bounded size: patterns which depend solely on the blocks of symbols which occur consecutively in the string, with each block being considered independently of the others. Such patterns are called strictly local (SL).¹²

An SLk definition is just a set of blocks of k adjacent symbols (called k-factors) drawn from the vocabulary augmented with two symbols, ⋊ and ⋉, denoting the beginning and end of the string, respectively. A string satisfies such a description if and only if every k-factor that occurs in the string is licensed by the definition. The SL2 description G_(AB)^n = {⋊A, AB, BA, B⋉}, for example, licenses the set of strings of the form (AB)^n.¹³

Abstract processing models for local languages are called scanners (figure 7). For strictly k-local languages, the scanner includes a look-up table of k-factors. A string is accepted if and only if every k-factor which occurs in the string is included in the look-up table. The look-up table is formally identical to an SLk description. These automata have no internal state. Their behaviour, at each point in the computation, depends only on the symbols which fall within the window at that point. This implies that every SLk language will be closed under substitution of suffixes, in the sense that, if the same k-factor occurs somewhere in two strings that are in the language, then the result of substituting the suffix, starting at that shared k-factor, of one for the suffix of the other must still be in the language.
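A strictly k-local scanner is only a few lines of code. The sketch below (ours, writing '<' and '>' in place of the start and end markers) implements the SL2 description of (AB)^n given above:

    def k_factors(string, k):
        # '<' and '>' stand in for the start/end-of-string markers.
        padded = "<" + string + ">"
        return {padded[i:i + k] for i in range(len(padded) - k + 1)}

    def sl_scan(string, k, table):
        # Accept iff every k-factor occurring in the string is licensed.
        return k_factors(string, k) <= table

    G_ABn = {"<A", "AB", "BA", "B>"}   # the SL2 description of (AB)^n
    print(sl_scan("ABAB", 2, G_ABn))   # True
    print(sl_scan("ABBA", 2, G_ABn))   # False ("BB" is not licensed)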

Both the SLk descriptions and the strictly k-local scanners are defined solely in terms of the length-k blocks of consecutive symbols that occur in the string, taken in isolation. This has a number of implications for cognitive mechanisms that can recognize SL languages:

- Any cognitive mechanism that can distinguish member strings from non-members of an SLk language must be sensitive, at least, to the length-k blocks of consecutive symbols that occur in the presentation of the string.
- If the strings are presented as sequences of symbols in time, then this corresponds to being sensitive, at each point in the string, to the immediately prior sequence of k − 1 symbols.
- Any cognitive mechanism that is sensitive only to the length-k blocks of consecutive symbols in the presentation of a string will be able to recognize only SLk languages.

Note that these mechanisms are not sensitive to the k-factors which do not occur in the string.

(b) Probing the SL boundary

In order to design experiments testing an organism's ability to recognize SL languages, one needs a way of generating sets of stimuli that sample languages that are SL and sets that sample languages that are minimally non-SL. (We return to these issues in §9.) This is another place in which computational characterizations of language classes are particularly useful. The language of strings of alternating symbols (e.g. As and Bs: (AB)^n), for example, is SL2. Mechanisms that are sensitive to the occurrence of length-2 blocks of consecutive symbols are capable, in principle, of distinguishing strings that fit such a constraint (e.g. (AB)^(i+j+1), for some i and j) from those that do not (e.g. (AB)^i AA (AB)^j). The ability to do so can be tested using sets of strings that match these patterns.¹⁴

Conversely, the language of strings in which some symbol (e.g. B) is required to occur at least once is not SLk for any k. (We refer to this language as some-B.) Mechanisms that are sensitive only to the occurrence of fixed-size blocks of consecutive symbols are incapable of distinguishing strings that meet such a constraint from those that do not. Thus these organisms would not, all other things being equal, recognize that stimuli of the form A^(i+j+1) do not belong to a language correctly generalized from sets of stimuli of the form A^i B A^j.

(c) Locally k-testable languages

Notice that if the B were forbidden rather than required, the second pattern would be an SL (even SL1) property. So we could define a property requiring some B as the complement (the language that contains all and only the strings that do not occur in that language) of an SL1 language. In this case, we take the complement of the set of strings in which B does not occur. If we add complement to our descriptions, it turns out that our descriptions will be able to express all Boolean operators: conjunction (and), disjunction (or) and negation (not), in any combination.

In order to do this, we will interpret k-factors as atomic (unanalysed) properties of strings; a string satisfies a k-factor if and only if that factor occurs somewhere in the string.

    http://rstb.royalsocietypublishing.org/

Figure 8. LT automata keep a record of which k-factors occur in a string and feed this information into a Boolean network. The automaton accepts if, once the entire string has been scanned, the output of the network is 'yes', and rejects otherwise.


We can then build descriptions as expressions in a propositional logic over these atoms. We refer to formulae in this logic as k-expressions. A k-expression defines the set of all (and only) strings that satisfy it. A language that is definable in this way is a locally k-testable (LTk) language. The class of languages that are definable by k-expressions for any finite k is denoted LT.

By way of example, we can define the set of strings which do not start with A and contain at least one B with the 2-expression (¬⋊A) ∧ B.

Note that any SLk definition G can be translated into a k-expression which is the conjunction ¬f1 ∧ ¬f2 ∧ . . ., in which the fi are the k-factors which are not included in G. SLk-definable constraints are, in essence, conjunctions of negative atomic constraints, and every such constraint is LTk-definable: SL is a proper subclass of LT.

A scanner for an LTk language contains, instead of just a look-up table of k-factors, a table in which it records, for every k-factor over the vocabulary, whether or not that k-factor has occurred somewhere in the string. It then feeds this information into a combinatorial (Boolean) network which implements some k-expression. When the end of the string is reached, the automaton accepts or rejects the string depending on the output of the network. (See figure 8.)
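Continuing the sketch from §6a (ours, not the article's), an LT scanner records which k-factors occur and hands that set to a Boolean predicate standing in for the network, here the 2-expression (¬⋊A) ∧ B from above:

    def lt_scan(string, k, k_expression):
        # Record WHICH k-factors occur, then apply the Boolean 'network'.
        padded = "<" + string + ">"
        occurring = {padded[i:i + k] for i in range(len(padded) - k + 1)}
        return k_expression(occurring)

    # (not <A) and B: does not start with A, contains at least one B.
    expr = lambda factors: "<A" not in factors and any("B" in f for f in factors)
    print(lt_scan("BAB", 2, expr))  # True
    print(lt_scan("AB", 2, expr))   # False (starts with A)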

Since an LTk scanner records only which k-factors occur in the string, it has no way of distinguishing strings which are built from the same set of k-factors. Hence, a language is LTk if and only if there is no pair of strings, each comprising the same set of k-factors, one of which is included in the language and the other excluded.

    From a cognitive perspective, then:

- Any cognitive mechanism that can distinguish member strings from non-members of an LTk language must be sensitive, at least, to the set of length-k contiguous blocks of symbols that occur in the presentation of the string, both those that do occur and those that do not.
- If the strings are presented as sequences of symbols in time, then this corresponds to being sensitive, at each point in the string, to the set of length-k blocks of symbols that occurred at any prior point.
- Any cognitive mechanism that is sensitive only to the occurrence or non-occurrence of length-k contiguous blocks of symbols in the presentation of a string will be able to recognize only LTk languages.

One of the consequences of the inability of k-expressions to distinguish strings which comprise the same set of k-factors is that LT languages cannot, in general, distinguish strings in which there is a single occurrence of some symbol from those in which there are multiple occurrences: the strings

⋊ A^(k−1) B A^(k−1) ⋉  and  ⋊ A^(k−1) B A^(k−1) B A^(k−1) ⋉

comprise exactly the same set of k-factors. Consequently, no mechanism that is sensitive only to the set of fixed-size blocks of symbols that occur in a string will be able, in general, to distinguish strings with a single instance of a symbol from those with more than one.

(d) Probing the LT boundary

The language of strings in which some block of k contiguous symbols is required to occur at least once (e.g. some-B of §6b, for which any k ≥ 1 will do) is LTk. Mechanisms which are sensitive to the set of fixed-length blocks of consecutive symbols which have occurred are capable, in principle, of distinguishing strings that meet such a constraint (e.g. A^i B A^j) from those that do not (e.g. A^(i+j+1)). Again, these patterns form a basis for developing sets of stimuli that provide evidence of an organism's ability to make these distinctions.

Conversely, the language of strings in which some block of k contiguous symbols is required to occur exactly once (e.g. one-B, in which exactly one B occurs in every string) is not LTk for any k. Mechanisms that are sensitive only to the set of fixed-length blocks of consecutive symbols which have occurred are incapable of distinguishing strings that meet such a constraint from those that do not. Thus sets of stimuli generated by the patterns A^i B A^(j+k+1) (in the set one-B) and A^i B A^j B A^k (not in that set) can be used to probe whether an organism is limited to distinguishing sets of strings on this basis.

(e) FO(+1) definability: LTT

This weakness of LT is, simply put, an insensitivity to quantity as opposed to simple occurrence. We can overcome this by adding quantification to our logical descriptions, that is, by moving from propositional logic to a first-order logic which we call FO(+1). Logical formulae of this sort make (Boolean combinations of) assertions about which symbols occur at which positions (s(x), where s is a symbol in the vocabulary and x is a variable ranging over positions in a string), about the adjacency of positions (x ◁ y, which asserts that the position represented by y is the successor of that represented by x) and about the identity of positions (x ≈ y), with the positions being quantified existentially (∃) or universally (∀). This


Figure 9. LTT automata count the number of k-factors that occur in a string up to some bound and feed this information into a Boolean network. The automaton accepts if, once the entire string has been scanned, the output of the network is 'yes', and rejects otherwise.


allows us to distinguish, for example, one occurrence of a B from another:

φ_one-B ≡ ∃x[B(x) ∧ ¬∃y[B(y) ∧ ¬(x ≈ y)]].

This FO(+1) formula requires that there is some position in the string (call it x) at which a B occurs and that there is no position (y) at which a B occurs that is distinct from it (¬(x ≈ y)).

In this example, we have defined a property expressed in terms of the occurrence of a single symbol B but, using the successor relation, we could just as well be restricting the number of occurrences of any k-factor, for some fixed k. Moreover, using multiple levels of quantification, we can distinguish arbitrarily many distinct positions; but, since a single formula can only contain a fixed number of quantifiers, there is a fixed finite bound on the number of positions a given formula can distinguish. Hence FO(+1) formulae can, in essence, count, but only up to some fixed threshold. Note that the fixed threshold is compatible with subitization as well as actual counting.

The class of FO(+1)-definable languages is characterized by what is known as local threshold testability (LTT). LTT automata extend LT automata by counting the number of occurrences of each k-factor, with the counters counting up to a fixed maximum and then simply staying at that value if there are additional occurrences. (See figure 9.)
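In code (ours), an LTT scanner differs from the LT scanner only in keeping thresholded counts; the predicate below implements one-B with k = 1 and threshold t = 2:

    from collections import Counter

    def ltt_scan(string, k, t, predicate):
        # Count each k-factor's occurrences, capping the counts at t.
        padded = "<" + string + ">"
        counts = Counter(padded[i:i + k] for i in range(len(padded) - k + 1))
        capped = {f: min(c, t) for f, c in counts.items()}
        return predicate(capped)

    one_b = lambda counts: counts.get("B", 0) == 1   # exactly one B
    print(ltt_scan("AABAA", 1, 2, one_b))  # True
    print(ltt_scan("ABABA", 1, 2, one_b))  # False (count caps at 2)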

    This gives us a cognitive interpretation of LTT:

- Any cognitive mechanism that can distinguish member strings from non-members of an LTT language must be sensitive, at least, to the multiplicity of the length-k blocks of symbols, for some fixed k, that occur in the presentation of the string, distinguishing multiplicities only up to some fixed threshold t.
- If the strings are presented as sequences of symbols in time, then this corresponds to being able to count up to some fixed threshold.
- Any cognitive mechanism that is sensitive only to the multiplicity, up to some fixed threshold (and, in particular, not to the order), of the length-k blocks of symbols in the presentation of a string will be able to recognize only LTT languages.

(f) Probing the LTT boundary

The language of strings in which some block of k contiguous symbols is required to occur exactly once (e.g. one-B, for which any k and any t > 1 will do) is LTTk,t. Mechanisms which are sensitive to the multiplicity, up to some fixed threshold, of fixed-length blocks of consecutive symbols which have occurred are capable, in principle, of distinguishing strings that meet such a constraint (e.g. A^i B A^(j+k+1)) from those that do not (e.g. A^i B A^j B A^k).

Conversely, the language of strings in which some block of k contiguous symbols is required to occur prior to the occurrence of another (e.g. no-B-after-C, in which no string has an occurrence of C that precedes an occurrence of B, with As freely distributed) is not LTTk,t for any k or t. Mechanisms that are sensitive only to the multiplicity, up to some fixed boundary, of the occurrences of fixed-length blocks of consecutive symbols are incapable of distinguishing strings that meet such a constraint from those that do not. Sets of stimuli that test this ability can be based on the patterns A^i B A^j C A^k, A^i B A^j B A^k and A^i C A^j C A^k, all of which satisfy the no-B-after-C constraint, and A^i C A^j B A^k, which violates it.

(g) FO(<) definability: star-free languages

Figure 10. Reber's [30] grammar: an FSA over the symbols T, P, S, V, X, shown together with the set of 3-factors that licenses the same (SL3) language.


- If the strings are presented as sequences of symbols in time, then this corresponds to being sensitive, at each point in the string, to the sequences of the length-k blocks that have occurred at any prior point.

- Any cognitive mechanism that is sensitive only to the set of fixed-length sequences of length-k blocks of symbols in the presentation of a string will be able to recognize only SF languages.

7. STATISTICAL MODELS OF LANGUAGE

Many studies of AGL have focused on statistical learning [25-27]. Language models which are based on the probability of a given symbol following another are Markov processes [28]. These can be interpreted as FSAs with transition probabilities, where the underlying FSA recognizes an SL2 language. n-gram and n-factor are equivalent concepts; in general, an n-gram model is a weighted version of an FSA that recognizes an SLn language; an n-gram model of a language (an (n − 1)st-order Markov model) is a strictly n-local distribution.
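To make the correspondence concrete, here is a small sketch (ours) of a bigram (2-gram) model: the strings with non-zero probability form exactly an SL2 language, and the weights turn that SL2 automaton into a first-order Markov model:

    from collections import defaultdict

    def bigram_model(corpus):
        # Estimate transition probabilities over 2-factors, with '<' and
        # '>' as start/end markers; the support is an SL2 language.
        counts, totals = defaultdict(int), defaultdict(int)
        for word in corpus:
            padded = "<" + word + ">"
            for a, b in zip(padded, padded[1:]):
                counts[(a, b)] += 1
                totals[a] += 1
        return {pair: n / totals[pair[0]] for pair, n in counts.items()}

    def probability(model, word):
        # Zero as soon as an unattested 2-factor occurs.
        p, padded = 1.0, "<" + word + ">"
        for pair in zip(padded, padded[1:]):
            p *= model.get(pair, 0.0)
        return p

    model = bigram_model(["ab", "abab"])
    print(probability(model, "ababab"))  # > 0: built from attested 2-factors
    print(probability(model, "ba"))      # 0.0: the 2-factor '<b' never occurs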

Statistical models of language are not directly comparable to the sets of strings of traditional FLT, but there is a clear distinction between SL languages and n-gram models, in which probabilities are not preserved under substitution of suffixes. Nevertheless, a language learner that infers the probabilities of n-grams must be able to distinguish n-grams. In other words, it must attend to the n-factors of the input. Thus, the notion of cognitive complexity that we have developed here is still relevant.

Each of the levels of the hierarchy corresponds to a class of statistical distributions. The number of parameters, though, rises rapidly: the number of parameters of an LTk distribution is exponential in k. In application, the higher-complexity models are likely to be infeasible. On the other hand, there is a complexity hierarchy that parallels the local hierarchy but which is based on precedence (order of symbols independent of the intervening material) [23], which also provides a basis for statistical models. The strictly piecewise distributions, those analogous to n-gram models, are both feasible and capable of discriminating many long-distance dependencies [29]. The question of whether a learner can attend to subsequences (with arbitrary intervening material) in addition to, or rather than, substrings (factors) is significant.

8. SOME CLASSIC ARTIFICIAL GRAMMARS

Some of the earliest AGL experiments were conducted by Reber [30]. These were based on the grammar represented by the FSA in figure 10. This automaton recognizes an SL3 language, licensed by the set of 3-factors also given in the figure; to learn a language of this form, the subject need only attend to the blocks of three consecutive symbols occurring in the strings, recognizing an exception when one of the forbidden blocks occurs.

Saffran et al. [26] presented their test subjects with continuous streams of words in which word boundaries were indicated only by the transitional probabilities between syllables. In general, this would give an arbitrary SL2 distribution, but in this case the probabilities of the transitions internal to the words are 1.0 and all transitions between words are equiprobable.^16 Under these circumstances, the language is the same as that of the SL2 automaton on which the distribution is based, i.e. this language is simply a strictly 2-local language. It should be noted, though, that from the perspective of cognitive complexity we have presented here this is a distinction without a difference. Whether the language is a non-trivial statistical model or not, to learn it the subjects need only attend to the pairs of adjacent syllables occurring in the stimulus.
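A sketch of that design (ours; the syllable stream below is invented for illustration): with fixed three-syllable words, within-word transitional probabilities estimated from the stream come out at 1.0 while cross-word transitions come out lower, which is the only cue to a word boundary.

```python
from collections import defaultdict

# Illustrative estimation of transitional probabilities from a stream.
stream = ["bi","da","ku","pa","do","ti","bi","da","ku","go","la","bu",
          "pa","do","ti","go","la","bu","bi","da","ku"]

pair_counts = defaultdict(int)
first_counts = defaultdict(int)
for a, b in zip(stream, stream[1:]):
    pair_counts[(a, b)] += 1
    first_counts[a] += 1

def tp(a: str, b: str) -> float:
    """Transitional probability P(b | a) estimated from the stream."""
    return pair_counts[(a, b)] / first_counts[a]

# Word-internal transitions (e.g. bi->da) come out at 1.0; transitions
# across word boundaries (e.g. ku->pa) come out lower.
for a, b in sorted(set(zip(stream, stream[1:]))):
    print(f"{a}->{b}: {tp(a, b):.2f}")
```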

Marcus et al. [31] specifically avoided prediction by transitional probabilities by testing their subjects with strings generated according to the training pattern, but over a novel vocabulary. Gomez & Gerken [32] used a similar design. In the latter case, the grammars they used are similar to that of Reber and also license SL3 languages. Marcus et al. limited their stimuli to exactly three syllables in order to eliminate word length as a possible cue. In general, every language of three-syllable words is trivially SL4. The focus of the experiments, though, was strings distinguished by where and whether syllables were repeated (i.e. ABA versus AAB versus ABB). Languages in which no syllable is repeated are simply SL2; those in which the first pair of syllables, the last pair of syllables and/or the first and last syllable are repeated are SL3. In all of these cases, the language can be distinguished by simply attending to blocks of consecutive syllables in isolation.

Finally, Saffran [27] used a language generated by a simple context-free grammar in which there is no self-embedding: no category includes a constituent which is of the same category. Hence, this language is also finite and trivially SL. Again, though, the experiments focused on the ability of the subjects to learn patterns within these finite bounds. In this case, there were two types of patterns. In the first type, a word in some category occurs only in the presence of a word in another specific category. In the second type, a word in some category occurs only when a word of a specific category occurs somewhere prior to it in the word. These are both non-strictly-local patterns. The first is LT1: learning this pattern requires attention to the set of syllables that occur in the word. The second is strictly 2-piecewise testable, at the lowest level of the precedence-based hierarchy. Learning it requires attention to the set of pairs of syllables in which one occurs prior to the other in the word, with arbitrary intervening material.

Many recent AGL experiments have employed patterns of the types (AB)^n (repeated AB pairs) and A^nB^n (repeated As followed by exactly the same number of Bs, sometimes with explicit pairing of As and Bs, either in nested order or in cross-serial order: first A with first B, etc.). As noted in §6a, (AB)^n is strictly 2-local, the simplest of the complexity classes we have discussed here. A^nB^n is properly context-free, with or without explicit dependencies. All automata that can recognize non-finite-state languages can be modelled as FSAs with some sort of additional unbounded memory (one or more counters, a stack, a tape, etc.). The question of what capabilities are required to recognize properly CFLs, then, is a question of how that storage is organized.

As noted in §5, A^nB^n without explicit dependencies can be recognized by counting As and Bs (with a single counter), a simpler mechanism than that required to recognize CFLs in general. A^nB^n with explicit nested dependencies cannot be recognized using a single counter, but it is a linear CFL,^17 also simpler than the class of CFLs in general. The question of whether there are dependencies between the As and Bs is another issue that generally depends on knowing the way that the strings are being analysed. But it is possible to make the distinction between languages that can be recognized with a single counter and those that are properly linear CFLs without appealing to explicit dependencies by using the language A^nB^mC^mD^n.^18 If the dependencies between the As and Bs are cross-serial, then A^nB^n is properly non-context-free. A language that makes the same distinction without explicit dependencies is A^nB^mC^nD^m.
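As a concrete rendering of the single-counter strategy (a sketch of ours, not the authors' code): one counter goes up on As and down on Bs, and a single flag enforces that no A follows a B.

```python
# Sketch (ours) of recognizing AnBn without explicit dependencies using
# a single counter: increment on A, decrement on B, reject if an A
# follows a B or the counter goes negative, accept iff it ends at zero.

def single_counter_anbn(s: str) -> bool:
    count = 0
    seen_b = False
    for symbol in s:
        if symbol == "A":
            if seen_b:
                return False      # an A after a B: wrong shape
            count += 1
        elif symbol == "B":
            seen_b = True
            count -= 1
            if count < 0:
                return False      # more Bs than As so far
        else:
            return False          # symbol outside {A, B}
    return count == 0             # equal numbers (n = 0 allowed)

print(single_counter_anbn("AABB"))   # True
print(single_counter_anbn("AAB"))    # False
print(single_counter_anbn("ABAB"))   # False
```

Note that the counter checks only equinumerosity; it cannot verify nested or cross-serial pairings between particular As and Bs, which is exactly the point of the distinction drawn above.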

The difficulty of identifying which type of structure is being used by a subject to recognize a given non-regular pattern in natural languages delayed confirmation that there were human languages that employed cross-serial dependencies for decades [5–7,34,35]. In AGL experiments, one has the advantage of choosing the pattern, but the disadvantage of not having a priori knowledge of which attributes of the symbols are being distinguished by the subjects. The fact that a particular A is paired with a particular B means that those instances must have a different character than other instances of the same category. Indeed, these patterns can be represented as A^nA^n with explicit dependencies of some sort between the As in the first half of the string and those in the second half. Hence, most artificial grammars of this sort are actually more complicated versions of A^nB^mC^mD^n or A^nB^mC^nD^m. There seems to be little advantage to using patterns in which the dependencies are less explicit than these.

9. DESIGNING AND INTERPRETING ARTIFICIAL GRAMMAR LEARNING EXPERIMENTS

All of this gives us a foundation for exploring AGL experiments from a formal perspective. We will consider familiarization/discrimination experiments. We will use the term 'generalize' rather than 'learn' or 'become familiarized with' and will refer to a response that indicates that a string is recognized as an exception as 'surprise'.

Let us call the set generated by the artificial grammar we are interested in I, the intended set. The subject is exposed to some finite subset of this, which we will call F, the familiarization set. It then generalizes to some set (possibly the same set: the generalization may be trivial). Which set they generalize to gives evidence of the features of F the subject attended to. An error-free learner would not necessarily generalize to I; any superset of F is consistent with their experience. We will assume that the set the subject generalizes to is not arbitrary (it is not restricted in ways that are not exhibited by F) and that the inference mechanism exhibits some sort of minimality: it infers judgements about the stimuli that are not in F as well as those that are.^19

We then present the subject with a set which we will call D, the discrimination set, which includes some stimuli that are in I and some which are not, and observe which of these are treated by the subject as familiar and which are surprising. One can draw conclusions from these observations only to the extent that D distinguishes I from other potential generalizations. That is, D needs to distinguish all supersets and subsets of I that are consistent with (i.e. supersets of) F.

Figure 11 represents a situation in which we are testing the subject's ability to recognize a set that is generated by the pattern I = A^nB^n, in which a block of As is followed by a block of Bs of the same length, and we have exposed the subject to stimuli of the form F = AABB. The feature that is of primary interest is the fact that the number of As and Bs is exactly the same, a constraint that is properly context-free.

The innermost circle encompasses F. The bold circle delimits the intended set I. The other circles represent sets that are consistent with F but which generalize on other features. For example, a subject that identifies only the fact that all of the As precede all of the Bs would generalize to the set we have called A^nB^m. This set is represented by the middle circle on the right-hand side of the figure and encompasses I. A subject that identifies, in addition, the fact that all stimuli in F are of even length might generalize to the set we label A^nB^m_even. This is represented by the inner circle on the right-hand side.

It is also possible that the subject has learned that there are at least as many As as there are Bs (A^{n+m}B^m) or vice versa (A^nB^{n+m}). These sets include I and that portion of A^nB^m above (resp., below) the dotted line. Alternatively, if the subject has generalized from the fact that the number of As is equal to the number of Bs, but not the fact that all of the As precede all of the Bs, they might generalize to the set we label AB_equal. Included in this set, as a proper subset, is the language (AB)^n, which is not consistent with F but does share the feature of the As and Bs being equinumerous.


[Figure 11. Language inference experiments: nested candidate generalizations within {A,B}*, including F = {AABB}, the intended set I = {A^nB^n | n ≥ 1}, the finite set A^nB^n with n ≤ 2, A^nB^m, A^nB^m_even, A^{n+m}B^m, A^nB^{n+m}, AB_equal and (AB)^n, with the discrimination set D to be chosen.]


It is also possible that the subject makes the minimal generalization that the number of As is finitely bounded. The smallest of these sets consistent with F is the set we have labelled A^nB^n, n ≤ 2.

These, clearly, are not all of the sets the subject might consistently infer from F, but they are a reasonable sample of principled generalizations that we need to distinguish from I. Note that, in this case, the potential generalizations span the entire range of classes from SL2 (A^nB^m) through CF (A^nB^{n+m} and AB_equal). If we are to distinguish, say, the boundary between finite state and context-free, our probes will need to distinguish these. To do that, D must include stimuli in the symmetric differences of the potential generalizations (including I), that is to say, stimuli that are in one or the other but not both.

For example, suppose we test the subject with strings of the form AAB. These are not in A^nB^n (properly CF) but are in the set A^nB^m (SL), so subjects that have generalized to the former will generally find them surprising, while those that have generalized to the latter will not. On the other hand, they are also not in the set A^nB^m_even, which is finite state, but are in A^{n+m}B^m, which is properly context-free. Thus, these stimuli do not resolve the finite-state boundary.

Suppose, then, that we test with strings of the form AAAB. These, again, are not in A^nB^n (CF) but are in A^nB^m (SL). But they are also in both A^nB^m_even (finite state) and A^{n+m}B^m (properly context-free). Again, they do not resolve the finite-state boundary.

One can distinguish the finite-state from the non-finite-state languages in this particular set of potential generalizations if one includes strings of both of these forms in D, but that still will not distinguish A^nB^n from A^nB^{n+m}, which is presumably not significant here but may well be in testing other hypotheses. These can be distinguished by including strings of the forms that correspond to these but include more Bs than As.

None of these, though, distinguishes a learner that has generalized to a finite set (A^nB^n, n ≤ 2). To get evidence that the learner has done this, one needs to include strings of length greater than four.
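The bookkeeping in this section can be automated. The sketch below is ours (the candidate names follow figure 11): it tabulates, for each probe string, which candidate generalizations contain it, so one can read off which pairs of candidates a given discrimination set resolves.

```python
# Sketch (ours): candidate generalizations from figure 11 as membership
# predicates, with probe strings tabulated against them. A probe helps
# resolve a pair of candidates iff it lies in their symmetric difference.

CANDIDATES = {
    "AnBm":      lambda s: "BA" not in s,                 # all As before all Bs
    "AnBm_even": lambda s: "BA" not in s and len(s) % 2 == 0,
    "An+mBm":    lambda s: "BA" not in s and s.count("A") >= s.count("B"),
    "AnBn+m":    lambda s: "BA" not in s and s.count("A") <= s.count("B"),
    "I=AnBn":    lambda s: "BA" not in s and s.count("A") == s.count("B"),
    "ABequal":   lambda s: s.count("A") == s.count("B"),  # order-free
    "AnBn,n<=2": lambda s: s in {"AB", "AABB"},           # finite guess
}

PROBES = ["AAB", "AAAB", "ABB", "AABBB", "AAABBB"]

for probe in PROBES:
    members = [name for name, in_set in CANDIDATES.items() if in_set(probe)]
    print(f"{probe:8s} in: {members}")

# e.g. AAB lies in An+mBm but not in I, so it separates those two;
# AAABBB lies in I but not in the finite set, so it tests boundedness.
```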

One ends up with a discrimination set that includes at least five variants of the pattern A^nB^m for different values of n and m between two and six. This seems to be very near the boundary of practicality for most experiments involving living organisms. There are two ways that one might resolve this limitation: one can find experiments which can distinguish performance on stimuli of this size, perhaps not being able to draw any conclusions for some types of subject, or one can refine one's hypothesis so that it is practically testable. In any case, the issue is one that requires a good deal of careful analysis, and it is an issue that cannot be ignored.

10. CONCLUSION

The notion of language-theoretic complexity, both with respect to the Chomsky hierarchy and with respect to the sub-regular hierarchies, is an essential tool in AGL experiments. In the design of experiments, it provides a way of formulating meaningful, testable hypotheses, of identifying relevant classes of patterns, of finding minimal pairs of languages that distinguish those classes and of constructing sets of stimuli that resolve the boundaries of those languages. In the interpretation of the results of the experiments, the properties of the complexity classes provide a means of identifying the pattern a subject has generalized to, the class of patterns the subject population is capable of generalizing to and, ultimately, a means of identifying those features of the stimulus that the cognitive mechanisms being used are sensitive to.

In this paper, we have presented a scale for informing these issues that is both finer and broader than the finite-state/context-free contrast that has been the focus of much of the AGL work to date. While some of the distinctions between classes are subtle, and some of the analyses delicate, there are effective methods for distinguishing them that are generally not hard to apply, and the range of characterizations of the classes provides a variety of tools that can be employed in doing so. More importantly, the capabilities distinguished by these classes are very likely to be significant in resolving the issues that much of this research is intended to explore.

Finally, fully abstract characterizations of language classes, like many of those we have presented here, provide information about characteristics of the processing mechanism that are necessarily shared by all mechanisms that are capable of recognizing languages in these classes. This provides a foundation for unambiguous and strongly supported results about cognitive mechanisms for pattern recognition.

We thank William Tecumseh Fitch and the anonymous reviewers for immensely helpful comments on the draft version of this article.

ENDNOTES

1. Authoritative textbooks on this field are [1,2].
2. This points to another simplification that is needed when applying FLT to natural languages: in each language with productive word formation rules, the set of possible words is unbounded. Likewise, the set of morphemes is in principle unbounded if loans from other languages, acronym formation and similar processes are taken into consideration. It is commonly assumed here that the object of investigation is an idealized language that does not undergo change. When the vocabulary items are identified with words, it is tacitly taken for granted that the words of a language form a finite number of grammatical categories, and that it is thus sufficient to consider only a finite number of instances of each class.
3. The term 'context-sensitive' has only historical significance. It has nothing to do with context-dependency in a non-technical sense in any way. The same applies to the term 'context-free'.
4. In the terminology of computational complexity theory, the problem is PSPACE-hard.
5. In context-free grammars, the right-hand side of a rule may be the empty string, while in context-sensitive grammars this is not licit. Therefore, strictly speaking, not every context-free grammar is context-sensitive. This is a minor technical point, though, that can be ignored in the present context.
6. Equivalently, we may demand that the rules take the form A → a or A → Ba, with the non-terminal, if present, preceding the terminal. It is crucial, though, that within a given grammar, all rules start with a terminal on the right-hand side, or all rules end with a terminal.
7. Note that here one of the idealizations mentioned above comes into play: it is taken for granted that a productive pattern (forming a neither–nor construction out of two grammatical sentences) can be applied to arbitrarily large sentences to form an even larger sentence.
8. Here, we strictly refer to the problem whether the set of strings of grammatical English sentences is a CFL, disregarding all further criteria for the linguistic adequacy of a grammatical description.
9. A thorough discussion of types of dependencies in natural languages and mild context-sensitivity can be found in [17].
10. Even more strongly, it is generally possible to convert descriptions from one class of models to descriptions in an equivalent class fully automatically.
11. This does not imply that mechanisms that are physically finitely bounded (the brain of an organism, for example) are restricted to recognizing only finite-state languages. The organism may well employ a mechanism that, in general, requires unbounded memory and which would simply fail when it encounters a string that is too complex, if it ever did encounter such a string.
12. More details on this and the other local classes of languages can be found in [24].
13. We use capital letters here to represent arbitrary symbols drawn from mutually distinct categories of the symbols of the vocabulary. Although none of our mechanisms involve the sort of string rewriting employed by the grammars of the first part of this paper, and we distinguish no formal set of non-terminals, there is a rough parallel between this use of capital letters to represent categories of terminals and the interpretation of non-terminals as representing grammatical categories in phrase-structure grammars.
14. The patterns are both defined in terms of the parameters i and j so that the lengths of the strings do not vary between them.
15. The successor relationship is definable using only < and quantification, so one no longer explicitly needs the successor relation. Similarly, multiplicity of factors can be defined in terms of their order, so one does not actually need to count to a threshold greater than 1.
16. Technically, the final probabilities of this distribution were all 0, i.e. the distribution included no finite strings.
17. In which only a single non-terminal occurs at any point in the derivation.
18. Note that two counters would suffice to recognize A^nB^mC^mD^n but, as Minsky showed [33], two counters suffice to recognize any computable language.
19. The assumption that the generalization is not arbitrary implies, inter alia, that if it includes strings that are longer than those in F it will include strings of arbitrary length. This allows one to verify that a subject has not simply made some finite generalization of the (necessarily finite) set F.

REFERENCES

1. Hopcroft, J. E., Motwani, R. & Ullman, J. D. 2000 Introduction to automata theory, languages and computation. Reading, MA: Addison-Wesley.
2. Sipser, M. 1997 Introduction to the theory of computation. Boston, MA: PWS Publishing.
3. Chomsky, N. 1956 Three models for the description of language. IRE Trans. Inform. Theory 2, 113–124. (doi:10.1109/TIT.1956.1056813)
4. Chomsky, N. 1957 Syntactic structures. The Hague, The Netherlands: Mouton.
5. Huybregts, R. 1984 The weak inadequacy of context-free phrase structure grammars. In Van periferie naar kern (eds G. J. de Haan, M. Trommelen & W. Zonneveld), pp. 81–99. Dordrecht, The Netherlands: Foris.
6. Shieber, S. 1985 Evidence against the context-freeness of natural language. Linguist. Philos. 8, 333–343. (doi:10.1007/BF00630917)
7. Culy, C. 1985 The complexity of the vocabulary of Bambara. Linguist. Philos. 8, 345–351. (doi:10.1007/BF00630918)
8. Joshi, A. 1985 How much context-sensitivity is necessary for characterizing structural descriptions: tree adjoining grammars. In Natural language processing: theoretical, computational and psychological perspectives (eds D. Dowty, L. Karttunen & A. Zwicky). Cambridge, UK: Cambridge University Press.
9. Ades, A. E. & Steedman, M. J. 1982 On the order of words. Linguist. Philos. 4, 517–558. (doi:10.1007/BF00360804)
10. Steedman, M. 2000 The syntactic process. Cambridge, MA: MIT Press.
11. Joshi, A., Vijay-Shanker, K. & Weir, D. 1991 The convergence of mildly context-sensitive grammar formalisms. In Processing of linguistic structure (eds P. Sells, S. Shieber & T. Wasow), pp. 31–81. Cambridge, MA: MIT Press.
12. Gazdar, G. 1988 Applicability of indexed grammars to natural languages. Technical Report 85-34, Center for the Study of Language and Information, Stanford, CA.
13. Pollard, C. J. 1984 Generalized phrase structure grammars, head grammars and natural language. PhD thesis, Stanford University, Stanford, CA.
14. Weir, D. J. 1988 Characterizing mildly context-sensitive grammar formalisms. PhD thesis, University of Pennsylvania.
15. Stabler, E. P. 1997 Derivational minimalism. In Logical aspects of computational linguistics (ed. C. Retoré), pp. 68–95. Berlin, Germany: Springer.
16. Chomsky, N. 1995 The minimalist program. Cambridge, MA: MIT Press.
17. Stabler, E. P. 2004 Varieties of crossing dependencies: structure dependence and mild context sensitivity. Cogn. Sci. 28, 699–720. (doi:10.1207/s15516709cog2805_4)
18. Becker, T., Joshi, A. & Rambow, O. 1991 Long-distance scrambling and tree adjoining grammars. In Proc. 5th Conf. of the European Chapter of the Association for Computational Linguistics, Berlin, Germany, 9–11 April 1991, pp. 21–26. Association for Computational Linguistics.
19. Michaelis, J. & Kracht, M. 1997 Semilinearity as a syntactic invariant. In Logical aspects of computational linguistics (ed. C. Retoré), pp. 329–345. Berlin, Germany: Springer.
20. Tesnière, L. 1959 Éléments de syntaxe structurale. Paris, France: Klincksieck.
21. Rogers, J. 2003 wMSO theories as grammar formalisms. Theoret. Comput. Sci. 293, 291–320. (doi:10.1016/S0304-3975(01)00349-8)
22. McNaughton, R. & Papert, S. A. 1971 Counter-free automata. Cambridge, MA: MIT Press.
23. Rogers, J., Heinz, J., Bailey, G., Edlefsen, M., Visscher, M., Wellcome, D. & Wibel, S. 2010 On languages piecewise testable in the strict sense. In The mathematics of language: revised selected papers from the 10th and 11th Biennial Conference on the Mathematics of Language (eds C. Ebert, G. Jäger & J. Michaelis), Lecture Notes in Computer Science/Lecture Notes in Artificial Intelligence, vol. 6149, pp. 255–265. Berlin, Germany: FoLLI/Springer.
24. Rogers, J. & Pullum, G. 2007 Aural pattern recognition experiments and the subregular hierarchy. In Proc. 10th Mathematics of Language Conference (ed. M. Kracht), pp. 1–7. Los Angeles, CA: University of California.
25. Christiansen, M. H. & Chater, N. (eds) 2001 Connectionist psycholinguistics. New York, NY: Ablex.
26. Saffran, J. R., Aslin, R. N. & Newport, E. L. 1996 Statistical learning by 8-month-old infants. Science 274, 1926–1928. (doi:10.1126/science.274.5294.1926)
27. Saffran, J. R. 2001 The use of predictive dependencies in language learning. J. Mem. Lang. 44, 493–515. (doi:10.1006/jmla.2000.2759)
28. Manning, C. & Schütze, H. 1999 Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
29. Heinz, J. & Rogers, J. 2010 Estimating strictly piecewise distributions. In Proc. 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010, pp. 886–896. Association for Computational Linguistics.
30. Reber, A. S. 1967 Implicit learning of artificial grammars. J. Verb. Learn. Verb. Behav. 6, 855–863. (doi:10.1016/S0022-5371(67)80149-X)
31. Marcus, G. F., Vijayan, S., Bandi Rao, S. & Vishton, P. M. 1999 Rule learning by seven-month-old infants. Science 283, 77–79. (doi:10.1126/science.283.5398.77)
32. Gómez, R. L. & Gerken, L. 1999 Artificial grammar learning by 1-year-olds leads to specific and abstract knowledge. Cognition 70, 109–135. (doi:10.1016/S0010-0277(99)00003-7)
33. Minsky, M. L. 1961 Recursive unsolvability of Post's problem of 'tag' and other topics in the theory of Turing machines. Ann. Math. 74, 437–455. (doi:10.2307/1970290)
34. Pullum, G. K. & Gazdar, G. 1982 Natural languages and context-free languages. Linguist. Philos. 4, 471–504. (doi:10.1007/BF00360802)
35. Pullum, G. K. 1985 On two recent attempts to show that English is not a CFL. Comput. Linguist. 10, 182–186.
