
Psychological Review, 1982, Vol. 89, No. 4, 369-406

Copyright 1982 by the American Psychological Association, Inc. 0033-295X/82/8904-0369$00.75

Acquisition of Cognitive Skill

John R. Anderson
Carnegie-Mellon University

A framework for skill acquisition is proposed that includes two major stages in the development of a cognitive skill: a declarative stage in which facts about the skill domain are interpreted and a procedural stage in which the domain knowledge is directly embodied in procedures for performing the skill. This general framework has been instantiated in the ACT system in which facts are encoded in a propositional network and procedures are encoded as productions. Knowledge compilation is the process by which the skill transits from the declarative stage to the procedural stage. It consists of the subprocesses of composition, which collapses sequences of productions into single productions, and proceduralization, which embeds factual knowledge into productions. Once proceduralized, further learning processes operate on the skill to make the productions more selective in their range of applications. These processes include generalization, discrimination, and strengthening of productions. Comparisons are made to similar concepts from past learning theories. How these learning mechanisms apply to produce the power law speedup in processing time with practice is discussed.

It requires at least 100 hours of learning and practice to acquire any significant cognitive skill to a reasonable degree of proficiency. For instance, after 100 hours a student learning to program a computer has achieved only a very modest facility in the skill. Learning one's primary language takes tens of thousands of hours. The psychology of human learning has been very thin in ideas about what happens to skills under the impact of this amount of learning, and for obvious reasons. This article presents a theory about the changes in the nature of a skill over such large time scales and about the basic learning processes that are responsible.

This research was sponsored by the Personnel and Training Research Programs, Psychological Sciences Division, Office of Naval Research, under Contract N00014-81-C-0335, Contract Authority Identification Number, NR 157-465, and Grant IST-80-15357 from the National Science Foundation. My ability to put together this theory has depended critically on input from colleagues I have worked with over the past few years: Charles Beasley, Jim Greeno, Paul Kline, Pat Langley, and David Neves. This is not to suggest that any of the above would endorse all of the ideas in this paper. I would like to thank those who have provided me with valuable advice and feedback on the paper: Renee Elio, Jim Greeno, Paul Kline, Jill Larkin, Clayton Lewis, Miriam Schustack, and especially Lynne Reder.

Requests for reprints should be sent to John Anderson, Department of Psychology, Carnegie-Mellon University, Schenley Park, Pittsburgh, Pennsylvania 15213.

Fitts (1964) considered the process of skill acquisition to fall into three stages of development. The first stage, called the cognitive stage, involves an initial encoding of the skill into a form sufficient to permit the learner to generate the desired behavior to at least some crude approximation. In this stage it is common to observe verbal mediation in which the learner rehearses information required for the execution of the skill. The second stage, called the associative stage, involves the "smoothing out" of the skill performance. Errors in the initial understanding of the skill are gradually detected and eliminated. Concomitantly, there is a dropout of verbal mediation. The third stage, the autonomous stage, is one of gradual continued improvement in the performance of the skill. The improvements in this stage often continue indefinitely. Although these general observations about the course of skill development seem true for a wide range of skills, they have defied systematic theoretical analysis.

The theory to be presented in this article is in keeping with these general observations of Fitts (1964) and provides an explanation of the phenomena associated with his three stages. In fact, the three major sections of this article correspond to the three stages. In the first stage the learner receives instruction and information about a skill. The instruction is encoded as a set of facts about the skill. These facts can be used by general interpretive procedures to generate behavior. This initial stage of skill corresponds to Fitts's cognitive stage. In this article it will be referred to as the declarative stage. Verbal mediation is frequently observed because the facts have to be rehearsed in working memory to keep them available for the interpretive procedures.

Figure 1. A representation of the flow of control in Table 1 between various goals. (The boxes correspond to goal states, and the arrows to productions that can change these states. Control starts with the top goal.)

According to the theory to be presented here, Fitts's second stage is really a transition between the declarative stage and a later stage. With practice the knowledge is converted into a procedural form in which it is directly applied without the intercession of other interpretive procedures. The gradual process by which the knowledge is converted from declarative to procedural form is called knowledge compilation. Fitts's associative stage corresponds to the period over which knowledge compilation applies.

According to the theory, Fitts's autonomous stage involves further learning that occurs after the knowledge achieves procedural form. In particular, there is further tuning of the knowledge so that it will apply more appropriately, and there is a gradual process of speedup. This will be called the procedural stage in this article.

This article presents a detailed theory about the use and development of knowledge in both the declarative and procedural form and about the transition between these two forms. The theory is based on the ACT production system (Anderson, 1976) in which the distinction between procedural and declarative knowledge is fundamental. Procedural knowledge is represented as productions, whereas declarative knowledge is represented as a propositional network. Before describing the theory of skill acquisition, it will be necessary to specify some of the basic operating principles of the ACT production system.

The ACT Production System

The ACT production system consists of a set of productions that can operate on facts in the declarative data base. Each production has the form of a primitive rule that specifies a cognitive contingency, that is, a production specifies when a cognitive act should take place. The production has a condition that specifies the circumstances under which the production can apply and an action that specifies what should be done when the production applies. The sequence of productions that apply in a task corresponds to the cognitive steps taken in performing the task. In the actual computer simulations, these production rules often have quite technical syntax, but in this article I will usually give the rules quite English-like renditions. For current purposes application of a production can be thought of as a step of cognition. Much of the ACT performance theory is concerned with specifying how productions are selected to apply, and much of the ACT learning theory is concerned with how these production rules are acquired.

An Example

To explain some of the basic concepts of the ACT production system, it is useful to have an illustrative set of productions that performs some simple task. Such a set of productions for performing addition is given in Table 1. Figure 1 illustrates the flow of control in that production set among goals. It is easiest to understand such a production system by tracing its application to a problem such as adding a column of numbers:

    614
    438
    683

Table 1
A Production System for Performing Addition

P1. IF the goal is to do an addition problem,
    THEN the subgoal is to iterate through the columns of the problem.

P2. IF the goal is to iterate through the columns of an addition problem
       and the rightmost column has not been processed,
    THEN the subgoal is to iterate through the rows of that rightmost column
       and set the running total to zero.

P3. IF the goal is to iterate through the columns of an addition problem
       and a column has just been processed
       and another column is to the left of this column,
    THEN the subgoal is to iterate through the rows of this column to the left
       and set the running total to the carry.

P4. IF the goal is to iterate through the columns of an addition problem
       and the last column has been processed
       and there is a carry,
    THEN write out the carry
       and POP the goal.

P5. IF the goal is to iterate through the columns of an addition problem
       and the last column has been processed
       and there is no carry,
    THEN POP the goal.

P6. IF the goal is to iterate through the rows of a column
       and the top row has not been processed,
    THEN the subgoal is to add the digit of the top row into the running total.

P7. IF the goal is to iterate through the rows of a column
       and a row has just been processed
       and another row is below it,
    THEN the subgoal is to add the digit of the lower row to the running total.

P8. IF the goal is to iterate through the rows of a column
       and the last row has been processed
       and the running total is a digit,
    THEN write the digit
       and delete the carry
       and mark the column as processed
       and POP the goal.

P9. IF the goal is to iterate through the rows of a column
       and the last row has been processed
       and the running total is of the form "string + digit,"
    THEN write the digit
       and set carry to the string
       and mark the column as processed
       and POP the goal.

P10. IF the goal is to add a digit to a number
       and the number is a digit
       and a sum is the sum of the two digits,
    THEN the result is the sum
       and mark the digit as processed
       and POP the goal.

P11. IF the goal is to add a digit to a number
       and the number is of the form "string + digit"
       and a sum is the sum of the two digits
       and the sum is less than 10,
    THEN the result is "string + sum"
       and mark the digit as processed
       and POP the goal.

P12. IF the goal is to add a digit to a number
       and the number is of the form "string + digit"
       and a sum is the sum of the two digits
       and the sum is of the form "1 + digit*"
       and another number sum* is the sum of 1 plus string,
    THEN the result is "sum* + digit*"
       and mark the digit as processed
       and POP the goal.

Production P1 is the first to apply and would set as a subgoal to iterate through the columns. Then Production P2 applies and changes the subgoal to adding the digits of the rightmost column. It also sets the running total to 0. Then Production P6 applies to set the new subgoal to adding the top digit of the column (4) to the running total. In terms of Figure 1, this sequence of three productions has moved the system down from the top goal of doing the problem to the bottom goal of performing a basic addition operation. The system has the four goals in Figure 1 stacked with attention focused on the bottom goal.

At this point Production P10 applies, which calculates 4 as the new value of the running total and POPs the goal of adding the digit to the running total. This amounts to removing this goal from the stack and returning attention to the goal of iterating through the rows of the column. Then P7 applies, which sets the new subgoal of adding 8 into the running total. P10 applies again to change the running total to 12; then P7 applies to create the subgoal of adding 3 into the running total; then P11 calculates the new running total as 15. At this point the system is back at the goal of iterating through the rows and has processed the bottom row of the column. Then Production P9 applies, which writes out the 5 in 15, sets the carry to the 1, and POPs back to the goal of iterating through the columns. At this point the production system has processed one column of the problem.

I will not trace out any further the application of this production set to the problem, but the reader is invited to carry out the hand simulation. Note that Productions P2-P5 form a subroutine for iterating through the columns, Productions P6-P9 an embedded subroutine for processing a column, and Productions P10-P12 an embedded subroutine for adding a digit to the running total. In Figure 1 all the productions corresponding to a subroutine emanate from the same goal box.
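For readers who want to check the hand simulation mechanically, the following Python sketch mirrors the goal structure of Table 1 and logs the name of the production that would fire at each step. It is only an illustration under simplifying assumptions: the control flow is hard-coded as nested loops rather than driven by a pattern matcher and goal stack, the running total is held as an integer rather than as a "string + digit" structure, and the function name `add_columns` is hypothetical and not part of ACT.

```python
def add_columns(numbers):
    """Add non-negative integers column by column, logging Table 1 production names."""
    width = max(len(str(n)) for n in numbers)
    rows = [str(n).zfill(width) for n in numbers]
    trace = ["P1"]            # P1: set subgoal to iterate through the columns
    written = []              # digits written out, rightmost column first
    carry = 0
    for col in range(width - 1, -1, -1):        # columns, right to left
        rightmost = col == width - 1
        trace.append("P2" if rightmost else "P3")
        total = 0 if rightmost else carry       # P2: zero; P3: the carry
        for r, row in enumerate(rows):          # rows, top to bottom
            trace.append("P6" if r == 0 else "P7")
            digit = int(row[col])
            if total < 10:                      # P10: running total is a digit
                trace.append("P10")
                total += digit
            else:                               # total has the form "string + digit"
                string, last = divmod(total, 10)
                s = last + digit
                if s < 10:                      # P11: digit sum is below 10
                    trace.append("P11")
                    total = string * 10 + s
                else:                           # P12: digit sum is "1 + digit*"
                    trace.append("P12")
                    total = (string + 1) * 10 + (s - 10)
        if total < 10:                          # P8: write the digit, delete the carry
            trace.append("P8")
            written.append(total)
            carry = 0
        else:                                   # P9: write the digit, keep the string as carry
            trace.append("P9")
            carry, low = divmod(total, 10)
            written.append(low)
    if carry:                                   # P4: write out the final carry
        trace.append("P4")
        written.append(carry)
    else:                                       # P5: nothing left to write
        trace.append("P5")
    return int("".join(str(d) for d in reversed(written))), trace


total, trace = add_columns([614, 438, 683])
print(total)
print(" ".join(trace))
```

On the example problem, the first nine entries of the trace correspond to the sequence described in the text for the rightmost column (P1, P2, P6, P10, P7, P10, P7, P11, P9).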

Significant Features of the Performance System

A number of features of the production system are important for the learning theory to be presented. The productions themselves are the system's procedural component. For a production to apply, the clauses specified in its condition must be matched against information active in working memory. This information in working memory is part of the system's declarative component. Elsewhere (Anderson, 1976, 1980) I have discussed the network encoding of that declarative knowledge and the process of spreading activation defined on that network.

Goal structure. As noted above, the productions in Table 1 are organized into subroutines, where each subroutine is associated with a goal state that all the productions in the subroutine are trying to achieve. Because the system can have only one goal at any moment in time, productions from only one of these subroutines can apply at any one time. This forces a considerable seriality into the behavior of the system. These goal-seeking productions are hierarchically organized. The idea that hierarchical structure is fundamental to human cognition has been emphasized by Miller, Galanter, and Pribram (1960) and many others. Greeno (1976) used the idea of goal structuring for production systems. Brown and Van Lehn (1980) have recently introduced a similar goal structuring for production systems.

In the original ACT system (Anderson, 1976), there was a scheme for achieving the effect of subroutines by the setting of control variables. There are several important differences between the current scheme and that older one. First, as noted, the current scheme forces a strong degree of seriality into the system. Second, because the goals are not arbitrary nodes but rather meaningful assertions, it is much easier for ACT's learning system to acquire new productions that make reference to goals. Evidence for this last assertion will be given as the various ACT learning mechanisms are discussed.

In achieving a hierarchical subroutine structure by means of a goal-subgoal structure, I am accepting the claim that the hierarchical control of behavior derives from the structure of problem solving. This amounts to making the assertion that problem solving, and the goal structure it produces, is a fundamental category of cognition. This assertion has been advanced by others (e.g., Newell, 1980). Thus, this learning discussion contains a rather strong presupposition about the architecture of cognition. I think the presupposition is too abstract to be defended directly; rather, evidence for it will come from the fruitfulness of the systems that we can build based on the architectural assumption.

Conflict resolution. Every production system requires some rules of conflict resolution, that is, principles for deciding which of those productions that match will be executed. ACT has a set of conflict-resolution principles that can be seen as variants of the 1976 ACT (Anderson, 1976) or the OPS system (Forgy & McDermott, 1977). One powerful principle is refractoriness: that the same production cannot apply to the same data in working memory twice in the same way. This prevents the same production from repeating over and over again and was implicit in the preceding hand simulation of Table 1.

The two other principles of conflict resolution in ACT are specificity and strength. Neither was illustrated in Table 1 but both are important to understanding the learning discussion. If two productions can apply and the condition of one is more specific than the other, then the more specific production takes precedence. Condition A is more specific than Condition B if the set of situations in which Condition A can match is a proper subset of the set of situations where Condition B can match. The specificity rule allows exceptions to general rules to apply because these exceptions will have more specific conditions. For instance, suppose we had the following pair of productions:

PA. IF the goal is to generate the plural of man,
    THEN say "MEN."

PB. IF the goal is to generate the plural of a noun,
    THEN say "noun + s."

The condition of Production PA is more specific than the condition of Production PB, and so PA will apply over the general pluralization rule.

Each production has a strength that reflects the frequency with which that production has been successfully applied. I will describe the rules that determine strength accumulation later in this article; here I describe the role of production strength in conflict resolution. Elsewhere (e.g., Anderson, 1976; Anderson, Kline, & Beasley, 1979) I have given a version of this role of strength that assumes discrete time intervals. Here I give a continuous version. Productions are indexed by the constants in their conditions. For instance, the Production PA above would be indexed by plural and man. If these concepts are active in working memory, the production will be selected for consideration. In this way ACT can focus its attention on just the subset of productions that may be potentially relevant. Only if a production is selected is a test made to see if its condition is satisfied. (For future reference if a production is selected, it is said to be on the APPLYLIST.) A production takes a time T1 to be selected and another time T2 to be tested and to apply. The selection time T1 varies with the production's strength, whereas the application time is a constant over productions. It is further assumed that the time T1 for the production to be selected will randomly vary from selection to selection. The expected time is a/s where s is the production strength and a is a constant. Although there are no compelling reasons for making any assumption about the distribution, we have assumed that T1 has an exponential distribution, and this is its form in all our simulations.

A production will actually apply if it is selected and it has completed application before a more specific production is selected. This provides the relationship between strength and specificity in the theory. A more specific production will take precedence over a more general production only if its selection time is less than the selection plus application times of the more general production. Because strength reflects frequency of practice, only exceptions that have some criterion frequency will be able to reliably take precedence over general rules. This corresponds, for instance, to the fact that words with irregular inflections tend to be of relatively high frequency. It is possible for an exception to be of borderline strength so that it sometimes is selected in time to beat out the general rule but sometimes not. This corresponds, for instance, to the stage in language development when an irregular inflection is being used with only partial reliability (R. Brown, 1973).
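The race between an exception and a general rule can be illustrated with a small simulation. The sketch below assumes, as stated in the text, that each selection time is exponential with mean a/s and that the exception takes precedence when its selection time is less than the general rule's selection time plus the application time T2. The function names, the parameter values (a = 1.0, T2 = 0.1), and the strengths used in the example are hypothetical choices for illustration, not values from the ACT simulations. The closed-form expression is my own derivation from the exponential assumption and is included only as a cross-check on the simulation.

```python
import math
import random


def p_exception_wins(s_specific, s_general, a=1.0, t2=0.1, trials=20000, seed=0):
    """Monte Carlo estimate of how often the more specific production takes precedence."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # expovariate takes a rate; rate s / a gives mean selection time a / s.
        t1_specific = rng.expovariate(s_specific / a)
        t1_general = rng.expovariate(s_general / a)
        # The exception wins if it is selected before the general rule
        # has been selected and has finished applying.
        if t1_specific < t1_general + t2:
            wins += 1
    return wins / trials


def p_exception_wins_exact(s_specific, s_general, a=1.0, t2=0.1):
    """Closed form for the same probability under the exponential assumption."""
    rate_s, rate_g = s_specific / a, s_general / a
    return 1.0 - math.exp(-rate_s * t2) * rate_g / (rate_s + rate_g)


for strength in (0.5, 2.0, 5.0, 20.0):
    estimate = p_exception_wins(strength, s_general=5.0)
    exact = p_exception_wins_exact(strength, s_general=5.0)
    print(f"exception strength {strength:5.1f}: simulated {estimate:.3f}, exact {exact:.3f}")
```

Under these assumptions a weak exception wins only occasionally, an exception of intermediate strength wins part of the time, and a strong exception wins almost always, which is the qualitative pattern the text relates to irregular inflections.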

Variables. Productions contain variable slots that can take on different values in different situations. The use of these variables is often implicit, as in Table 1, but sometimes it is important to acknowledge the variables that are being assumed. As an illustration let us consider a variabilized form of a production from Table 1. If Production P9 from that table were to be written in a way to expose its variable structure, it would have the form below, where the terms prefixed by LV are local variables:

IF the goal is to iterate through the rows of LVcolumn
   and LVrow is the last row of LVcolumn
   and LVrow has been processed
   and the running total is of the form "LVstring + LVdigit,"
THEN write LVdigit
   and set carry to LVstring
   and mark LVcolumn as processed
   and POP the goal.

Local variables can be reassigned to new values each time the production applies. Thus, for instance, the terms LVcolumn, LVrow, LVstring, and LVdigit will match to whatever elements lead to a complete match of the condition to working memory. Suppose, for instance, that the following elements were in working memory:

The goal is to iterate through the rows of Column 2.
Row x is the last row of Column 2.
Row x has been processed.
Running total is of the form 2 + 4.

The production would match this working memory information with the following variable bindings:

LVcolumn = Column 2.
LVrow = Row x.
LVstring = 2.
LVdigit = 4.

Local variables assume values within a production for the purposes of matching the condition and executing the action. After application of the production, variables lose their values.
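The binding of local variables can be sketched in a few lines of Python. This is a minimal illustration, not ACT's matcher: it assumes that working-memory elements and condition clauses are encoded as tuples of strings (a hypothetical encoding chosen here), that any term beginning with "LV" is a local variable, and that the first consistent set of bindings found by backtracking is returned.

```python
def unify(pattern, fact, bindings):
    """Match one condition clause against one fact, extending the bindings."""
    if len(pattern) != len(fact):
        return None
    new = dict(bindings)
    for term, value in zip(pattern, fact):
        if term.startswith("LV"):              # local variable
            if new.setdefault(term, value) != value:
                return None                    # conflicts with an earlier binding
        elif term != value:                    # constants must match exactly
            return None
    return new


def match(condition, memory, bindings=None):
    """Return bindings that satisfy every clause of the condition, or None."""
    bindings = {} if bindings is None else bindings
    if not condition:
        return bindings
    first, rest = condition[0], condition[1:]
    for fact in memory:
        extended = unify(first, fact, bindings)
        if extended is not None:
            result = match(rest, memory, extended)
            if result is not None:
                return result
    return None


# Condition of the variabilized Production P9 (encoding is an assumption).
p9_condition = [
    ("goal", "iterate-rows", "LVcolumn"),
    ("last-row", "LVrow", "LVcolumn"),
    ("processed", "LVrow"),
    ("running-total", "LVstring", "LVdigit"),
]

working_memory = [
    ("goal", "iterate-rows", "Column 2"),
    ("last-row", "Row x", "Column 2"),
    ("processed", "Row x"),
    ("running-total", "2", "4"),
]

print(match(p9_condition, working_memory))
```

Because the bindings are built afresh on each call to `match` and discarded afterward, the sketch also reflects the point that variables lose their values once the production has applied.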

Learning in ACT

This article is concerned with the processes underlying the acquisition of cognitive skill. As is clear from examples like Table 1, there is a closer connection in ACT between productions and skill performance than between declarative knowledge and skill performance. This is because the control over cognition and behavior lies directly in the productions. Facts are used by the productions. So in a real sense facts are instruments of the productions, which are the agents. For instance, we saw that Production P10 used the addition fact that 4 + 8 = 12. Although productions are closer to performance than facts, I will be claiming that when a person initially learns about a skill, he or she learns only facts about the skill and does not directly acquire productions. These facts are used interpretively by general-purpose productions. The first major section of this article, on the declarative stage, will illustrate how general-purpose productions can interpret these facts to generate performance of the skill.

The next major section of the article will discuss the evidence for and nature of the knowledge compilation process that results in the translation from a declarative base for a skill to a procedural base for the skill. (For instance, the production set in Table 1 is a procedural base for the addition skill.) After this section the remainder of the article will discuss two aspects of the continued improvement of a skill after it has achieved a procedural embodiment: one is the tuning of skill so that it is applied more appropriately; the second is the very lawful way in which application of the skill speeds up.

The Declarative Stage: Interpretive Procedures

One of the things that becomes apparent in studying the initial stages of skill acquisition in areas of mathematics like geometry or algebra (e.g., Neves, 1981) is that the instruction seldom if ever directly specifies a procedure to be applied. Still, the student is able to emerge from this type of instruction with an ability to generate behavior that reflects knowledge contained in the instruction. Figures 2, 3, and 4 from my work on geometry illustrate this point. Figure 2 is taken from the text of Jurgensen, Donnelly, Maier, and Rising (1975) and represents the total of that text's instruction on two-column proofs. Immediately after studying this, three of our students attempted to give reasons for two-column proof problems. The first such proof problem is the one illustrated in Figure 3. All three of the students were able to deal with this problem with some success. Behavior on these reason-giving problems is rather constant across subjects, at least at a global level. Figure 4 is a representation at the global level of these constancies. Clearly, nowhere in Figure 2 is there a specification of the flow of control that is in Figure 4. However, before reading the instruction of Figure 2, subjects were not capable of the flow of control in Figure 4, and after reading the instruction they were. So somehow the instruction in Figure 2 makes the procedure in Figure 4 possible.

1-7 Proofs in Two-Column Form

You prove a statement in geometry by using deductive reasoning to show that the statement follows from the hypothesis and other accepted material. Often the assertions made in a proof are listed in one column, and reasons which support the assertions are listed in an adjacent column.

EXAMPLE. A proof in two-column form.

Given: AKD; AD = AB
Prove: AK + KD = AB

Proof

STATEMENTS            REASONS
1. AKD                1. Given
2. AK + KD = AD       2. Definition of between
3. AD = AB            3. Given
4. AK + KD = AB       4. Transitive property of equality

Some people prefer to support Statement 4, above, with the reason The Substitution Principle. Both reasons are correct.

The reasons used in the example are of three types: Given (Steps 1 and 3), Definition (Step 2), and Postulate (Step 4). Just one other kind of reason, Theorem, can be used in a mathematical proof. Postulates and theorems from both algebra and geometry can be used.

Reasons Used in Proofs

Given (Facts provided for a particular problem)
Definitions
Postulates
Theorems that have already been proved

Figure 2. The text instruction for a two-column proof. (From Geometry by R. C. Jurgensen, A. J. Donnelly, J. E. Maier, and G. R. Rising. Boston: Houghton Mifflin, 1975, p. 25. Copyright 1975 by Houghton Mifflin Co. Reprinted by permission.)

Given that the instructions do not specify flow of control, the learners must call on existing procedures to direct their behavior in this task. These procedures must use the information specified in the instructions to guide their behavior. The instructional information is being used by these procedures in the same way Production P10 in Table 1 used the addition facts to guide its behavior. For this reason, I say that the knowledge about the skill is being used interpretatively. The term reflects the fact that the knowledge is data for other procedures in just the way a computer program is data for an interpreter.

Written Exercises

Copy everything shown. Complete the proof by writing reasons.

Given: RONY; RO ≅ NY
Prove: RN = OY

STATEMENTS               REASONS
1. RO ≅ NY               1. Given
2. RO = NY               2. Def. of ≅ segments
3. ON = ON               3. Reflexive prop. of equal.
4. RO + ON = ON + NY     4. Add. prop. of equal.
5. RONY                  5. Given
6. RO + ON = RN          6. Def. of between
7. ON + NY = OY          7. Def. of between
8. RN = OY               8. Substitution principle

Figure 3. A reason-giving task that is the first problem that the student encounters requiring use of the knowledge about two-column proofs. (This figure is taken from the instructors' copy of the text and shows the correct reasons. These reasons are not given in the students' text. From Geometry by R. C. Jurgensen, A. J. Donnelly, J. E. Maier, and G. R. Rising. Boston: Houghton Mifflin, 1975, p. 26. Copyright 1975 by Houghton Mifflin Co. Reprinted by permission.)

Figure 4. A flowchart showing the general flow of control in a reason-giving task.

The basic claim is that general interpretative procedures with no domain-specific knowledge can be applied to some facts about the domain and produce coherent and domain-appropriate behavior. On first consideration it was not obvious to many people, including myself, how it was possible to model novice behavior in a task like Figure 3 by a purely interpretive system. However, it is absolutely critical to the theory presented here that there be an interpretive system in ACT that accurately describes the behavior of a novice in a new domain. It is essential because the theory claims that knowledge in a new domain always starts out in declarative form and is used interpretively. Therefore, I took it as a critical test case to develop a detailed example of how a student, extracting only declarative information from Figure 2, could generate task-appropriate behavior in Figure 3. The example, described below, assumes some general-purpose working-backwards problem-solving techniques. It captures important aspects of the behavior of my three high school students on this problem.

I want to emphasize that I am not claiming that such general problem-solving procedures are the only way that learners can bridge the gap between instruction and behavior. There are probably many different types of interpretive procedures. From my work on geometry, it is clear, for instance, that procedures for making analogies between worked-out examples and new tasks serve as an additional, important means for bridging the gap.

Table 2
Interpretive Productions Evoked in Performing the Reason-Giving Task

P1. IF the goal is to do a list of problems,
    THEN set as a subgoal to do the first problem in the list.

P2. IF the goal is to do a list of problems
       and a problem has just been finished,
    THEN set as a subgoal to do the next problem.

P3. IF the goal is to do a list of problems
       and there are no unfinished problems on the list,
    THEN POP the goal with success.

P4. IF the goal is to write the name of a relation for an argument,
    THEN set as a subgoal to find what the relation is for the argument.

P5. IF the goal is to write the name of a relation for an argument
       and a name has been found,
    THEN write the name
       and POP the goal with success.

P6. IF the goal is to write the name of a relation for an argument
       and no name has been found,
    THEN POP the goal with failure.

P7. IF the goal is to find a relation
       and there is a list of methods for achieving the relation,
    THEN set as a subgoal to try the first method.

P8. IF the goal is to find a relation
       and there is a list of methods for achieving the relation
       and a method has just been unsuccessfully tried,
    THEN set as a subgoal to try the next method.

P9. IF the goal is to find a relation
       and there is a list of methods for achieving the relation
       and a method has been successfully tried,
    THEN POP the goal with success.

P10. IF the goal is to find a relation
       and there is a list of methods for achieving the relation
       and they have all proven unsuccessful,
    THEN POP the goal with failure.

P11. IF the goal is to try a method
       and that method involves establishing a relationship,
    THEN set as a subgoal to establish the relationship.

P12. IF the goal is to try a method
       and the subgoal was a success,
    THEN POP the goal with success.

P13. IF the goal is to try a method
       and the subgoal was a failure,
    THEN POP the goal with failure.

P14. IF the goal is to establish that a statement is among a list
       and the list contains the statement,
    THEN POP the goal with success.

P15. IF the goal is to establish that a statement is among a list
       and the list does not contain the statement,
    THEN POP the goal with failure.

P16. IF the goal is to establish that a line is implied by a rule in a set
       and the set contains a rule of the form "consequent if antecedents"
       and the consequent matches the line,
    THEN set as a subgoal to determine if the antecedents correspond to established statements
       and tag the rule as tried.

P17. IF the goal is to establish that a line is implied by a rule in a set
       and the set contains a rule of the form "consequent if antecedents"
       and the consequent matches the line
       and the antecedents have been established,
    THEN POP the goal with success.

P18. IF the goal is to establish that a line is implied by a rule in a set
       and there is no untried rule in the set that matches the line,
    THEN POP the goal with failure.

P19. IF the goal is to determine if antecedents correspond to established statements
       and there is an unestablished antecedent clause
       and the clause matches an established statement,
    THEN tag the clause as established.

P20. IF the goal is to determine if antecedents correspond to established statements
       and there are no unestablished antecedent clauses,
    THEN POP the goal with success.

P21. IF the goal is to determine if antecedents correspond to established statements
       and there is an unestablished antecedent clause
       and it matches no established statement,
    THEN POP the goal with failure.

Application of General Problem-Solving Methods

Even though students coming upon the instruction in Figure 2 have no procedures specific to doing two-column proof problems, they have procedures for solving problems in general, for doing mathematics-like exercises, and perhaps even for certain types of deductive reasoning. These general problem-solving procedures can use instruction such as that in Figure 2 as data for generating task-appropriate behavior when faced with a problem like that in Figure 3. They serve as the procedures for interpreting the task-specific example. Table 2 provides a list of the set of productions that embody the needed problem-solving procedures, and Figure 5 illustrates their flow of control when applied to problems like the one in Figure 2.

I will trace the application of this production set to the first two lines in Figure 3. It is assumed that the student encodes (declaratively) the exercise in Figure 3 as a list of subproblems, where each subproblem is to write a reason for a line of the proof. Production P1 from Table 2 matches to this list encoding of the problems. Therefore, P1 applies first and focuses attention on the first subproblem, that is, it sets as the subgoal to write a reason for RO ≅ NY. Next, Production P4 applies. P4's condition, "the goal is to write the name of a relation for an argument," matches the current subgoal "to write the name of the reason for RO ≅ NY." P4 creates the subgoal of finding a reason for the line. P4 is quite general and reflects the existence of a prior procedure for writing statements that satisfy a constraint.

The student presumably has encoded the boxed information in Figure 2 as indicating a list of methods for providing a reason for a line. If so, Production P7 applies next and sets as a subgoal to try givens (the first rule on the reason list) as a justification for the current line. Note this is one point where a fragment of the instruction is used by a general problem-solving procedure (in this case for searching a list of methods) to determine the course of behavior. Two of the students in fact went back and reviewed the methods when they started this problem.

Figure 5. A representation of the flow of control in Table 2 between the various goals. (Control starts with the top goal.)

The students I studied had extracted from the instruction in Figure 2 that the "given" reason is used when the line to be justified is among the givens of the problem. Note that this fact is not explicitly stated in the instruction but is strongly implied. Thus, it is assumed that the student has encoded the fact that "the givens method involves establishing that the statement is among the givens." Production P11 will match this fact in its condition and so will set as a subgoal to establish that RO ≅ NY is among the givens of the problem. Production P14 models the successful recognition that RO ≅ NY is among the givens and returns a success from the subgoal. That is, its action "POP the goal with success" tags the goal "to find RO ≅ NY among the givens" with success and sets as the current goal the higher goal of trying the givens method. Then P12 and P9 POP success back up to the next-to-top-level goal of writing a reason for the line. Then Production P5 applies to write given as a reason and POPs back to the top-level goal. In fact, all of the students likewise scanned the given list and had no difficulty with the first line of this proof.

At this point Production P2 applies to set the subgoal of writing a reason for the second line, RO = NY. Then Productions P4, P7, and P11 apply in that order, setting the subgoal of seeing whether RO = NY was among the givens of the problem. Production P15 recognizes this as a failed goal and then Production P13 returns control back to the level of choosing methods to establish a reason. Production P8 selects the definition reason to try next. Thus, the production set first unsuccessfully tries the given method before trying other methods. One of the students explicitly verbalized trying givens and failing. There was no indication in the protocol of our other students that givens was considered first, although it may have been considered implicitly.

Clearly, the instruction in Figure 2 contains no explanation of how a definition should be applied. However, the assumption of the text is that the student knows that a definition should imply the statement. There were some earlier exercises on conditional and biconditional statements that make this assumption at least conceivable. All three subjects knew that some inferencelike activity was required, but they had a faulty understanding of the nature of the application of inference to this task. In any case, assuming that the student knows as a fact (in contrast to a procedure) that use of definitions involves inferential reasoning, Production P11 will match in its condition the fact that "definitions involve establishing that the statement is implied by a definition," and P11 will set the subgoal of proving that RO = NY was implied by a definition.

At this point I have to leave these students behind momentarily and describe the ideal student. The textbook assumes that the student already has a functioning procedure for finding a rule that implies a statement by means of a set of established rules. Productions P16-P21 constitute such a procedure. None of my students had this procedure in its entirety. These productions work as a general inference-testing procedure and apply equally well to postulates and theorems as to definitions. Production P16 selects a conditional rule that matches the current line (the exact details of the match are not unpacked in Table 2). It is assumed that a biconditional definition is encoded as two implications each of the form "consequent if antecedent." The definition relevant to the current Line 2 is that two line segments are congruent if and only if they are of equal measure, which is encoded as

XZ = UV if XZ ≅ UV
and XZ ≅ UV if XZ = UV.

The first implication is the one that is selected, and the subgoal is set to establish the antecedent XZ ≅ UV (or RO ≅ NY, in the current instantiation). The production set P19-P21 describes a procedure for matching zero or more clauses in the antecedent of a rule. In this case P19 finds a match to the one condition, XZ ≅ UV, with RO ≅ NY in the first line. Then P20 POPs with success followed by successful POPping of P17, then P12, and then P9, which returns the system to the goal of writing out a reason for the line.
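The inference-testing procedure that Productions P16-P21 are described as embodying can be sketched in a few lines of Python. This is an illustrative sketch only, not the ACT implementation: the tuple encoding of statements, the convention that the names "XZ" and "UV" are rule variables, and the function names `match`, `antecedents_hold`, and `find_justification` are all assumptions introduced here.

```python
# Statements are (relation, left, right) tuples, e.g. ("cong", "RO", "NY").
# Hypothetical convention: these names, when used inside a rule, are variables.
VARIABLES = {"XZ", "UV"}

# The biconditional definition encoded as two "consequent if antecedent" rules.
RULES = [
    {"name": "definition of segment congruence",
     "consequent": ("equal", "XZ", "UV"),
     "antecedents": [("cong", "XZ", "UV")]},
    {"name": "definition of segment congruence",
     "consequent": ("cong", "XZ", "UV"),
     "antecedents": [("equal", "XZ", "UV")]},
]


def match(pattern, statement, bindings):
    """Match one pattern against one statement; return extended bindings or None."""
    if len(pattern) != len(statement):
        return None
    new = dict(bindings)
    for p, s in zip(pattern, statement):
        if p in VARIABLES:
            if new.setdefault(p, s) != s:   # variable already bound to something else
                return None
        elif p != s:                        # constants must match exactly
            return None
    return new


def antecedents_hold(antecedents, established, bindings):
    """Match each antecedent clause to some established line (cf. P19-P21)."""
    if not antecedents:
        return True                         # zero clauses left: success
    first, rest = antecedents[0], antecedents[1:]
    for line in established:
        extended = match(first, line, bindings)
        if extended is not None and antecedents_hold(rest, established, extended):
            return True
    return False


def find_justification(statement, established, rules=RULES):
    """Name a rule whose consequent matches the statement (cf. P16) and whose
    antecedents are all established; return None if there is no such rule."""
    for rule in rules:
        bindings = match(rule["consequent"], statement, {})
        if bindings is not None and antecedents_hold(
                rule["antecedents"], established, bindings):
            return rule["name"]
    return None


established = [("cong", "RO", "NY")]        # Line 1, a given
print(find_justification(("equal", "RO", "NY"), established))
# definition of segment congruence
```

The sketch tests the consequent first and only then the antecedents, which is the order the text attributes to P16 and P19-P21.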

Significant features of the example. I will not further trace the application of the production set to the example. I would like to identify, however, the essential aspects of how this production set allows the student to bridge the gap between instruction and the problem demands. Figure 5 illustrates the flow of control with each box being a level in the goal structure and serving as a subroutine. Although it is not transparent, the subgoal organization in Figure 5 results in the same flow of control as the flowchart organization of Figure 4. However, as the production rendition of Figure 5 establishes, the flow of control in Figure 5 is not something fixed ahead of time but rather emerges in response to the instruction and the problem statement.

The top level goal in Figure 5 of iterating through a list of problems is provided by the problem statement and, given the problem statement, it is unpacked into a set of subgoals to write statements indicating the reasons for each line. This top level procedure reflects a general strategy the student has for decomposing problems into linearly ordered subproblems. Then another prior routine sets as subgoals to find the reasons. At this point the instruction about the list of acceptable relationships is called into play (through yet another prior problem-solving procedure) and is used to set a series of subgoals to try out the various possible reasons for a statement. So the unpacking of subgoals in Figure 5 from "do a list of problems" to "find a reason" is in response to the problem statement; the further unpacking into the methods of givens, postulates, definitions, and theorems is in response to the instruction. The instruction is the source of information identifying that the method of givens involves searching the given list, and the other methods involve application of inferential reasoning. The ability to search a list for a match is assumed by the text, reasonably enough, as a prior procedure on the part of the student. The ability to apply inferential reasoning is also assumed as a prior procedure, but in this case the assumption is mistaken.

In summary, then, we see in Figure 5 a set of separate problem-solving procedures that are joined together in a novel combination by the declarative information in the problem statement and instruction. In this sense the student's general problem-solving procedures are interpreting the problem statement and instruction. Note that the problem statement and the instruction are brought into play by being matched as data in the conditions of the productions of Table 2.
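The idea that general procedures interpret the instruction as data can be sketched as follows. The sketch is an illustration under stated assumptions rather than the production set of Table 2: the instruction is stored as plain data (an ordered method list plus one fact per method), the order of the methods after "given" and "definition" is a guess, and the hypothetical functions `among_givens`, `implied_by_rule`, and `write_reasons` stand in for the student's prior procedures.

```python
# Declarative knowledge taken from the instruction (data, not procedure).
# Only "given" first and "definition" second are stated in the text.
METHODS = ["given", "definition", "postulate", "theorem"]
METHOD_FACTS = {
    "given": "search givens",      # establish that the line is among the givens
    "definition": "inference",     # establish that the line is implied by a rule
    "postulate": "inference",
    "theorem": "inference",
}


def among_givens(line, problem):
    """Prior procedure: search a list for a match."""
    return line in problem["givens"]


def implied_by_rule(line, method, problem, earlier_lines):
    """Prior procedure assumed by the text. Rules are ground
    (consequent, antecedents) pairs to keep the sketch short."""
    for consequent, antecedents in problem["rules"].get(method, []):
        if consequent == line and all(a in earlier_lines for a in antecedents):
            return True
    return False


def write_reasons(problem):
    """General procedures: one subgoal per line, methods tried in list order."""
    reasons, earlier = [], []
    for line in problem["lines"]:
        reason = None
        for method in METHODS:                 # search the list of methods
            kind = METHOD_FACTS[method]        # instruction matched as data
            if kind == "search givens":
                ok = among_givens(line, problem)
            else:
                ok = implied_by_rule(line, method, problem, earlier)
            if ok:
                reason = method
                break
        reasons.append(reason)
        earlier.append(line)
    return reasons


problem = {
    "givens": ["RO cong NY"],
    "lines": ["RO cong NY", "RO = NY"],
    "rules": {"definition": [("RO = NY", ["RO cong NY"])]},
}
print(write_reasons(problem))   # ['given', 'definition']
```

Changing `METHODS` or `METHOD_FACTS` changes the behavior without touching any function, which is the sense in which the flow of control emerges from the declarative information.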

Student understanding of implication. All three students that were studied had serious misunderstandings about how one determines whether a statement is implied by a rule, and some time was spent correcting each student's misconceptions. However, their misunderstandings did not become clear on Line 2. Their faulty understanding was sufficient for that line but the problems showed up on later lines. Thus, rather than a correct subroutine for inference application like the one embodied by Productions P16-P21, these students had subroutines that only sometimes produced the correct answer.

Two of the students thought that it was sufficient to determine that the consequent of the rule matched the to-be-justified statement and did not bother to test the antecedent. For them the subroutine call (subgoal setting) of Production P16 did not exist. One student argued, for instance, that Line 4 could be justified by the substitution principle of equality because that principle gave an equality as its consequent.

The third student had more exotic misunderstandings. This is also best illustrated in his efforts to justify Line 4 (RO + ON = ON + NY) in Figure 3. The student thought the transitive property of equality was the right justification for Line 4. The transitive property of equality is stated as "a = b, b = c, implies a = c." The student physically drew out the following correspondence between the antecedents of this postulate and the to-be-justified statement:

a   =   b ,   b   =   c
|   |   |     |   |   |
RO  +   ON =  ON  +   NY

That is, he found that he could put the variables of the antecedent in order with the terms of the statement. He noted that he also needed to match to a = c in the consequent of the transitive postulate but noted that a previous line had RO = NY, which, given the earlier variable matches, satisfied his need.

This student had at least two misunderstandings. First, he seemed unable to appreciate the tight constraints on pattern matching (e.g., one cannot match "=" against "+"). Second, he failed to appreciate that the consequent of the postulate should be matched to the statement, and the antecedent to earlier statements. Either he had it the other way around or, more likely, he did not think it mattered which way it was used. However, given the instruction he had to date, this is not surprising because none of this was specified.
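The first misunderstanding can be made concrete with a small token-level matcher. This is a hypothetical sketch, not a model of any student: it assumes that lower-case single letters are variables, that each variable binds to exactly one token, and that every other token, including "=" and "+", is a constant that must match exactly.

```python
def match_tokens(pattern, statement):
    """Return variable bindings if `pattern` matches `statement` token by
    token, otherwise None."""
    p_tokens, s_tokens = pattern.split(), statement.split()
    if len(p_tokens) != len(s_tokens):
        return None
    bindings = {}
    for p, s in zip(p_tokens, s_tokens):
        is_variable = len(p) == 1 and p.islower()
        if is_variable:
            if bindings.setdefault(p, s) != s:   # a variable keeps one value
                return None
        elif p != s:                             # "=" cannot match "+"
            return None
    return bindings


# A consequent of the form "a = c" matched against an earlier equality.
print(match_tokens("a = c", "RO = NY"))   # {'a': 'RO', 'c': 'NY'}
# The student's pairing of "a = b" with "RO + ON" fails on the operator.
print(match_tokens("a = b", "RO + ON"))   # None
```

Because the matcher binds one token per variable, it says nothing about patterns whose variables stand for whole expressions; it only illustrates that constants constrain the match.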

All students required remedial instruction. Thus, these errors created the opportunity for new learning. Although I have not analyzed this in detail, I believe that remedial instruction amounted to providing additional declarative information. This information could be used by other general procedures to provide interpretive behavior in place of the compiled procedures that Table 2 is assuming in Productions P16-P21. This is a simple form of debugging: When the instruction assumes precompiled procedures that do not exist, remedial instruction can correct the situation by providing the data for interpretive procedures.

The Need for an Initial Declarative Encoding

This section has been concerned with showing how students can generate behavior in a new domain when they do not have specific procedures for acting in that domain. Their knowledge of the domain is declarative and is interpreted by general procedures. One can argue that it is adaptive for a learning system to start out this way. New productions have to be integrated with the general flow of control in the system. Clearly, we are not in possession of an adequate understanding of our flow of control to form such productions directly. One of the reasons why instruction is often so inadequate is that the teacher likewise has a poor conception of the flow of control in the student. Attempts to directly encode new procedures, as in the Instructive Production System (Rychener & Newell, 1978; Rychener, Note 1), have run into trouble because of this problem of integrating new elements into a complex existing flow of control.

As an example of the problem with creating new procedures out of whole cloth, consider the use of the definition of congruence by the production set in Table 2 to provide a reason for the second line in Figure 3. One could build a production that would directly recognize the application of the definition to this situation rather than go through the interpretive rigmarole of Figure 5 (Table 2). This production would have the form

IF the goal is to give a reason for XY = UV
    and a previous line has XY ≅ UV,
THEN POP with success,
    and the reason is definition of segment congruence.

However, it is very implausible that the subject could know that this knowledge was needed in this procedural form before he or she stumbled on its use to solve Line 2 in Figure 3. Thus, ACT should not be expected to encode its knowledge into procedures until it has seen examples of how the knowledge is to be used.

Although new productions have to be created sometimes, forming new productions is potentially dangerous. Because productions have direct control over behavior, there is the ever-present danger that a new production may wreak great havoc in a system. Anyone who incrementally augments computer programs will be aware of this problem. A single erroneous statement can destroy the behavior of a previously fine program. In computer programming the cost is slight: one simply has to edit out the bugs the new procedure brought in. For an evolving creature the cost of such a failure might well be death. In the next section I will describe a highly conservative and adaptive way of entering new procedures.

As the examples reviewed in this section illustrate, declarative knowledge can have impact on behavior, but that impact is filtered through an interpretive system that is well oiled in achieving the goals of the system. This does not guarantee that new learning will not result in disaster, but it does significantly lower the probability. If a new piece of knowledge proves to be faulty, it can be tagged as such and so disregarded. It is much more difficult to correct a faulty procedure.

As a gross example suppose I told a gullible child, "If you want something, then you can assume it has happened." Translated into a production it would take on the following form:

IF the goal is to achieve X,
THEN POP with X achieved.

This would lead to a perhaps blissful but deluded child who never bothered to try to achieve anything because he or she believed it was already achieved. As a useful cognitive system he or she would come to an immediate halt. However, even if the child were gullible enough to encode this in declarative form at face value and perhaps even act upon it, he or she would quickly identify it as a lie (by contradiction procedures he or she has), tag it as such, and so prevent it from having further impact on behavior and continue on a normal life of goal achievement. New information should enter in declarative form because one can encode information declaratively without committing control to it and because one can be circumspect about the behavioral implications of declarative knowledge.

In earlier publications (e.g., Anderson et al., 1979, 1980) we proposed a designation process that allowed productions to be directly created. Basically, we could build an arbitrary knowledge structure in working memory and, by the action of a single designating production, convert that knowledge structure into a production. This was rightfully criticized (e.g., Norman, 1980) as too powerful computationally to be human. It meant, for instance, that one could directly commit to memory production rules for applying a novel procedure. For instance, a subject given a target set in the Sternberg (1969) task could designate specific productions to recognize each member of the set and so avoid any effect of set size. We were always aware of such problems with designation. For instance, in my discussion of induction in the 1976 ACT book (Anderson, 1976, Section 12.3), I was stubbornly avoiding such a process. However, a few years ago there seemed no way to construct a learning theory without such a mechanism. Now, thanks to the development of ideas about knowledge compilation, the designation mechanism is no longer necessary.

Knowledge Compilation

Interpreting knowledge in declarative form has the advantage of flexibility, but it also has serious costs in terms of time and working memory space. The process is slow because interpretation requires retrievals of declarative information from long-term memory and because the individual production steps of an interpreter are small in order to achieve generality. (For instance, the steps of problem refinement in Table 2 and Figure 5 were painfully small.) The interpretive productions require that the declarative information be represented in working memory and this can place a heavy burden on working memory capacity. Many of the subjects' errors and much of their slowness seem attributable to working memory errors. Students can be seen to repeat themselves over and over again as they lose critical intermediate results and have to recompute them. This section of the paper is devoted to describing the compilation processes by which the system goes from this interpretive application of declarative knowledge to procedures (productions) that directly apply the knowledge. By building up procedures to perform specific tasks like reason giving in geometry, a great deal of efficiency is achieved both in terms of time and working memory demands.

POSTULATE 14 (SAS POSTULATE). If two sides and the included angle of one triangle are congruent to the corresponding parts of another triangle, the triangles are congruent.

According to Postulate 14: If AB ≅ DE, AC ≅ DF, and ∠A ≅ ∠D, then △ABC ≅ △DEF.

Figure 6. Statement in the text of the side-angle-side (SAS) postulate. (From Geometry by R. C. Jurgensen, A. J. Donnelly, J. E. Maier, and G. R. Rising. Boston: Houghton Mifflin, 1975, p. 122. Copyright 1975 by Houghton Mifflin Co. Reprinted by permission.)

The Phenomenon of Compilation

One of the processes in geometry that I have focused on is how students match postulates against problem statements. Consider the side-angle-side (SAS) postulate whose presentation in the text is given in Figure 6. I followed a student through the exercises in the text that followed the section that contained this postulate and the side-side-side (SSS) postulate. The first problem that required use of SAS is illustrated in Figure 7. The following is the portion of his protocol where he actually called up this postulate and managed to put it in correspondence to the problem:

If you looked at the side-angle-side postulate [long pause] well RK and RJ could almost be [long pause] what the missing [long pause] the missing side. I think somehow the side-angle-side postulate works its way into here [long pause]. Let's see what it says: "two sides and the included angle." What would I have to have to have two sides. JS and KS are one of them. Then you could go back to RS = RS. So that would bring up the side-angle-side postulate [long pause]. But where would ∠1 and ∠2 are right angles fit in [long pause] wait I see how they work [long pause] JS is congruent to KS [long pause] and with Angle 1 and Angle 2 are right angles that's a little problem [long pause]. OK, what does it say, check it one more time: "If two sides and the included angle of one triangle are congruent to the corresponding parts." So I have got to find the two sides and the included angle. With the included angle you get Angle 1 and Angle 2. I suppose [long pause] they are both right angles, which means they are congruent to each other. My first side is JS is to KS. And the next one is RS to RS. So these are the two sides. Yes, I think it is the side-angle-side postulate.

Given: ∠1 and ∠2 are right angles; JS ≅ KS
Prove: △RSJ ≅ △RSK

Figure 7. The first proof-generation problem that a student encounters that requires application of the side-angle-side postulate.

After reaching this point there was still a long process by which the student actually went through writing out the proof, but this is the relevant portion in terms of assessing what goes into recognizing the relevance of SAS.

After a series of four more problems (two were solved by SAS and two by SSS), I came to the student's application of the SAS postulate for the problem illustrated in Figure 8. The method recognition portion of the protocol follows:

Right off the top of my head I am going to take a guess at what I am supposed to do: △DCK ≅ △ABK. There is only one of two and the side-angle-side postulate is what they are getting to.

A number of things seem striking about the contrast between these two protocols. One is that there has been a clear speedup in the application of the postulate. A second is that there is no verbal rehearsal of the statement of the postulate in the second case. I take this as evidence that the student is no longer calling a declarative representation of the problem into working memory. Note also in the first protocol that there are a number of failures of working memory (points where the student recomputed forgotten information). The third feature of difference is that in the first protocol there is a clear piecemeal application of the postulate by which the student is separately identifying every element of the postulate. This is absent in the second protocol. It gives the appearance of the postulate being matched in a single step. These three features (speedup, dropout of verbal rehearsal, and elimination of piecemeal application) are among the features that I want to associate with the processes of knowledge compilation.

The Mechanisms of Compilation

The knowledge compilation processes in ACT can be divided into two subprocesses. One, which is called composition, takes sequences of productions that follow each other in solving a particular problem and collapses them into a single production that has the effect of the sequence. The idea of composition was first developed by Lewis (1978). This produces considerable speedup by creating new operators that embody the sequences of steps that are used in a particular problem domain. The second process, proceduralization, builds versions of the productions that no longer require the domain-specific declarative information to be retrieved into working memory. Rather, the essential products of these retrieval operations are built into the new productions.

Given: ∠1 ≅ ∠2; AB ≅ DC; BK ≅ CK
Prove: △ABK ≅ △DCK

Figure 8. The fourth proof-generation problem that a student encounters that requires application of the side-angle-side postulate.

The technical details about how knowledge compilation is implemented are given in Neves and Anderson (1981). Here I will simply present the basic ideas and then focus on assessing some of the relevant literature. I will explain the basic processes with respect to an interesting example of telephone numbers. It has been noted (Anderson, 1976) that people develop special procedures for dialing frequently used telephone numbers. Sometimes declarative access to the number is lost and the only access one has to the number is through the procedure for dialing it.

Consider the following two productions that might serve to dial a telephone number:

P1. IF the goal is to dial LVtelephone-number
       and LVdigit1 is the first digit of LVtelephone-number,
    THEN dial LVdigit1.

P2. IF the goal is to dial LVtelephone-number
       and LVdigit1 has just been dialed
       and LVdigit2 is after LVdigit1 in LVtelephone-number,
    THEN dial LVdigit2.

Composition creates "macroproductions," which do the operation of a pair of productions that occurred in sequence. Applied to the sequence of Production P1 above followed by P2, composition would create

P1&P2. IF the goal is to dial LVtelephone-number
          and LVdigit1 is the first digit of LVtelephone-number
          and LVdigit2 is after LVdigit1,
       THEN dial LVdigit1 and then LVdigit2.

Compositions like this reduce the number of production applications needed to perform the task.
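A minimal sketch of composing P1 and P2, assuming a production is represented as lists of condition clauses and actions. The `Production` class, its `adds` field, and the `compose` function are names introduced here for illustration; the rule for dropping clauses (omit any condition of the second production that is already tested or that the first production's action guarantees) is a simplification, not the algorithm of Neves and Anderson (1981).

```python
from dataclasses import dataclass, field


@dataclass
class Production:
    name: str
    conditions: list          # clauses that must be matched in working memory
    actions: list             # steps carried out when the production applies
    adds: list = field(default_factory=list)   # clauses the actions make true


def compose(first, second):
    """Build one production with the effect of `first` followed by `second`."""
    conditions = list(first.conditions)
    for clause in second.conditions:
        # Skip clauses already tested, or guaranteed by the first action.
        if clause not in conditions and clause not in first.adds:
            conditions.append(clause)
    return Production(
        name=f"{first.name}&{second.name}",
        conditions=conditions,
        actions=first.actions + second.actions,
        adds=first.adds + second.adds,
    )


p1 = Production(
    "P1",
    ["the goal is to dial LVtelephone-number",
     "LVdigit1 is the first digit of LVtelephone-number"],
    ["dial LVdigit1"],
    adds=["LVdigit1 has just been dialed"],
)
p2 = Production(
    "P2",
    ["the goal is to dial LVtelephone-number",
     "LVdigit1 has just been dialed",
     "LVdigit2 is after LVdigit1 in LVtelephone-number"],
    ["dial LVdigit2"],
    adds=["LVdigit2 has just been dialed"],
)

p1_p2 = compose(p1, p2)
print(p1_p2.name)                              # P1&P2
for clause in p1_p2.conditions:
    print("   ", clause)
print("THEN", " and then ".join(p1_p2.actions))
```

The composed production has three condition clauses and two actions, so one production application replaces two.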

A composed production like P1&P2 still requires that the information (in this case, the phone number) be held in working memory. This information must be retrieved from long-term memory and matched to the second and third clauses in P1&P2. Proceduralization eliminates clauses in the condition of a production that require information to be retrieved from long-term memory and held in working memory. In the above production, P1&P2, the second and third condition clauses would be eliminated. The local variables that would have been bound in matching these clauses are replaced by the values they are bound to in the special case. So suppose this production is repeatedly applied in the dialing of Mary's telephone number, which is 432-2815. The local variables in P1&P2 would be bound as follows:

LVtelephone-number = Mary's number.
LVdigit1 = 4.
LVdigit2 = 3.

Substituting these values for the variables and eliminating the second and third condition clauses, we get

P1&P2*. IF the goal is to dial Mary's telephone number,
        THEN dial 4 and then 3.

By continued composition and proceduralization, a production can be built that dials the full number.

P*. IF the goal is to dial Mary's number,
    THEN dial 432-2815.

It should be emphasized that forming this production does not necessarily imply the loss of the declarative representation of the knowledge. The few instances reported of loss of declarative access to a telephone number probably reflect cases where the declarative knowledge ceases to be used and is simply forgotten.

An important consequence of proceduralization is that it reduces the load on working memory in that the long-term information need no longer be held in working memory. This makes it more likely that the system can simultaneously perform a second task that does make working memory demands. Of course, this is achieved by creating a procedure that is knowledge specific. The original Productions P1 and P2 could dial any telephone number. P* can only dial Mary's number.
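Proceduralization of the composed dialing production can be sketched with plain string substitution. The code is a hedged illustration only: which clauses count as requiring long-term memory retrieval is passed in by the caller (the `retrieved` argument is an assumption made here), those clauses are dropped, and the bound values replace the local variables.

```python
def substitute(text, bindings):
    """Replace each local variable in `text` by the value it is bound to."""
    # Longest names first, so no variable name clobbers part of a longer one.
    for var in sorted(bindings, key=len, reverse=True):
        text = text.replace(var, str(bindings[var]))
    return text


def proceduralize(conditions, actions, retrieved, bindings):
    """Drop the clauses listed in `retrieved` and build in the bound values."""
    new_conditions = [substitute(c, bindings)
                      for c in conditions if c not in retrieved]
    new_actions = [substitute(a, bindings) for a in actions]
    return new_conditions, new_actions


conditions = [
    "the goal is to dial LVtelephone-number",
    "LVdigit1 is the first digit of LVtelephone-number",
    "LVdigit2 is after LVdigit1",
]
actions = ["dial LVdigit1", "dial LVdigit2"]
bindings = {
    "LVtelephone-number": "Mary's number",
    "LVdigit1": 4,
    "LVdigit2": 3,
}

new_conditions, new_actions = proceduralize(
    conditions, actions, retrieved=conditions[1:], bindings=bindings)
print(new_conditions)   # ["the goal is to dial Mary's number"]
print(new_actions)      # ['dial 4', 'dial 3']
```

The result no longer mentions any variable, which is why it applies to Mary's number and to nothing else.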


It should be clear from this example how knowledge compilation produces the three phenomena noted in the protocols at the introduction of this section. The composition of multiple steps into one produces the speedup and leads to unitary rather than piecemeal application. The dropout of verbal rehearsal is a result of the fact that proceduralization eliminates the need to hold long-term memory information in working memory.

An important issue concerns the limits on the composition process. How many small productions can be combined to form a large one? The limit comes from the capacity of working memory. All the information in the production's condition must be active in working memory for the production to apply. If a composed production is created with a condition too large to be matched by working memory, that production will never apply and so will not be able to enter into further compositions. It should be noted that proceduralization serves to reduce the demands made by a production on working memory. Hence, proceduralized versions of productions may be able to enter into more compositions than nonproceduralized forms. However, even when proceduralized, the conditions of productions will grow with further compositions and hence there will be a limit on the amount of composition. I will discuss later the potential for practice to increase the capacity of working memory and so permit productions, which had been too large to match, to match and be composed.

Remarks About the Composition Mechanism

In the above discussion and in the computer implementation, the assumption has been that a pair of productions will be composed if they follow each other. This means that on repeated applications to the same problem, the number of productions should be halved each time. More generally, however, one might assume that the number of productions in each application is reduced to a proportion a of the previous application that involved composition. If a > 1/2, this might reflect that compositions are formed with probability less than 1. If a < 1/2, this might reflect the fact that composition involved more than a pair of productions. Thus, after n compositions the expected number of productions would be N·a^n, where N was the initial number. As will be argued later, the rate of composition (n) may not be linear in number of applications of the production set to problems.
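The expected count is easy to tabulate. The snippet below simply evaluates N·a^n; the starting value N = 64 and the three values of a are arbitrary choices for illustration, not estimates from data, and the function name is hypothetical.

```python
def expected_productions(initial, proportion, compositions):
    """Expected number of productions after n compositions: N * a**n."""
    return initial * proportion ** compositions


N = 64  # hypothetical initial number of productions
for a in (0.75, 0.5, 0.25):
    counts = [expected_productions(N, a, n) for n in range(4)]
    print(f"a = {a}: {counts}")
# a = 0.75: [64.0, 48.0, 36.0, 27.0]
# a = 0.5: [64.0, 32.0, 16.0, 8.0]
# a = 0.25: [64.0, 16.0, 4.0, 1.0]
```

With a = 1/2 the count is halved on each composition, which is the pairwise case described above.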

There is the opportunity for spurious pairs of productions to accidentally follow each other and so be composed together. If spurious pairs of productions were allowed to be composed together, there would not be disastrous consequences but it would be quite wasteful. Also, spurious productions might intervene between the application of productions that really belong together. So, for instance, suppose the following three productions had happened to apply in sequence:

P1. IF the subgoal is to add in a digit,
    THEN set as a subgoal to add the digit and the running total.

P2. IF I hear footsteps in the aisle,
    THEN the teacher is coming my way.

P3. IF the goal is to add two digits
       and a sum is the sum of the two digits,
    THEN the result is the sum
       and POP.

This sequence of productions might apply, for instance, as a child is performing arithmetic exercises in a class. The first and third are set to process subgoals in solving the problem. The first sets up the subgoal that is met by the third. The second production is not related to the other two and is merely an inference production that interprets sounds of the teacher approaching. It just happens to intervene between the other two. Composition as described would produce the following pairs:

P1&P2. IF the subgoal is to add in a digit
          and I hear footsteps in the aisle,
       THEN set as a subgoal to add the digit and the running total
          and the teacher is coming my way.

P2&P3. IF I hear footsteps in the aisle
          and the goal is to add two digits
          and a sum is the sum of the two digits,
       THEN the teacher is coming my way
          and the result is the sum
          and POP.

These productions are harmless but basically useless. They have also prevented formation of the following useful composition:

P1&P3. IF the subgoal is to add in a digit
          and the sum is the sum of the digit and the running total,
       THEN the result is the sum.

Therefore, it seems reasonable to advance a sophistication over the composition mechanism proposed in Neves and Anderson (1981). In this new scheme productions are composed only if they are linked by goal setting (as in the case of P1&P3), and productions that are linked by goal setting will be composed even if there are intervening productions that make no goal reference (as in the case of P2). This is an example where the learning mechanisms can profitably exploit the goal structuring of production systems.
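The revised pairing rule can be sketched as a pass over the sequence of production applications. The representation is an assumption made for illustration: each application records the goal it sets (`sets_goal`) and the goal it responds to (`meets_goal`), and an application with neither key is treated as making no goal reference. The function names are hypothetical.

```python
def adjacent_pairs(trace):
    """The earlier scheme: compose every pair that applies in sequence."""
    names = [step["name"] for step in trace]
    return list(zip(names, names[1:]))


def goal_linked_pairs(trace):
    """The revised scheme: pair a production that sets a subgoal with the
    production that meets it, skipping steps that make no goal reference."""
    pairs = []
    pending = None                       # last step that set a subgoal
    for step in trace:
        if pending and step.get("meets_goal") == pending["sets_goal"]:
            pairs.append((pending["name"], step["name"]))
            pending = None
        if step.get("sets_goal"):
            pending = step
        # A step with no goal reference leaves `pending` untouched.
    return pairs


trace = [
    {"name": "P1", "sets_goal": "add the digit and the running total"},
    {"name": "P2"},                                  # footsteps inference
    {"name": "P3", "meets_goal": "add the digit and the running total"},
]
print(adjacent_pairs(trace))      # [('P1', 'P2'), ('P2', 'P3')]
print(goal_linked_pairs(trace))   # [('P1', 'P3')]
```

The sketch tracks only a single pending subgoal; a fuller version would need a goal stack to handle nested subgoals.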

Phenomena Explained by Knowledge Compilation

In addition to the three qualitative features of skill development (speedup, unitary application, elimination of verbal rehearsal), knowledge compilation can help explain some of the more provocative results in the experimental literature. In this section, I would like to focus on three such results: disappearance of set size effects in the Sternberg (1969) paradigm, disappearance of set size and display size effects in the scan task (Shiffrin & Schneider, 1977), and the Einstellung effect (Luchins, 1942) in problem solving.

Before we get into the specifics of ACT's explanation of these three phenomena, it should be acknowledged that there already exist other explanations of these phenomena in the literature. However, ACT's explanation relies on mechanisms that were not fashioned to address these phenomena. Rather, they were fashioned by Neves and Anderson (1981) to address various phenomena associated with the speedup of postulate application in geometry. Thus, the fact that the compilation mechanisms extend to these other phenomena is an important demonstration of the generality of the theory.

The Sternberg paradigm. In the Sternberg paradigm (e.g., Sternberg, 1969) subjects are asked to indicate if a probe comes from a small set of items. The classic result is that decision time increases with set size. It has been shown that effects of size of memory set can diminish with repeated practice (Briggs & Blaha, 1969). A sufficient condition for this to occur is that the same memory set be used repeatedly. The following are two productions that a subject might use for performing the scan task at the beginning of the experiment:

PA. IF the goal is to recognize LVprobe
       and LVprobe is an LVtype
       and the memory set contains an LVitem of LVtype,
    THEN say "yes"
       and POP the goal.

PB. IF the goal is to recognize LVprobe
       and LVprobe is an LVtype
       and the memory set does not contain an LVitem of LVtype,
    THEN say "no"
       and POP the goal.

In the above, LVprobe and LVitem will match tokens of letters and LVtype will match a particular letter type (e.g., the letter A). This production set is basically the same as the production system for the Sternberg task given in Anderson (1976) except that it is in a somewhat more readable form that will expose the essential character of the processing. These productions require that the contents of the memory set be held active in working memory. As discussed in Anderson (1976), the more items required to be held active in working memory, the lower the activation of each and the slower the recognition judgment, which produces the typical set-size effect.

Repeated practice of those productions on the same memory set will produce an eventual elimination of set-size effects. Consider, to be concrete, a situation where these productions are repeatedly applied on the same list, say a list consisting of A, J, and N with foils coming from a list of L, B, and K. Through proceduralization we would then get the following productions from PA:

P1. IF the goal is to recognize LVprobe
       and LVprobe is an A,
    THEN say "yes"
       and POP the goal.
P2. IF the goal is to recognize LVprobe
       and LVprobe is a J,
    THEN say "yes"
       and POP the goal.
P3. IF the goal is to recognize LVprobe
       and LVprobe is an N,
    THEN say "yes"
       and POP the goal.


The preceding are productions for recognizing the positive set. Specific productions would also be produced by proceduralization from PB to reject the foils:

P4. IF the goal is to recognize LVprobe
       and LVprobe is an L,
    THEN say "no"
       and POP the goal.
P5. IF the goal is to recognize LVprobe
       and LVprobe is a B,
    THEN say "no"
       and POP the goal.
P6. IF the goal is to recognize LVprobe
       and LVprobe is a K,
    THEN say "no"
       and POP the goal.


It is interesting to note here that Shiffrin and Dumais (1981) report that the automatization effect they observe in such tasks is as much due to subjects' ability to reject specific foils as it is their ability to accept specific targets. These productions no longer require the memory set to be held in working memory and will apply in a time independent of memory set size. However, there still may be some effect of set size in the subject's behavior. These productions do not replace PA and PB; rather, they coexist and it is possible for a classification to proceed by the original PA and PB. Thus, we have two parallel bases for classification racing, with the judgment being determined by the fastest one. This will produce a set-size effect that will diminish as P1-P6 become strengthened.
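The difference between the general productions and their proceduralized descendants can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not ACT's implementation: the names `SpecificProduction`, `general_recognize`, `proceduralize`, and `specific_recognize` are hypothetical, and activation, timing, and the race between the two bases for classification are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecificProduction:
    """A proceduralized rule, e.g. IF LVprobe is an A THEN say "yes"."""
    letter: str
    response: str


def general_recognize(probe, memory_set):
    """PA/PB: the decision consults the memory set held in working memory."""
    return "yes" if probe in memory_set else "no"


def proceduralize(memory_set, foils):
    """Create one constant-bound production per practiced target and foil."""
    rules = {}
    for letter in memory_set:
        rules[letter] = SpecificProduction(letter, "yes")  # derived from PA
    for letter in foils:
        rules[letter] = SpecificProduction(letter, "no")   # derived from PB
    return rules


def specific_recognize(probe, rules):
    """Match on the probe alone; None means no specific rule applies."""
    rule = rules.get(probe)
    return rule.response if rule is not None else None


memory_set = ("A", "J", "N")
foils = ("L", "B", "K")
rules = proceduralize(memory_set, foils)

for probe in "AJNLBK":
    # On practiced items the specific rules give the same answers as PA/PB,
    # but without referring to the memory set at all.
    assert specific_recognize(probe, rules) == general_recognize(probe, memory_set)
```

Note that `specific_recognize` never receives the memory set as an argument, which is the sense in which the proceduralized rules are independent of set size. An unpracticed probe falls through to `None`, at which point the general productions would still be needed.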

The scan task. Shiffrin and Schneider (1977) report an experiment in which they gave subjects a set of items to remember. Then subjects were shown a series of displays in rapid succession, each display containing a set of items. The subjects' task was to decide if any of the displays contained an item in the memory set. When Shiffrin and Schneider kept both the members of the study set and the distractors constant, they found considerable improvement with practice in subjects' performance on the task. They interpreted their results as indicating diminishing effects of both memory set size and the number of alternatives in the display.

Again ACT's knowledge compilation mechanisms can be shown to predict the result. Consider what a production set might be like that scanned an array to see if any member of the array matched a memory set item:

PC*. IF the goal is to see if LVarray contains a memory item
       and LVprobe is in POSITION*,
    THEN the subgoal is to recognize LVprobe.
PD. IF the goal is to recognize LVprobe
       and LVprobe is an LVtype
       and the memory set contains LVitem of LVtype,
    THEN tag the goal as successful
       and POP the goal.
PE. IF the goal is to recognize LVprobe
       and LVprobe is an LVtype
       and the memory set does not contain an LVitem of LVtype,
    THEN tag the goal as failed
       and POP the goal.
PF. IF the goal is to see if LVarray contains a memory item
       and there is a successful subgoal,
    THEN say "yes"
       and POP the goal.
PG. IF the goal is to see if LVarray contains a memory item,
    THEN say "no"
       and POP the goal.

Production PC* is a schema for a set of productions such that each one would recognize an item in a particular position. An example might be

    IF the goal is to see if LVarray contains a memory item
       and LVprobe is in the upper right corner,
    THEN set as a subgoal to recognize LVprobe.

PD and PE are similar to PA and PB given earlier; they check whether each position focused by PC* contains a match. PF will apply if one of the probes leads to a successful match, and the default production PG will apply if none of the positions leads to success. The behavior of this production set is one in which individual versions of PC* apply serially, focusing attention on individual positions. PD and PE are responsible for the judgment of individual probes. This continues until a positive probe is hit and PF applies or until there are no more probe positions and PG applies. (PG will only be selected when there are no more positions because specificity will prefer PC* and PF over it.) Because of the need to keep the memory set active, an effect of set size is expected. The serial examination of positions produces an effect of display size. These two factors should be multiplicative, which is what Schneider and Shiffrin (1977) report.

Consider what will happen with knowledge compilation. Composing a PC* production with PD and with PF and proceduralizing, we will get positive productions of the form

P7. IF the goal is to see if LVarray contains a memory item
       and the upper right-hand position contains an LVprobe
       and the LVprobe is an A,
    THEN say "yes"
       and POP the goal.

The negative production would be formed by composing together a sequence of PC* productions paired with PE and a final application of PG. All the subgoal setting and POPping would be composed out. The strict composition of this sequence would be productions like

P8. IF the goal is to see if LVarray contains a memory item
       and the upper left-hand position contains an LVprobe1
       and LVprobe1 is a K
       and the upper right-hand position contains an LVprobe2
       and LVprobe2 is a B
       and the lower left-hand position contains an LVprobe3
       and LVprobe3 is an L
       and the lower right-hand position contains an LVprobe4
       and LVprobe4 is a K,
    THEN say "no"
       and POP the goal,

where a separate such production would have to be formed for each possible foil combination. These productions would not be affected by set size or probe size. This is consistent with the Schneider and Shiffrin (1977) findings.

The Einstellung phenomenon. Another phenomenon that is predictable from knowledge compilation processes is the Einstellung effect (Luchins, 1942; Luchins & Luchins, 1959) in problem solving. One of the examples used by Luchins to demonstrate this phenomenon is illustrated in Figure 9 (a).

(a) [Problem statement not legible in this copy.]

(b) Given: AC ≅ CD, BC ≅ CE, AB ≅ DE

    Prove: ∠BCA ≅ ∠DCE

Figure 9. After solving a series of problems like (a), students are more likely to choose the nonoptimal solution for (b).

Luchins presented his subjects with a sequence of geometry problems like the one in Figure 9 (a). For each problem in the sequence, the student had to prove that two triangles were congruent in order to prove that two angles were congruent. Then subjects were given a problem like the one in Figure 9 (b). Subjects proved this by means of congruent triangles even though it has a much simpler proof by means of vertical angles. Subjects not given the initial experience with problems like the one in Figure 9 (a) show a much greater tendency to use the vertical angle proof. Their experimental experience caused subjects to solve the problem in a nonoptimal way.

Lewis (1978) has examined the Einstellung effect and its relation to the composition process. He defines as perfect composites compositions that do not change the behavior of the system but just speed it up. Such compositions cannot produce Einstellung effects, of course. However, he notes that there are a number of natural ways to produce nonperfect composites that produce Einstellung effects. The ACT theory provides an example of such a nonperfect composition process. Composites are nonperfect in ACT because of its conflict-resolution principles.

Productions P1 through P4 provide a model of part of the initial state of the student's production system.

P1. IF the goal is to prove ∠XYZ ≅ ∠UVW
       and the points are ordered X, Y, and W on a line
       and the points are ordered Z, Y, and U on a line,
    THEN this can be achieved by vertical angles
       and POP the goal.
P2. IF the goal is to prove ∠XYZ ≅ ∠UVW,
    THEN set as subgoals:
       1. To find a triangle that contains ∠XYZ,
       2. To find a triangle that contains ∠UVW,
       3. To prove the two triangles congruent, and
       4. To use corresponding parts of congruent triangles.
P3. IF the goal is to find a figure that has a relation to an object
       and Figure X has the relation to the object,
    THEN the result is Figure X
       and POP the goal.
P4. IF the goal is to prove ΔXYZ ≅ ΔUVW
       and XY ≅ UV
       and YZ ≅ VW
       and ZX ≅ WU,
    THEN this can be achieved by SSS
       and POP the goal.

Production P1 is responsible for immediately recognizing the applicability of the vertical angle postulate. Productions P2-P4 are part of the production set that is responsible for proof through the route of corresponding parts of congruent triangles. Production P2 decomposes the main goal into the subgoals of finding the containing triangles, of proving they are congruent, and then of using the corresponding-parts principle. P3 finds the containing triangles, and P4 encodes one production that would recognize triangle congruence. This production set, applied to a problem like that in Figure 9 (b), would lead to a solution by vertical angles. This is because Production P1, for vertical angles, is more specific in its condition than is Production P2, which starts off the corresponding-angles proof. As explained earlier, ACT's conflict resolution prefers specific productions.

Consider, however, what would happen after Productions P2-P4 had been exercised on a number of problems and composition had taken place. Production P2&P3&P3&P4 represents a composition of the sequence P2, then P3, then P3, and then P4. Its condition is not less specific than that of P1 and, in fact, contains more clauses. However, because these clauses are not a superset of P1's clauses, it is not the case that either production is technically more specific than the other. They are both in potential conflict and, because both change the goal state, application of one will block the application of the other. In this case strength serves as the basis for resolving the conflict. Production P2&P3&P3&P4, because of its recent practice, may be stronger and therefore might be selected.

P2&P3&P3&P4. IF the goal is to prove ∠XYZ ≅ ∠UVW
       and ∠XYZ is part of ΔXYZ
       and ∠UVW is part of ΔUVW
       and XY ≅ UV
       and YZ ≅ VW
       and ZX ≅ WU,
    THEN set ΔXYZ ≅ ΔUVW
       and set as a subgoal to use corresponding parts of congruent triangles.
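The conflict-resolution argument can be made concrete with a small Python sketch. It assumes, as a simplification, that a production is more specific than another only when its condition clauses form a proper superset of the other's, and that strength decides among productions not ordered by specificity. The class `Production`, the functions `more_specific` and `resolve`, the clause strings, and the strength values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Production:
    name: str
    clauses: frozenset   # condition clauses, treated as opaque strings
    strength: float      # hypothetical strength value


def more_specific(a, b):
    """a is more specific than b if a's clauses are a proper superset of b's."""
    return a.clauses > b.clauses


def resolve(candidates):
    """Prefer specificity; among productions it does not order, prefer strength."""
    undominated = [p for p in candidates
                   if not any(more_specific(q, p) for q in candidates)]
    return max(undominated, key=lambda p: p.strength)


GOAL = "goal is to prove angle XYZ congruent to angle UVW"
p1 = Production("P1", frozenset({GOAL,
                                 "X, Y, W ordered on a line",
                                 "Z, Y, U ordered on a line"}), strength=1.0)
p2 = Production("P2", frozenset({GOAL}), strength=2.0)
composite = Production("P2&P3&P3&P4", frozenset({GOAL,
                                                 "angle XYZ is part of triangle XYZ",
                                                 "angle UVW is part of triangle UVW",
                                                 "XY congruent to UV",
                                                 "YZ congruent to VW",
                                                 "ZX congruent to WU"}), strength=3.0)

before = resolve([p1, p2])             # P1 blocks P2 by specificity
after = resolve([p1, p2, composite])   # neither P1 nor the composite dominates
print(before.name, after.name)
```

Before composition P1 wins even though P2 is given the higher strength, because P1's clauses include P2's. After composition the composite and P1 are not ordered by specificity, so the (assumed) greater strength of the recently practiced composite decides the outcome.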

This example illustrates how practice can change the specificity ordering of productions through composition and how it can directly change the strength. These two factors (change of specificity and change of strength) can cause ACT's conflict-resolution mechanism to change the behavior of the system, producing Einstellung effects. Under this analysis it can be seen that the Einstellung effect is an aberrant phenomenon reflecting what is basically an adaptive adjustment on the system's part. Through strength and composition ACT is unitizing and favoring sequences of problem-solving behaviors that have been recently successful. It is a good bet that such sequences will prove useful again. It is to the credit of the cleverness of Luchins's design (1942; Luchins & Luchins, 1959) that it exposed the potential cost of these usually beneficial adaptations.

It has been suggested that one could produce the Einstellung effect by simply strengthening particular productions. So one might suppose that Production P2 is strengthened over P1. The problem with this explanation is that subjects can be shown to have a preference for a particular sequence of productions, not merely single productions in isolation. Thus, in the water jug problems described by Luchins (1942), subjects will fixate on a specific sequence of operators implementing a subtraction method and will not notice other simpler subtraction methods. The composition mechanism explains how the subject encodes this operator sequence.


It is interesting to compare the time scale for producing Einstellung effects with the time scale for producing the automatization effects in the Sternberg paradigm and the scan paradigm. Strong Einstellung effects can be produced after a half-dozen trials, whereas the automatization results require hundreds of trials. This suggests that the composition that underlies the Einstellung effect can proceed more rapidly than the proceduralization that underlies the automatization effects. Proceduralization is really more responsible for creating domain-specific procedures than is composition. Composition creates productions that encode the sequence of general productions for a task, but the composed productions are still general. In contrast, by replacing variables with domain constants, proceduralization creates productions that are committed to a particular task. Apparently, the learning system is reluctant to create this degree of specialization unless there is ample evidence that the task will be repeated frequently.

The Adaptive Value of Knowledge Compilation

In the previous section on initial encoding, it was argued that it is dangerous for a system to directly create productions to embody knowledge. For this reason and for a number of others, it was argued that knowledge should first be encoded declaratively and then interpreted. This declarative knowledge could affect behavior, but only indirectly, via the intercession of existing procedures for correctly interpreting that knowledge. We have in the process of composition and proceduralization a means of converting declarative facts into production form.

It is important to note that productions created from compilation really do not change the behavior of the system, except in terms of possible reorderings of specificity relations as noted in our discussion of the Einstellung effect. Thus, knowledge compiled in this way has many of the same safeguards built into it that interpretive application of the knowledge does. The safety in interpretive applications is that a particular piece of knowledge does not impact on behavior until it has undergone the scrutiny of all the system's procedures (which can, for instance, detect contradiction of facts or of goals). Because compilation only operates on successful sequences of productions that pass this scrutiny, it tends to produce only production embodiments of knowledge that pass that scrutiny. This is the advantage of learning from doing. Another advantage of interpretive application is that the use of the knowledge is forced to be consistent with existing conventions for passing control among goals. By compiling from actual use of this knowledge, the compiled productions are guaranteed to be likewise consistent with the system's goal structure.

We can understand why human compilation is gradual (in contrast to computer compilation) and occurs as a result of practice if we consider the difference between the human situation and the typical computer situation. For one thing the human does not know what is going to be procedural in an instruction until he or she tries to use the knowledge in the instruction. In contrast, the computer has built in the difference between program and data. Another reason for gradual compilation is to provide some protection against the errors that enter into a compiled procedure because of the omission of conditional tests. For instance, if the system is interpreting a series of steps that include pulling a lever, it can first reflect on the lever-pulling step to see if it involves any unwanted consequences in the current situation. These tests will be in the form of productions checking for error conditions. (These error-checking productions can be made more specific so that they would take precedence over the normal course of action.) When that procedure is totally compiled, the lever pulling will be part of a prepackaged sequence of actions with many conditional tests eliminated (see the discussion of the Einstellung effect). If the procedure transits gradually between the interpretive and compiled stages, it is possible to detect the erroneous compiling out of a test at a stage where the behavior is still being partially monitored interpretively and can be corrected. It is interesting to note here the folk wisdom that most errors in acquisition of a skill, like airplane flying, occur neither with the novices nor with experts. Rather, they occur at intermediate stages of learning. This is presumably where the conversion from declarative to procedural is occurring and the point where unmonitored mistakes might slip into the performance. So by making compilation gradual, one does not eliminate the possibility of error, but one does reduce the probability.

Procedural Learning: Tuning

Much learning goes on after a skill has been compiled into a task-specific procedure, and this learning cannot just be attributed to further speedup due to more composition. One type of learning involves an improvement in the choice of method by which the task is performed. All tasks can be characterized as having a search associated with them, although in some cases the search is trivial. By search I mean that there are alternate paths of steps by which the problem can be tackled, and the subject must choose among them. Some of these paths lead to no solution and some lead to more complex solutions than necessary. A clear implication of much of the novice-expert research (e.g., Larkin, McDermott, Simon, & Simon, 1980) is that with high levels of expertise in a task domain, the problem solver becomes much more judicious in his choice of paths and may fundamentally alter his method of search. In terms of the traditional learning terminology, the issue is similar to, though by no means identical to, the issue of trial and error versus insight in problem solving. A novice's search of a problem space is largely a matter of trial-and-error exploration. With experience the search becomes more selective and more likely to lead to rapid success. I refer to the learning underlying this selectivity as tuning. My use of the term is quite close to that of Rumelhart and Norman (1978).

In 1977 we proposed a set of three learning mechanisms that still serves as the basis for much of our work on the tuning of search (Anderson, Kline, & Beasley, Note 2). There was a generalization process by which production rules become broader in their range of applicability, a discrimination process by which the rules become narrower, and a strengthening process by which better rules are strengthened and poorer rules weakened. These ideas have nonaccidental relations to concepts in the traditional learning literature, but as we will see they have been somewhat modified to be computationally more adequate. One can think of production rules as implementing a search, where individual rules correspond to individual operators for expanding the search space. Generalization and discrimination serve to produce a "metasearch" over the production rules, looking for the right features to constrain the application of these productions. Strength serves as an evaluation for the various constraints produced by the other two processes.

In this section I will illustrate how these three central learning constructs operate in the ACT system with examples from language acquisition. It is a major claim of the theory that these learning mechanisms will apply equally well to domains as diverse as language processing and geometry proof generation. Elsewhere we have discussed how these processes apply to schema abstraction or prototype formation (Anderson et al., 1979) and to proof generation (Anderson, Greeno, Kline, & Neves, 1981). That they apply in such diverse circumstances is important evidence for the generality of these mechanisms.

In Anderson (1981a) I focused on the issue of language acquisition per se and that article should be consulted for a fuller discussion of these issues. Language acquisition is being used here to illustrate the basic learning mechanisms. The examples here will concern how production rules are acquired to generate the correct syntactic structures. Unlike some of the other domains (e.g., proof generation), productions for language generation seldom result in one path for generation being tried and then a second path being tried after the first fails. Rather, in the language generation case, if the wrong generation path is followed, a faulty syntactic structure will be generated. Thus, errors of choice or search in language generation result in incorrect generations. In proof generation they would more likely result in longer times to reach a solution.


Generalization

The ability to perform successfully in novel situations is the hallmark of human cognition. For example, productivity has often been identified as the most important feature of natural languages, where this refers to the speaker's ability to generate and comprehend utterances never before encountered. Traditional learning theories have been criticized because of their inability to account for this productivity (e.g., McNeill, 1968), and it was one of our goals in designing ACT to avoid this sort of criticism.

An example. ACT's generalization algorithm looks for similarities between a pair of productions and creates a new production rule that captures what these individual production rules have in common. Consider the following pair of rules for language generation that might arise as the consequence of compiling productions to encode specific instances of phrases:

P1. IF the goal is to indicate that a coat belongs to me,
    THEN say "My coat."
P2. IF the goal is to indicate that a ball belongs to me,
    THEN say "My ball."

From these two production rules, ACT can form the following generalization:

P3. IF the goal is to indicate that LVobject belongs to me,
    THEN say "My LVobject,"

in which the variable LVobject has replaced the particular object. The rule now formed is productive in the sense that it will fill in the LVobject slot with any object. Of course, it is just this productivity in child speech that has been commented on, at least since Braine (1963). It is important to note that the general production does not replace the original two and that the original two will continue to apply in their special circumstances.

The basic function of the ACT generalization process is to extract from different special productions what they have in common. These common aspects are embodied in a production that will apply in new situations where original special productions do not apply. Thus, the claim for the ACT generalization mechanism is that transfer is facilitated if the same components are taught in two procedures so generalization can occur. So, for instance, transfer to a new text editor will be more facilitated if one has studied two other text editors than if one has studied only one.

Another example. The example above does not illustrate the full complexity at issue in forming generalizations. The following is a fuller illustration of the complexity:

P4. IF the goal is to indicate the relation in (LVobject1 chase LVobject2)
       and LVobject1 is dog
       and LVobject1 is singular
       and LVobject2 is cat
       and LVobject2 is plural,
    THEN say "CHASES."
P5. IF the goal is to indicate the relation in (LVobject3 scratch LVobject4)
       and LVobject3 is cat
       and LVobject3 is singular
       and LVobject4 is dog
       and LVobject4 is plural,
    THEN say "SCRATCHES."
P6. IF the goal is to indicate the relation in (LVobject1 LVrelation LVobject2)
       and LVobject1 is singular
       and LVobject2 is plural,
    THEN say "LVrelation + s."

P6 is the generalization that would be formed from P4 and P5. It illustrates that clauses can be deleted in a generalization as well as variables introduced (in this case LVrelation). In this example, the generalization has been made that the verb inflection does not depend on the category of the subject or of the object and does not depend on the verb. This generalization remains overly specific in that the rule still tests whether the object is plural; this is something the two examples have in common. Further generalization would be required to delete this unnecessary test. On the other hand, the generalized rule does not test for present tense and so is overly general. This is because this information was not represented in the original productions. The discrimination process (to be described later) can bring in this missing information.
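The two operations involved in forming P6, clause deletion and variable introduction, can be sketched as follows. This is a minimal illustration, not the algorithm defined in the technical papers: productions are represented as plain dictionaries, the variables of the two rules are assumed to have been aligned already (P5's LVobject3 and LVobject4 are written as LVobject1 and LVobject2), the spoken form is encoded as a (stem, suffix) pair, and the function name `generalize` and its `fresh_names` parameter are hypothetical.

```python
def generalize(p, q, fresh_names):
    """Return the common generalization of two variable-aligned productions."""
    names = iter(fresh_names)
    mapping = {}  # (constant in p, constant in q) -> shared new variable

    def merge(a, b):
        # Identical terms are kept; differing constants become one variable.
        if a == b:
            return a
        if (a, b) not in mapping:
            mapping[(a, b)] = next(names)
        return mapping[(a, b)]

    goal = tuple(merge(a, b) for a, b in zip(p["goal"], q["goal"]))
    action = tuple(merge(a, b) for a, b in zip(p["action"], q["action"]))
    tests = p["tests"] & q["tests"]  # clause deletion: keep only shared tests
    return {"goal": goal, "tests": tests, "action": action}


p4 = {
    "goal": ("LVobject1", "chase", "LVobject2"),
    "tests": frozenset({("LVobject1", "dog"), ("LVobject1", "singular"),
                        ("LVobject2", "cat"), ("LVobject2", "plural")}),
    "action": ("chase", "+ s"),
}
p5 = {
    "goal": ("LVobject1", "scratch", "LVobject2"),
    "tests": frozenset({("LVobject1", "cat"), ("LVobject1", "singular"),
                        ("LVobject2", "dog"), ("LVobject2", "plural")}),
    "action": ("scratch", "+ s"),
}

p6 = generalize(p4, p5, fresh_names=["LVrelation"])
print(p6["goal"], sorted(p6["tests"]), p6["action"])
```

The result keeps the test that the object is plural, because both source rules share it, and has no test for tense, because neither source rule represented it. These are the same over-specific and over-general properties noted for P6.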

The technical work defining generalization in ACT is given in Anderson et al. (1980) and similar definitions are to be found in Hayes-Roth and McDermott (1976) and Vere (1978). I will skip these technical definitions here for brevity's sake. The basic generalization process is clear without them.

Comparisons to earlier conceptions of generalization. The process of production generalization clearly has similarities to the process of stimulus generalization in earlier learning theories (for a review see Heinemann & Chase, 1975), but there are also clear differences. Past theories frequently proposed that a response conditioned to one stimulus would generalize to stimuli similar on various dimensions. So, for instance, a bar press conditioned to one tone would tend to be evoked by other tones of similar pitch and loudness. An important feature of this earlier conception is that generalization was an automatic outcome of a single learned connection and did not require any further learning. Learning in these theories was all a matter of discrimination, that is, of restricting the range of the learned response. In contrast, in the ACT theory generalization is an outcome of comparing two or more learned rules and extracting what they have in common. Thus, it requires additional learning over and above the learning of the initial rules, and it depends critically on the relation between the rules learned. There is evidence (Elio & Anderson, 1981) for ACT's stronger assumption that generalization depends on the interitem similarity among the learning experiences as well as the similarity of the test situation to the learning experiences.

Another clear difference between generalization as presented here and many earlier generalization theories is that the current generalization proposed is structural and involves clause deletion and variable creation rather than the creation of ranges on continuous dimensions. We have focused on structural generalizations because of the symbolic domains that have been our concern. However, these generalization mechanisms can be extended to apply to generalization over intervals on continuous dimensions (Brown, 1977; Larson & Michalski, 1977). ACT's generalization ideas are much closer to what happens in stimulus-sampling theory (Burke & Estes, 1957; Estes, 1950), where responses conditioned to one set of stimulus elements can generalize to overlapping sets. This is the same as the notion in ACT of generalization on the basis of clause overlap. However, there is nothing in stimulus-sampling theory that corresponds to ACT's generalization by replacing constants in clauses with variables. This is because stimulus-sampling theory does not have the representational construct of propositions with arguments.

Discrimination

Just as it is necessary to generalize overly specific procedures, so it is necessary to restrict the range of application of overly general procedures. It is possible for productions to become overly general either because of the generalization process or because the critical information was not attended to in the first place. It is for this reason that the discrimination process plays a critical role in the ACT theory. This discrimination process tries to restrict the range of application of productions to just the appropriate circumstances. The discrimination process requires that ACT have examples both of correct and incorrect application of the production. The discrimination algorithm remembers and compares the values of the variables in the correct and incorrect applications. It randomly chooses a variable for discrimination from among those that have different values in the two applications. Having selected a variable, it looks for some attribute that the variable has in only one of the situations. A test is added to the condition of the production for the presence of this attribute.

An example. Suppose ACT starts out with the following production:

P1. IF the goal is to indicate the relation in (LVsubject LVrelation LVobject),
    THEN say "LVrelation + s."

This rule for generating the present tense singular of a verb is, of course, overly general in the above form. For instance, this rule would apply when the sentence subject was plural, generating "LVrelation + s," when what is wanted is "LVrelation." By comparing circumstances where the above rule applied correctly with the current incorrect situation, ACT could notice that the variable LVsubject was bound to different values and that the value in the correct situation had singular number but the value in the incorrect situation had plural number. ACT can formulate a rule for the current situation that recommends the correct action:

P2. IF the goal is to indicate the relation in (LVsubject LVrelation LVobject)
       and LVsubject is plural,
    THEN say "LVrelation."

ACT can also form a modification of the previous rule for the past situation:

P3. IF the goal is to indicate the relation in (LVsubject LVrelation LVobject)
       and LVsubject is singular,
    THEN say "LVrelation + s."

The first discrimination, P2, is called an action discrimination because it involves learning a new action, whereas the second discrimination, P3, is called a condition discrimination because it involves restricting the condition for the old action. Because of specificity ordering, the action discrimination will block misapplication of the overly general P1. The condition discrimination, P3, is an attempt to reformulate P1 to make it more restrictive. It is important to note that these discriminations do not replace the original production; rather, they coexist with it.
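The comparison of variable bindings that yields the two kinds of discrimination can be sketched in Python. The sketch makes several assumptions that are not part of the source: productions are dictionaries with a set of `tests` and an `action`, the attributes of each bound value are supplied in a lookup table, and the names `discriminate`, `good`, `bad`, and `correct_action` are hypothetical. The example words and their attribute sets are likewise invented for illustration.

```python
import random


def discriminate(rule, good, bad, attributes, correct_action=None, rng=random):
    """Compare a correct and an incorrect application of `rule`.

    good, bad:   dicts mapping variable name -> value bound in that application
    attributes:  dict mapping a value -> set of attributes it has
    Returns a list of new productions (condition discrimination first).
    """
    differing = [v for v in good if v in bad and good[v] != bad[v]]
    if not differing:
        return []
    var = rng.choice(differing)  # random choice among differing variables

    only_good = sorted(attributes[good[var]] - attributes[bad[var]])
    only_bad = sorted(attributes[bad[var]] - attributes[good[var]])

    new_rules = []
    if only_good:
        # Condition discrimination: old action, narrower condition.
        test = (var, rng.choice(only_good))
        new_rules.append({"tests": rule["tests"] | {test},
                          "action": rule["action"]})
    if only_bad and correct_action is not None:
        # Action discrimination: needs feedback about the correct action.
        test = (var, rng.choice(only_bad))
        new_rules.append({"tests": rule["tests"] | {test},
                          "action": correct_action})
    return new_rules


p1 = {"tests": frozenset(), "action": "LVrelation + s"}
good = {"LVsubject": "dog", "LVrelation": "chase", "LVobject": "cat"}
bad = {"LVsubject": "dogs", "LVrelation": "chase", "LVobject": "cat"}
attributes = {"dog": {"singular", "animal"}, "dogs": {"plural", "animal"}}

p3, p2 = discriminate(p1, good, bad, attributes, correct_action="LVrelation")
print(p3)
print(p2)
```

In this example only LVsubject differs between the two applications, so the outcome does not depend on the random choice. The original rule `p1` is left unchanged, in keeping with the point that discriminations coexist with the production they refine.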

ACT does not always form both action and condition discriminations. ACT can only form an action discrimination when feedback is obtained about the correct action for the situation. If ACT only receives feedback that the old action is incorrect, it can only form a condition discrimination. However, ACT will only form a condition discrimination if the old rule (i.e., P1 in the above example) has achieved a level of strength to indicate that it has some history of success. The reason for this restriction on condition discriminations is that a rule can be formulated that is simply wrong and we do not want to have it perseverate by a process of endlessly proposing new discriminations.

Note that Productions P2 and P3 are improvements over P1 but are still not sufficiently refined. The discrimination algorithm can apply to these, however, comparing where they applied successfully and unsuccessfully. If discriminations of these productions were formed on the basis of tense and if both action and condition discriminations were formed, we would have the following set of productions:


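P4. IF the goal is to indicate the relation in (LVsubject LVrelation LVobject)
       and LVsubject is plural
       and LVrelation has past tense,
    THEN say "LVrelation + ed."
P5. IF the goal is to indicate the relation in (LVsubject LVrelation LVobject)
       and LVsubject is plural
       and LVrelation has present tense,
    THEN say "LVrelation."
P6. IF the goal is to indicate the relation in (LVsubject LVrelation LVobject)
       and LVsubject is singular
       and LVrelation has past tense,
    THEN say "LVrelation + ed."
P7. IF the goal is to indicate the relation in (LVsubject LVrelation LVobject)
       and LVsubject is singular
       and LVrelation has present tense,
    THEN say "LVrelation + s."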

A more thorough consideration of how these mechanisms would apply to acquisition of the verb auxiliary system of English is given in Anderson (1981a). The current example is only an illustration of the basic discrimination mechanism.

Recall that the feature selected for discrimination is determined by comparing the variable bindings in the successful and unsuccessful production applications. A variable on which they differ is selected, and features are selected to restrict the bindings. It is possible for this discrimination mechanism to choose the wrong variables or wrong features on which to discriminate. So, for instance, it may turn out that LVobject has a different number in two circumstances, and the system may set out to produce a discrimination on that basis, rather than discriminating on the correct variable, LVsubject. In the case of condition discriminations, such mistakes have no negative impact on the behavior of the system. The discriminated production produces the same behavior as the original in the restricted situation, so it cannot lead to worse behavior. (Recall that the original production still exists to produce the same behavior in other situations.) If an incorrect action discrimination is produced, it may block by specificity the correct application of the original production in other situations. However, even here the system can recover by producing the correct discrimination and then giving the correct discrimination a specificity or strength advantage over the incorrect discrimination.
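The comparison of variable bindings can be illustrated with a minimal sketch. It is not ACT's implementation: the dictionary representation of bindings and the function name `candidate_discriminations` are assumptions made only for this illustration. The example shows how a spurious candidate (LVobject) can surface alongside the correct one (LVsubject).

```python
from typing import Dict, List, Tuple

# Hypothetical representation: variable name -> {feature name: feature value}.
Bindings = Dict[str, Dict[str, str]]


def candidate_discriminations(success: Bindings, failure: Bindings) -> List[Tuple[str, str, str]]:
    """List (variable, feature, value) triples on which the two applications differ.

    Each triple names a feature test that holds in the successful application
    but not in the unsuccessful one, so it could restrict the production.
    """
    candidates = []
    for variable, features in success.items():
        for feature, value in features.items():
            if failure.get(variable, {}).get(feature) != value:
                candidates.append((variable, feature, value))
    return candidates


# Illustrative bindings: both the subject and the object happen to differ in number.
success = {"LVsubject": {"number": "singular"}, "LVobject": {"number": "plural"}}
failure = {"LVsubject": {"number": "plural"}, "LVobject": {"number": "singular"}}

candidates = candidate_discriminations(success, failure)
```

Here both variables are returned as candidates, so a mechanism that picks among them can discriminate on LVobject by mistake, as the text describes.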

The current discrimination mechanism also attempts to speed up the process of finding useful discriminations by its method of selecting propositions from the data base. Though a random process is used to guarantee that any appropriate propositions in the data will eventually be found, this random choice is biased in certain ways to increase the likelihood of a correct discrimination. The discrimination mechanism chooses propositions with probabilities that vary with their activation levels. The greater the amount of activation that has spread to a proposition, the more likely it is that the proposition will be relevant to the current situation.
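A biased random choice of this kind can be sketched as follows. The text says only that selection probabilities vary with activation; making them directly proportional to activation is an assumption of this sketch, as are the function name and the example activation values.

```python
import random
from typing import Dict


def choose_proposition(activations: Dict[str, float], rng: random.Random) -> str:
    """Pick one proposition with probability proportional to its activation (assumed form)."""
    propositions = list(activations)
    weights = [activations[p] for p in propositions]
    return rng.choices(propositions, weights=weights, k=1)[0]


# Hypothetical activation levels for three propositions in the data base.
activations = {"subject is plural": 3.0, "object is plural": 1.0, "verb is past": 0.5}

rng = random.Random(0)
counts = {p: 0 for p in activations}
for _ in range(10_000):
    counts[choose_proposition(activations, rng)] += 1
```

Because every proposition with nonzero activation keeps a nonzero probability, any appropriate proposition is eventually sampled, while highly active ones are tried first on average.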

The previous example illustrated a critical prerequisite for discrimination to work. The system must receive feedback indicating that a particular production has misapplied, and in the case of an action discrimination, it must receive feedback as to what the correct action should have been.

In principle, a production application could be characterized as being in one of three states: known to be incorrect, known to be correct, or correctness unknown. However, the mechanisms we have implemented for ACT do not use the distinction between the second and third states. If a production applies and there is no comment on its success, it is treated as if it were a successful application. So the real issue is how ACT identifies that a production application is in error. A production is considered to be in error if it puts into working memory a fact that is later tagged as incorrect. There are two basic ways for this error tagging to occur: one is through external feedback and the other is through internal computation. In the external-feedback situation, learners may be directly told that their behavior is in error or they may infer this by comparing their behavior to an external referent (e.g., the behavior of a model or a textbook answer). In the internal-computation case, the learner must identify that a fact is contradictory, that a goal has failed, or that there is some other failure to meet internal norms. As discussed in Anderson (1981b), the goal structure of ACT productions is very helpful in correctly identifying successful and failed productions.

Strengthening

The generalization and discrimination mechanisms are the inductive components of the learning system in that they are trying to extract from examples of success and failure the features that characterize when a particular production rule is applicable. The generalization and discrimination processes produce multiple variants on the conditions controlling the same action. It is important to realize that at any time the system is entertaining as its hypothesis a set of different productions with different conditions to control the action, not just a single production (condition-action rule). There are advantages to be gained in expressive power by means of multiple productions for the same action, differing in their conditions. Because the features in a production condition are treated conjunctively but separate productions are treated disjunctively, one can express the condition for an action as a disjunction of conjunctions of conditions. Many real-world categories have the need for this rather powerful expressive logic. Also, because of specificity ordering, productions can enter into more complex logical relations, as we noted.

However, because they are inductive processes, sometimes generalization and discrimination will err and produce incorrect productions. There are possibilities for overgeneralizations and useless discriminations. The phenomenon of overgeneralization is well documented in the language acquisition literature, occurring both in the learning of syntactic rules and in the learning of natural language concepts. The phenomena of pseudodiscriminations are less well documented in language because a pseudodiscrimination does not lead to incorrect behavior, just unnecessarily restrictive behavior. However, there are some documented cases in the careful analyses of language development (e.g., Maratsos & Chalkley, 1981). One reason that a strength mechanism is needed is because of these inductive failures. It is also the case that the system may simply create productions that are incorrect, either because of misinformation or because of mistakes in its computations. ACT uses its strength mechanism to eliminate wrong productions, whatever their source.

The strength of a production affects the probability that it will be placed on the APPLYLIST and is also used in resolving ties among competing productions of equal specificity on the APPLYLIST. These factors were discussed earlier with respect to the full set of conflict-resolution principles in ACT (see section on conflict resolution). ACT has a number of ways of adjusting the strength of a production in order to improve performance. Productions have a strength of .1 when first created. Each time it applies, a production's strength increases by an additive factor of .025. However, when a production applies and receives negative feedback, its strength is reduced by a multiplicative factor of .25. Because a multiplicative adjustment produces a greater change in strength than does an additive adjustment, this "punishment" has much more impact than a reinforcement does.
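The two adjustments can be written out as a small sketch using the values quoted above (.1, .025, .25). Two points are assumptions of the sketch rather than statements from the text: "reduced by a multiplicative factor of .25" is read here as losing 25% of current strength (set `PENALTY_FRACTION = 0.75` for the alternative reading, in which strength is multiplied by .25), and an application that draws negative feedback is assumed not to receive the additive increment as well.

```python
INITIAL_STRENGTH = 0.1   # strength of a newly created production
INCREMENT = 0.025        # additive gain for an application without negative feedback
PENALTY_FRACTION = 0.25  # ASSUMED reading: fraction of strength lost on negative feedback


def update_strength(strength: float, negative_feedback: bool) -> float:
    """Return the strength after one application of the production."""
    if negative_feedback:
        return strength * (1.0 - PENALTY_FRACTION)
    return strength + INCREMENT


def run(history):
    """Apply a sequence of outcomes (True = negative feedback) to a new production."""
    strength = INITIAL_STRENGTH
    for negative_feedback in history:
        strength = update_strength(strength, negative_feedback)
    return strength


after_four_successes = run([False] * 4)             # about 0.2
after_one_punishment = run([False] * 4 + [True])    # about 0.15 under the assumed reading
```

Under this reading a single punishment at strength .2 removes .05, which is twice the gain from one reinforcement, and the gap widens as strength grows.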

Although these two mechanisms are sufficient to adjust the behavior of any fixed set of productions, additional strengthening mechanisms are required to integrate new productions into the behavior of the system. Because these new productions are introduced with low strength, they would seem to be victims of a vicious cycle: They cannot apply unless they are strong, and they are not strong unless they have applied. What is required to break out of this cycle is a means of strengthening productions that does not rely on their actual application. This is achieved by taking all of the strength adjustments made to a production that applies and making these adjustments to all of its generalizations as well. Because a general production will be strengthened every time any one of its possibly numerous specializations applies, new generalizations can amass enough strength to extend the range of situations in which ACT performs successfully. Also, because a general production applies more widely, a successful general production will come to gather more strength than its specific variants.

For purposes of strengthening, re-creation of a production that is already in the system, whether by proceduralization, composition, generalization, or discrimination, is treated as equivalent to a successful application. That is, the re-created production receives a .025 strength increment and so do all of its generalizations.
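How a general production can gather strength without ever applying can be sketched as follows. The `Production` class, its `generalizations` field, and the choice to follow generalization links transitively are all assumptions of this illustration, not a description of ACT's data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Production:
    name: str
    strength: float = 0.1
    # Hypothetical links to the more general productions this one specializes.
    generalizations: List["Production"] = field(default_factory=list)


def strengthen(production: Production, delta: float = 0.025) -> None:
    """Add delta to a production and to every production reachable as a generalization."""
    seen = set()
    stack = [production]
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue  # guard against counting a shared generalization twice
        seen.add(id(current))
        current.strength += delta
        stack.extend(current.generalizations)


general = Production("general")
special_1 = Production("special-1", generalizations=[general])
special_2 = Production("special-2", generalizations=[general])

for _ in range(3):
    strengthen(special_1)  # three applications (or re-creations) of special-1
for _ in range(2):
    strengthen(special_2)  # two applications of special-2
```

After these five events the general production has received five increments even though it never applied itself, which is the escape from the vicious cycle described above.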

The exact strengthening values encoded into the ACT system are somewhat arbitrary. The general relations among the values are certainly important, but the exact relations probably are not. If all the strength values were multiplied by some scaling factor, one would get the same performance from the system. They were selected to give satisfactory performance in a set of language-learning examples described by Anderson et al. (1980).

Comparison to Other Discrimination Theories

As in the case of generalization, ACT's mechanisms for discrimination have clear similarities to earlier ideas about discrimination. ACT's discrimination mechanisms, like its generalization mechanisms, focus on structural relations, whereas traditional efforts were more focused on continuous dimensions. Brown (1977) has sketched out ways for extending ACT-like discrimination mechanisms to continuous dimensions, although we have not developed them in ACT. Also, it is the case that ACT discrimination mechanisms are really specified for an operant-conditioning paradigm (in that the actions of productions are evaluated according to whether they achieve desired behavior and goals) and do not really address the classical-conditioning paradigm in which a good deal of research has been done on discrimination. However, despite these major differences in character, a number of interesting connections can be drawn between ACT and the older conceptions of discrimination. In making these comparisons I will be drawing on strengthening and other conflict-resolution principles in ACT as well as the discrimination mechanism.

Shift experiments. One of the supposedly critical issues in choosing between the discontinuity and continuity theories of discrimination was the shift experiment (Spence, 1940). In that paradigm, subjects who were still responding at a chance level with respect to some discrimination (e.g., white-black) had the reinforcement contingencies changed so that a different response was appropriate. According to the discontinuity theory, the subject's chance performance indicated control by an incorrect hypothesis, so the shift should not hurt, but according to the continuity theory, the subject could still be building up "habit strength" for the correct response and a shift would hurt. Continuity theory tended to be supported on this issue for infrahuman subjects (e.g., see Kendler & Kendler, 1975). ACT is like the discontinuity theory in that its various productions represent alternative hypotheses about how to solve a problem; however, its predictions are in accord with the continuity theory because it can be accruing strength for a hypothesis before the production is strong enough to apply and produce behavior. Of course, ACT's discrimination mechanisms cannot account for the shift data with adults (e.g., Trabasso & Bower, 1968), but we have argued elsewhere (Anderson et al., 1979) that such data should be ascribed to a conscious hypothesis-testing process that produces declarative learning rather than an automatic procedural learning process.

Stimulus generalization and eventual discrimination. As noted earlier, the clauses in a production condition are like the elements of stimulus-sampling theory. A problem for stimulus-sampling theory (see Medin, 1976, for a recent discussion) is how to accommodate both the fact of stimulus generalization and the fact of eventual perfect discrimination. The fact of stimulus generalization can easily be explained in stimulus-sampling theory by assuming that two stimulus conditions overlap in their elements. However, if so, the problem becomes how perfect discrimination behavior can be achieved when the common elements can be associated to the wrong response.

In the ACT theory one can think of the original productions for behavior as basically testing for the null set of elements:

P1. IF the goal is X,
    THEN do Y.

With discrimination, elements can be brought in to discriminate between successful and unsuccessful situations, for example,

P2. IF the goal is X
    and B is present,
    THEN do Y.

P3. IF the goal is X
    and B is present
    and C is present,
    THEN do Z.

P4. IF the goal is X
    and D is present,
    THEN do Y,

and so forth. This is like the conditioning of features to responses in stimulus-sampling theory.

If some features occur sometimes in situations for Response Y and sometimes in situations for Response Z, discrimination can cause them to become parts of productions recommending one of the actions. For instance, suppose B is such a feature that really does not discriminate between the actions. Suppose B is present in the current situation where Response Z is executed, but the system receives feedback indicating that Y is correct. Further, suppose B was not present (or not attended) in the prior situation where Response Z had proved successful. Production P2 would be formed as an action discrimination. The B test is useless because B is just as likely to occur in a Z situation. This corresponds to the conditioning of common elements. However, in ACT the strengthening, discrimination, and specificity processes can eventually repress productions that are responding just to common elements. For instance, further discriminative features can be added (as in P3) that will serve to block out the incorrect application of P2. Also, it is possible to simply weaken P2 and add a new production like P4, which perhaps contains the correct discrimination.
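The way P3 can block P2 and the way strength settles ties can be illustrated with a minimal sketch of Productions P1 through P4. Several simplifications are assumptions made here: specificity is approximated by the number of condition clauses, the goal test "the goal is X" is left implicit, the strength values are invented, and selection is deterministic rather than probabilistic.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class Production(NamedTuple):
    name: str
    condition: FrozenSet[str]  # features tested in addition to "the goal is X"
    action: str
    strength: float            # hypothetical values, for illustration only


def select(productions: List[Production], situation: FrozenSet[str]) -> Optional[Production]:
    """Pick the matching production with the most clauses, breaking ties by strength."""
    matching = [p for p in productions if p.condition <= situation]
    if not matching:
        return None
    return max(matching, key=lambda p: (len(p.condition), p.strength))


PRODUCTIONS = [
    Production("P1", frozenset(), "Y", 0.5),
    Production("P2", frozenset({"B"}), "Y", 0.2),
    Production("P3", frozenset({"B", "C"}), "Z", 0.2),
    Production("P4", frozenset({"D"}), "Y", 0.3),
]

only_b = select(PRODUCTIONS, frozenset({"B"}))         # P2 applies: do Y
b_and_c = select(PRODUCTIONS, frozenset({"B", "C"}))   # P3 blocks P2: do Z
b_and_d = select(PRODUCTIONS, frozenset({"B", "D"}))   # equal specificity: stronger P4 wins
```

In this sketch, weakening P2 (lowering its strength) or adding a more specific rival such as P3 are the two routes by which a production keyed to a common element stops controlling behavior.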

Patterning effects. The ACT discrimination theory also explains how subjects can learn to give a response in the presence of Stimuli A and B together but neither A nor B alone. This simply requires that two discriminative clauses be added to the production, one for A and one for B. Responding to such configural cues was a problem for some of the earlier discrimination theories (see Rudy & Wagner, 1975, for a review). The power of the ACT theory over these early theories is that productions can respond to patterns of elements rather than to each element separately.

ACT also explains the fact that in the presence of correlated stimuli, one stimulus may partially or completely overshadow a second one (see MacKintosh, 1975, for a review). Thus, if both A and B are trained as a correlated pair to response R, one may find that A has less ability to evoke R alone than if it were the only cue associated with R. Sometimes if B is much more salient, A may have no control over R at all. In ACT the discrimination mechanism will choose among the available features (A, B, and other irrelevant stimuli) with probabilities reflecting their salience. Thus, it is possible that a satisfactory discrimination involving B will be found and that this production will be strengthened to where it is dominating behavior and producing satisfactory results so that the A discrimination will never be made. It is also possible that even after a production is formed with the B discrimination, it is too weak to apply, an error occurs, and an A discrimination occurs. In that case both A and B might develop as alternate and equally strong bases for responding. Thus, the ACT theory does not predict that overshadowing will always occur but allows it to occur and predicts it to be related to the differential salience of the competing stimuli.

Evidence for the ACT Tuning Mechanisms

We have spent and are spending some considerable effort in gathering evidence relevant to evaluation of the tuning mechanisms described here. One issue concerns sufficiency: Are the learning mechanisms described here adequate to produce intelligent, adaptive, and stable performance as an end product? Anderson et al. (1981) describe our efforts to establish sufficiency for the domain of geometry theorem proving. We were able to show adaptive behavior on the small scale, but size limitations on our simulation program prevented us from assessing what the eventual behavior of the program would be if it were to work through a course of study like that of a high school student.

Because of various technical optimizations, I was able to assess this issue of sufficiency more adequately in the case of language acquisition (Anderson, 1981a). Although not able to achieve anything so grand as a system with total competence in a language, I was able to show that the learning mechanisms did converge to produce correct syntactic behavior on various subsets of a number of different languages.

It is difficult to assess the psychological accuracy of the programs in these areas. Because of the scale of the phenomena, there are no careful empirical characterizations of the variety with which an experimental psychologist likes to work. For the same reason it is not possible to get reliable statistics about the performance of the simulation programs. In addition, the simulations require rather major simplifying assumptions. So, to check the empirical accuracy of the tuning mechanisms, we have looked at behavior of the program on various schema-abstraction or prototype-formation tasks (e.g., Franks & Bransford, 1971; Hayes-Roth & Hayes-Roth, 1977; and Medin & Schaffer, 1978). These are relatively tractable empirical phenomena on which it is possible to do careful analytic experiments. In Anderson et al. (1979), we were able to show our mechanisms capable of simulating many of the established phenomena. In Elio and Anderson (1981), we were able to confirm predictions from ACT's generalization mechanism that served to discriminate it from other theories.

Procedural Learning: The Power Law

One aspect of skill acquisition is distinguished both by its ubiquity and by its surface contradiction to ACT's multiple stage, multiple mechanism view of skill development. This is the log-linear or power law for practice: A plot of the logarithm of time to perform a task against the logarithm of amount of practice approximates a straight line. It has been widely discussed with respect to human performance (Fitts & Posner, 1967; Welford, 1968) and has been the subject of a number of recent theoretical analyses (Newell & Rosenbloom, 1981; Lewis, Note 3). It is found in phenomena as diverse as motor skills (Snoddy, 1926), pattern recognition (Neisser, Novick, & Lazar, 1963), problem solving (Neves & Anderson, 1981), memory retrieval (Anderson, in press), and, suspiciously, in machine building by industrial plants (an example of institutional learning rather than human learning; Hirsch, 1952). Figure 10 illustrates one example: the effect of practice on the speed with which inverted text can be read (Kolers, 1975). This ubiquitous phenomenon would seem to contradict the ACT theory of skill acquisition because at first it seems that a theory that proposes changing mechanisms of skill acquisition would not predict the apparent uniformity of the speedup. Also, it is not immediately clear why ACT would predict a power function rather than, say, an exponential function. Because of the ubiquity of the power law, it is important to show that the ACT learning theory is consistent with this phenomenon.

Figure 10. The effect of practice on the speed with which subjects can read inverted text. (From Kolers, 1975.) [Log-log plot; horizontal axis: Pages Read; fitted line T = 14.85N^{-.44}.]

The general form of the equation relating time (T) to perform a task to amount of practice (P) is

T = X + AP^{-b}   (1)

where X is the asymptotic speed, X + A is the speed on Trial 1, and b is the slope of the function on a log-log plot, where time is plotted as log(T - X). The value of b is almost always in the interval 0 to 1. The asymptotic X is usually very small relative to X + A, and the rate of approach to asymptote is slow in a power function. This means that it is often possible to get very good fits in plots like Figure 10, assuming a zero asymptote. However, careful analysis of data with enough practice does indicate evidence for nonzero asymptotes.
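The role of the asymptote on a log-log plot can be checked numerically with a short sketch of Equation 1. The parameter values below are invented for illustration and are not estimates from any data set; the function names are likewise hypothetical.

```python
import math


def practice_time(P: float, X: float, A: float, b: float) -> float:
    """Equation 1: time on trial P with asymptote X, scale A, and exponent b."""
    return X + A * P ** (-b)


def loglog_slope(P1: float, P2: float, X: float, A: float, b: float,
                 subtract_asymptote: bool = True) -> float:
    """Slope of log(time) against log(practice) between two amounts of practice."""
    offset = X if subtract_asymptote else 0.0
    y1 = math.log(practice_time(P1, X, A, b) - offset)
    y2 = math.log(practice_time(P2, X, A, b) - offset)
    return (y2 - y1) / (math.log(P2) - math.log(P1))


# Hypothetical parameters: a small asymptote relative to the Trial 1 time X + A.
X, A, b = 0.5, 10.0, 0.4

exact_slope = loglog_slope(1, 100, X, A, b)                           # equals -b
raw_slope = loglog_slope(1, 100, X, A, b, subtract_asymptote=False)   # shallower than -b
```

With the asymptote subtracted the line has slope -b exactly; plotting raw time instead gives a somewhat shallower line, which is why a zero-asymptote fit can look good yet understate b when X is not negligible.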

These facts about skill speedup have appeared contradictory to ACT-like learning mechanisms because ACT mechanisms would seem to imply speedup that is faster than one would obtain with a power law. Composition, as developed earlier, collapsed pairs of productions into single productions. It was noted (p. 384) that composition seems to predict a speedup on the order of Ba^P, which is to say an exponential function of practice, P (a is less than 1). An exponential law, as noted by Newell and Rosenbloom (1981), is in some sense the natural prediction about speedup. It implies that with each practice trial the subject can improve a constant fraction (a) of his current time (or has a constant probability of improving by a constant amount). When we look at ACT's tuning mechanisms of discrimination and generalization, it is harder to make general claims about the speedup they will produce because their speedup will depend on the characteristics of the problem space. However, it is at least plausible to propose that each discrimination or generalization has a constant expected factor of improvement. Composition, generalization, and discrimination improve performance by reducing the expected number of productions applied in performing a task. I will refer to improvement due to reduction in number of productions as algorithmic improvement.

In contrast to algorithmic improvement, strengthening reduces the time for individual productions of the procedure to apply. I will show that the strengthening process in ACT does result in a power law. However, even if strengthening obeys a power law, it is not immediately obvious why the total processing, which is a product of both algorithmic improvement and strengthening, should obey a power law. Nonetheless, I will set forth a set of assumptions under which this is just what is predicted by ACT and in so doing will resolve the problem.

Strengthening

Although complex processes like editing or proof generation appear to obey a power law, it is also the case that simple processes like simple choice reaction time (Mowbray & Rhoades, 1959) or memory retrieval (Anderson, in press) may do the same. In these cases the speedup cannot be modeled as an algorithmic improvement in number of production steps. There cannot be more than a small number of productions (e.g., 10) applying in the less than 500 msec required for these tasks. A process of reducing that number would not produce the continuous improvements observed. Moreover, subjects may well start out with optimal or near optimal procedures in terms of minimum number of productions, so there often is little or no room for algorithmic improvement. Here we have to assume that the speedup observed is due to a basic increase in the rate of production application, as would be produced by ACT's strengthening process.

Recall from the earlier discussions (see the section on conflict resolution) that time to apply a production is c + (a/s), where s is the production strength, c reflects processes in production application, and a is the time for a unit-strength production to be selected. Strength increases by one unit (a unit is arbitrarily .025 in our theory) with each trial of practice. Therefore, we can simply replace s in the expression above by P, the number of trials of practice. Then, the practice function for production execution in ACT takes the form

T = c + aP^{-1},   (2)

which is a hyperbolic function, one form of the power law. This assumes that on the first measured trial (P = 1), the production already has 1 unit of strength from an earlier encoding opportunity. The time for N such productions to apply would be

T = cN + aNP^{-1}   (3)

or

T = C + AP^{-1},   (4)

where C = cN and A = aN. This is a power law where the exponent is equal to 1 and the asymptote is C. The problem is that unless peculiar assumptions are made about prior practice (see Newell & Rosenbloom, 1981), the exponent obtained is typically much less than 1 (usually in the range .1 to .6).

However, the smaller exponents are to be predicted when one takes into account that there is forgetting or loss of strength from prior practice. Thus, a better form of Equation 4 would be

T = C + [A / \sum_{i=0}^{P-1} s(i, P)],   (5)

where s(i, P) denotes the strength remaining from the ith strengthening when the Pth trial comes about. In the above, s(0, P) denotes the strength on Trial P of the initial encoding trial. To understand the behavior of this function we have to understand the behavior of the critical sum:

S = \sum_{i=0}^{P-1} s(i, P).   (6)

Wickelgren (1976) has shown that the strength of the memory trace decays as a power law. Assuming that time is linear in number of practice trials we have

s(i, P) = D(P - i)^{-d},   (7)

where D is the initial strength and d < 1. Combining Equations 6 and 7 we get

S = D \sum_{i=0}^{P-1} (P - i)^{-d}   or   S = D \sum_{i=1}^{P} i^{-d}.   (8)

This function is bounded below and above as follows:

D[(P + 1)^{1-d} - 1]/(1 - d) \le S \le D[P^{1-d} - d]/(1 - d);   (9)

S can be approximated by the average of these upper and lower bounds, and because the difference between (P + 1)^{1-d} and P^{1-d} becomes increasingly small with large P, we may write

S \approx D[P^{1-d} - X]/(1 - d),

where X = (1 + d)/2 (this X is a new constant, not the asymptote of Equation 1). This factor X will become increasingly insignificant as P gets large. So, the important observation is that to a close approximation, total strength will grow as a power law. Substituting back into Equation 5 we get


T = C(P) + A'P^{-g},   (10)

where A' = A(1 - d)/D; g = 1 - d; and C(P) obeys the equation

C(P) = C + A(1 - d)X / [DP^{1-d}(P^{1-d} - X)],   (11)

which converges to C as P gets large. So for large P, Equation 10 will give a good approximation to a power law. Equation 10 has also proven to provide a good approximation for small P in my hand-calculated examples. Thus, the ACT model predicts that time for a fixed sequence of productions should decrease as a power law with the exponent deviating from 1 (and a hyperbolic function) to the degree that there is forgetting. The basic prediction of a power function is confirmed in simple tasks; the further prediction relating the exponent to forgetting is a difficult issue requiring further research. However, it is known that forgetting does reduce the effect of practice (e.g., Kolers, 1975). Given that forgetting must be an important factor in the long-term development of a skill, the ACT analysis of the power law is at a distinct advantage over other analyses that do not accommodate forgetting effects.
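The bounds and the approximation for the strength sum can be checked numerically. The sketch below evaluates the sum S = D \sum_{i=1}^{P} i^{-d} directly and compares it with the bounds of Equation 9 and with the approximation D[P^{1-d} - X]/(1 - d), X = (1 + d)/2. The values D = 1 and d = .5 are illustrative assumptions, and the function names are hypothetical.

```python
def total_strength(P: int, D: float, d: float) -> float:
    """Exact sum of decayed strengths over P strengthenings (Equation 8)."""
    return D * sum(i ** (-d) for i in range(1, P + 1))


def strength_bounds(P: int, D: float, d: float):
    """Lower and upper bounds on the sum (Equation 9)."""
    lower = D * ((P + 1) ** (1 - d) - 1) / (1 - d)
    upper = D * (P ** (1 - d) - d) / (1 - d)
    return lower, upper


def approx_strength(P: int, D: float, d: float) -> float:
    """Power-law approximation to the sum, using X = (1 + d) / 2."""
    X = (1 + d) / 2
    return D * (P ** (1 - d) - X) / (1 - d)


D, d = 1.0, 0.5  # illustrative values only

relative_errors = {}
for P in (1, 10, 100, 1000):
    exact = total_strength(P, D, d)
    lower, upper = strength_bounds(P, D, d)
    assert lower <= exact <= upper + 1e-12  # upper bound is met exactly at P = 1
    relative_errors[P] = abs(approx_strength(P, D, d) - exact) / exact
```

For these values the relative error of the approximation shrinks as P grows, consistent with the claim that the constant X becomes insignificant for large P.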

Algorithmic Improvement

There is an interesting relation between this power law for simple tasks, based just on strength accumulation, and the power law for complex tasks where there is also the potential for reduction in number of production steps. We noted in the case of composition that a limit on this process is that all the information to be matched by the composed production must be active in working memory. Because the size of production conditions (despite the optimization produced by proceduralization) tends to increase exponentially with compositions, the requirements on working memory for the next composition tend to increase exponentially with the number of compositions. It is also the case that as successful discriminations and generalizations proceed, there will be an increase in the amount of information that needs to be held in working memory so that another useful feature can be identified. In this case it is not possible to make precise statements concerning the factor of increase, but it is not unreasonable to suppose that this increase is also exponential with number of improvements. It is then implied that the following relation should define the size (W_i) of working memory needed for the ith algorithmic improvement:

W_i = GH^i,   (12)

where G and H are the parameters of the exponential function.

The ACT theory predicts that a power law should describe the amount of activation of a knowledge structure as a function of practice (in the concepts or links that define that structure). By the same analysis as the one just given for production strength, ACT predicts that the strength of memory structures should increase as a power function of practice. The strength of a memory structure directly determines the amount of activation it will receive. Thus, we have the following equation describing total memory activation (A) as a function of practice:

A = QP^r,   (13)

where Q and r are the parameters of the power function. (Note that P is raised to a positive exponent, r, less than one.) This equation is more than just theoretical speculation; work in our laboratory on effects of practice on memory retrieval has supported this relation. A small proportion of this work is discussed in Anderson (in press).

There is a strong relationship in the ACT theory between the working memory requirements described by Equation 12 and the total activation described by Equation 13. For an amount, W, of information to be available in working memory, the information must reach a threshold level of activation L, which means that the total amount of activation of the information structure will be described by

A = WL.   (14)

Equations 12, 13, and 14 may be combined to derive a relation between the number of improvements (i) and amount of practice:

i = r log(P)/log(H) + log(Q)/log(H) - log(L)/log(H) - log(G)/log(H)   (15)

or, more simply,

i = a + b log(P).   (16)

Thus, because of working memory limitations, the rate of algorithmic improvement is a logarithmic rather than a linear function of practice. Continuing with the assumption that the number of steps (N) should be reduced by a constant fraction, f, with each improvement, we get

N = N_0 f^i   (17)

or

N = N'_0 P^{-f'},   (18)

where

f' = -b log(f) and N'_0 = N_0 f^a.   (19)

Thus, the number of productions to be applied should decrease as a power function of practice. Equation 18 assumes that in the limit, 0 steps are required to perform the task, but there must be some minimum N* that is the optimal procedure. Clearly, N* has at least the value 1. Exactly how to introduce this minimum into Equation 18 will depend on one's analysis of the improvements. If we simply add it, we get the standard power function for improvement to an asymptote:

N = N* + N'_0 P^{-f'}.   (20)
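The step from a logarithmic improvement rate to a power function for the number of steps can be verified numerically. The sketch below computes N both as N_0 f^i with i = a + b log(P) (Equations 16 and 17) and as N'_0 P^{-f'} (Equations 18 and 19). It assumes the same logarithm base (natural log here) in Equations 16 and 19; the parameter values are invented for illustration.

```python
import math


def improvements(P: float, a: float, b: float) -> float:
    """Equation 16: number of algorithmic improvements after P trials."""
    return a + b * math.log(P)


def steps_via_improvements(P: float, N0: float, f: float, a: float, b: float) -> float:
    """Equation 17 with i taken from Equation 16: N = N0 * f**i."""
    return N0 * f ** improvements(P, a, b)


def steps_via_power_law(P: float, N0: float, f: float, a: float, b: float) -> float:
    """Equations 18 and 19: N = N0' * P**(-f')."""
    f_prime = -b * math.log(f)
    N0_prime = N0 * f ** a
    return N0_prime * P ** (-f_prime)


# Hypothetical parameters: each improvement removes 20% of the steps.
N0, f, a, b = 100.0, 0.8, 0.5, 2.0

pairs = [(steps_via_improvements(P, N0, f, a, b), steps_via_power_law(P, N0, f, a, b))
         for P in (1, 10, 100, 1000)]
```

The two columns agree for every P, which is just the identity f^{b log P} = P^{b log f}; adding the floor N* then gives Equation 20.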

Let us review the analysis of the power law to date. We started with the observation that, assuming that the rate of algorithmic improvement is linear with practice and that each improvement has a proportional decrease in number of productions, an exponential practice function is predicted, not a power practice function. We noted that the mechanisms of strength accumulation predict that individual productions should speed up as a power function. Similar strength dynamics governing the growth of working memory size imply that the rate of algorithmic improvement is actually logarithmic and therefore the decrease in number of productions would be a power function.

It should be noted that the relation between working memory capacity and improvements in the production algorithm corresponds to a common subject experience on complex tasks. Initially, subjects report feeling swamped, trying just to keep up with the task, and have no sense of the overall organization of the task. With practice, subjects report beginning to perceive the structure of the task and claim to be able to see how to make improvements. It is certainly the case that we observe subjects to be better able to maintain current state and goal and better able to retrieve past goals and states of the task. Thus, it seems that their working memory for the problem improves with practice, and subjects claim that being able to apprehend at once a substantial portion of the problem is what is critical to making improvements.

Algorithmic Improvement and Strengthening Combined

The total time to perform a task is determined by the number of productions and the time per production. Therefore, the simplest prediction about total time (TT) would be to combine multiplicatively Equation 10, describing time per production, and Equation 20, describing number of productions:

TT = (N* + N'_0 P^{-f'})(C + A'P^{-g}).   (21)

Because of the asymptotic components, N* and C, the above will not be a pure power law but it will look like a power function to a good approximation (as good an approximation as is typically observed empirically). If N* and C were 0, then we would have a pure power law of the form

TT = N'_0 A'P^{-(f' + g)},   (22)

which has a zero asymptote. Because the initial time is so large relative to final time, most data are fit very well assuming a zero asymptote. This is the form of the equation we will use for further discussion.
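How closely the product of two power functions with asymptotes resembles a pure power law can be examined with a short sketch. It multiplies a step-count function of the form N* + N'_0 P^{-f'} by a time-per-production function of the form C + A'P^{-g} and measures the log-log slope. All parameter values are invented for illustration, and the function names are hypothetical.

```python
import math


def total_time(P: float, N_star: float, N0_prime: float, f_prime: float,
               C: float, A_prime: float, g: float) -> float:
    """Number of productions times time per production, each a power function of P."""
    steps = N_star + N0_prime * P ** (-f_prime)
    time_per_step = C + A_prime * P ** (-g)
    return steps * time_per_step


def loglog_slope(P1: float, P2: float, **params: float) -> float:
    """Slope of log(total time) against log(practice) between P1 and P2."""
    y1 = math.log(total_time(P1, **params))
    y2 = math.log(total_time(P2, **params))
    return (y2 - y1) / (math.log(P2) - math.log(P1))


# Hypothetical parameters.
shape = dict(N0_prime=50.0, f_prime=0.3, A_prime=2.0, g=0.4)

pure_slope = loglog_slope(10, 1000, N_star=0.0, C=0.0, **shape)      # -(f' + g) = -0.7
damped_slope = loglog_slope(10, 1000, N_star=1.0, C=0.05, **shape)   # shallower than -0.7
```

With both asymptotes set to zero the slope is exactly -(f' + g); with small nonzero asymptotes the curve is still close to a straight line on log-log axes but with a shallower apparent exponent.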

[Plot of total time, total steps, and time per step against log (trials).]

Figure 11. The effect of practice on a reason-giving task. (Plotted separately are the effects on number of steps, time per step, and total time. From Neves and Anderson, 1981.)

One complication ignored in the foregoing discussion is that algorithmic improvements in the number of productions typically mean creation of new productions. According to the theory new productions start off with low strength. Thus, productions at later points in the experiment will not have been practiced since the beginning of the experiment and will have lower strength than assumed in Equations 21 and 22. Another complication on top of this is that a completely new set of productions will not be instituted with each improvement; only a subset will change. Suppose that at any time the productions in use were introduced an average of $j$ improvements ago. This means (by Equation 15) that after the $i$th improvement the average production has been practiced from Trial $KL^{i-j}$ to Trial $KL^{i}$ and therefore has had $KL^{i}(1 - L^{-j})$ trials of practice, where $K = H^{-a/r}$ and $L = H^{1/r}$ from Equations 15 and 16. Thus, the number of trials of practice ($P^{*}$) for a production is expected to be a constant fraction of the total number of trials ($P$) on the task:

$$P^{*} = qP, \tag{23}$$

where $q = 1 - L^{-j}$. This implies that the correct form of Equation 22 is

$$TT = N_0 A' q^{-g} P^{-(f+g)}. \tag{24}$$

Thus, this argument does not at all affect the expectation of a power function.
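The step from Equation 23 to Equation 24 can be spelled out as a short worked derivation. It is a sketch that assumes the zero-asymptote forms used in the text: number of productions $N_0 P^{-f}$, which depends on total trials $P$, and time per production $A'(P^{*})^{-g}$, which depends on the practice $P^{*}$ that the current productions have actually received.

```latex
\begin{align*}
TT &= N_0 P^{-f} \cdot A' (P^{*})^{-g} \\
   &= N_0 P^{-f} \cdot A' (qP)^{-g} \\
   &= N_0 A' q^{-g} P^{-(f+g)} .
\end{align*}
```

Because $q$ does not depend on $P$, the factor $q^{-g}$ only rescales the curve; the exponent $-(f+g)$ is unchanged.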

An Experimental Test

The basic prediction of this analysis is that both number of productions and time per production should decrease as a power function of practice. As a result, total time will decrease as a power function. Neves and Anderson (1981) have tested this prediction in an experiment that studied subjects' ability to give reasons for the lines of an abstract logic proof. This reason-giving task is modeled after a frequent kind of exercise found in high school geometry texts (see Figure 3). However, we wanted to use the task with college students and wanted to see the effects of practice, starting from the beginning. Therefore, we invented a novel artificial proof system. Each proof consisted of 10 lines. Each line could be justified as a given or derived from earlier lines by application of one of nine postulates. Subjects could only see the current line of the proof and had to ask a computer to display particular prior lines, givens, or postulates. The method of requesting this information was very easy, and so we hoped to be able to trace, by subjects' request behavior, the steps of the algorithm they were following. The relation between requests and production application is almost certainly one to many, but we believe that we can use these requests as an index of the number of productions that are applying. The basic assumption is that the ratio of productions to requests will not change over time. This assumption certainly could be challenged, but I think it is plausible and is strongly supported by the orderliness of the results. Under this assumption, if we plot number of requests as a function of practice, we are looking at the reduction in the number of productions or algorithmic improvement. If we plot time per request, we are looking at the improvement in the speed of individual productions.

Figure 11 presents the analysis of these data averaged over three subjects (individual subjects show the same pattern). Subjects took about 25 minutes to do the first problem. After 90 problems they were often taking under 2 minutes to do the proofs. This reflects the impact of approximately 10 hours of practice. As can be seen from Figure 11, both number of steps (information requests) and time per step (interval between requests) go down as power functions of practice. Hence, total time also obeys a power function. The exponent for the number of steps is -.346 (ranging from -.315 to -.373 for individual subjects), whereas the exponent for the time per step is -.198 (ranging from -.144 to -.226).
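The reported exponents can be combined as a rough consistency check on the reported times. The sketch below rests on assumptions that go beyond the text: it treats total time as a pure zero-asymptote power function, uses the 25-minute first-problem time as the Trial 1 value, and ignores variation across subjects. The names are illustrative only.

```python
# Values reported in the text for the reason-giving task.
STEP_EXPONENT = -0.346           # number of steps (information requests)
TIME_PER_STEP_EXPONENT = -0.198  # time per step (interval between requests)
FIRST_PROBLEM_MINUTES = 25.0
PROBLEMS = 90


def power_law(initial, exponent, trial):
    """Zero-asymptote power function: initial * trial ** exponent."""
    return initial * trial ** exponent


# Multiplying two power functions of the same variable adds their exponents.
total_exponent = STEP_EXPONENT + TIME_PER_STEP_EXPONENT
predicted_minutes = power_law(FIRST_PROBLEM_MINUTES, total_exponent, PROBLEMS)

print(f"total-time exponent: {total_exponent:.3f}")
print(f"predicted minutes on problem {PROBLEMS}: {predicted_minutes:.2f}")
```

Under these simplifications the combined exponent is about -.54 and the predicted time on the 90th problem comes out at roughly 2 minutes, the same order as the times reported for practiced subjects.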

The Neves and Anderson (1981) experiment does provide evidence that underlying a power law in complex tasks are power laws both in number of steps applied and in time per step. I have shown how a power law in strength accumulation may underlie both of these phenomena. Although it is true that algorithmic improvement would tend to produce exponential speedup, the underlying strength dynamics determine working memory capacity and produce a power function in algorithmic improvement. It is tempting to think of these strength dynamics as describing a process at the neural level of the system. Therefore, it is interesting to note Eccles's (1972) review of the evidence that individual neurons increase with practice and decrease with disuse in their rate of transmitter release and pickup across synapses.

Summary

We have now reviewed the basic progression of skill acquisition according to the ACT learning theory. It starts out as the interpretive application of declarative knowledge; this becomes compiled into a procedural form, and this procedural form undergoes a process of continual refinement of conditions and raw increase in speed. In a sense this is a stage analysis of human learning. Much as other stage analyses of human behavior, this stage analysis of ACT is being offered as an approximation to characterize a rather complex system of interactions. Any interesting behavior is produced by a set of elementary components, and different components can be at different stages. For instance, part of a task can be performed interpretively, whereas another part is performed as compiled.

The claim is that the configuration of learning mechanisms described is involved in the full range of skill acquisition from language acquisition to problem solving to schema abstraction. Another strong claim is that the basic control architecture across these situations is hierarchical, goal structured, and basically organized for problem solving. This echoes the claim made elsewhere (Newell, 1980) that problem solving is the basic mode of cognition. The claim is that the mechanisms of skill acquisition basically function within the mold provided by the basic problem-solving character of skills. As skills evolve they become more tuned and compiled, and the original search of the problem space may drop out as a significant aspect. I have presented a variety of theoretical analyses and experimental analyses that provide positive evidence for this broad view of skill acquisition. Clearly, many more analyses and experimental tests can be done. However, the available evidence at least conveys a modest degree of credibility to the theory presented.

In conclusion, I would like to point out that the learning theory proposed here has achieved a unique accomplishment. Unlike past learning theories it has cogently addressed the issue of how symbolic or cognitive skills are acquired. (Indeed, I have been so focused on this, I have ignored some of the phenomena that traditional learning addressed, such as classical conditioning.) The inadequacies of past learning theories to account for symbolic behavior have been a major source of criticism. On the other hand, unlike many of the current cognitive theories, ACT not only provides an analysis of the performance of a cognitive skill but also an analysis of its acquisition. Many researchers (e.g., Estes, 1975; Langley & Simon, 1981; Rumelhart & Norman, 1978) have lamented how the strides in task analysis within cognitive psychology have not been accompanied by strides in development of learning theory.

If I were to select the conceptual developments most essential to this theory of the acquisition of cognitive skills, I would point to two. First, there is the clear separation made in ACT between declarative knowledge (propositional network of facts) and procedural knowledge (production system). The declarative system has the capacity to represent abstract facts. The production system through its use of variables can process the propositional character of this data base. Also, productions through their reference to goal structures have the capacity to shift attention and control in a symbolic way. These basic symbolic capacities are essential to the success of the learning mechanisms. Knowledge is integrated into the system by first being encoded declaratively and then being interpreted. We argued that the successful integration of knowledge into behavior requires that it first go through such an interpretive stage. The various learning mechanisms all are structured around variable use and reference to goal structures. Moreover, the learning processes impact on the course of the symbolic processing, making it both faster and more judicious in choice. In ACT we see how learning and symbolic processing could be synergetic. These two aspects of cognition surely are synergetic in man, and this fact commends the theory for consideration at least as much as any specific issue that was considered.

Second is the ACT production system architecture itself. Productions are relatively simple and well-defined objects, and this is essential if one is to produce general learning mechanisms. The general learning mechanisms must be constituted so that they will correctly operate on the full range of structures (productions) that they might encounter. It is possible to construct such learning mechanisms for ACT productions; it would not be possible if the procedural formalism were as diverse and unconstrained as are LISP functions. ACT productions have the virtue of stimulus-response bonds with respect to their simplicity but also have considerable computational power. A problem with many production system formalisms with respect to learning is that it is hard for the learning mechanism to appreciate the function of the production in the overall flow of control. This is why the use of goal structures is such a significant augmentation to the ACT architecture. By inspecting the goal structure in which a production application participates, it is possible to understand the role of the production. This is essential to a system that learns by doing.

Reference Notes

1. Rychener, M. D. Approaches to knowledge acquisition: The instructable production system project. Unpublished manuscript, Carnegie-Mellon University, 1981.

2. Anderson, J. R., Kline, P. J., & Beasley, C. M. A theory of the acquisition of cognitive skills (ONR Tech. Rep. 77-1). New Haven, Conn.: Yale University, Department of Psychology, 1977.

3. Lewis, C. H. Speed and practice. Unpublished manuscript, 1979. (Available from C. H. Lewis, IBM Research Center, Yorktown Heights, New York 10598.)

References

Anderson, J. R. Language, memory, and thought. Hillsdale, N.J.: Erlbaum, 1976.

Anderson, J. R. Cognitive psychology and its implications. San Francisco, Calif.: Freeman, 1980.

Anderson, J. R. A theory of language acquisition based on general learning mechanisms. Proceedings of the Seventh International Joint Conference on Artificial Intelligence, 1981, 97-103. (a)

Anderson, J. R. Tuning of search of the problem space for geometry proofs. Proceedings of the Seventh International Joint Conference on Artificial Intelligence, 1981, 165-170. (b)

Anderson, J. R. Retrieval of information from long-term memory. Science, in press.

Anderson, J. R., Greeno, J. G., Kline, P. J., & Neves, D. M. Acquisition of problem-solving skill. In J. R. Anderson (Ed.), Cognitive skills and their acquisition. Hillsdale, N.J.: Erlbaum, 1981.

Anderson, J. R., Kline, P. J., & Beasley, C. M. A general learning theory and its application to schema abstraction. In G. H. Bower (Ed.), The psychology of learning and motivation (Vol. 13). New York: Academic Press, 1979, 277-318.

Anderson, J. R., Kline, P. J., & Beasley, C. M. Complex learning processes. In R. E. Snow, P. A. Federico, & W. E. Montague (Eds.), Aptitude, learning, and instruction (Vol. 2). Hillsdale, N.J.: Erlbaum, 1980.

Braine, M. D. S. On learning grammatical order of words. Psychological Review, 1963, 70, 323-348.

Briggs, G. E., & Blaha, J. Memory retrieval and central comparison times in information processing. Journal of Experimental Psychology, 1969, 79, 395-402.

Brown, D. J. H. Concept learning by feature value interval abstraction. Proceedings of the Workshop on Pattern-Directed Inference Systems, 1977, 55-60.

Brown, J. S., & Van Lehn, K. Repair theory: A generative theory of bugs in procedural skills. Cognitive Science, 1980, 4, 379-426.

Brown, R. A first language. Cambridge, Mass.: Harvard University Press, 1973.

Burke, C. J., & Estes, W. K. A component model for stimulus variables in discrimination learning. Psychometrika, 1957, 22, 133-145.

Eccles, J. C. Possible synaptic mechanisms subserving learning. In A. G. Karczmar & J. C. Eccles (Eds.), Brain and human behavior. New York: Springer-Verlag, 1972.

Elio, R., & Anderson, J. R. Effects of category generalizations and instance similarity on schema abstraction. Journal of Experimental Psychology: Human Learning and Memory, 1981, 7, 397-417.

Estes, W. K. Toward a statistical theory of learning. Psychological Review, 1950, 57, 94-107.


Estes, W. K. The state of the field: General problems and issues of theory and metatheory. In W. K. Estes (Ed.), Handbook of learning and cognitive processes (Vol. 1). Hillsdale, N.J.: Erlbaum, 1975.

Fitts, P. M. Perceptual-motor skill learning. In A. W. Melton (Ed.), Categories of human learning. New York: Academic Press, 1964.

Fitts, P. M., & Posner, M. I. Human performance. Monterey, Calif.: Brooks/Cole, 1967.

Forgy, C., & McDermott, J. OPS, a domain-independent production system. Proceedings of the Fifth International Joint Conference on Artificial Intelligence, 1977, 933-939.

Franks, J. J., & Bransford, J. D. Abstraction of visual patterns. Journal of Experimental Psychology, 1971, 90, 65-74.

Greeno, J. G. Indefinite goals in well-structured problems. Psychological Review, 1976, 83, 479-491.

Hayes-Roth, B., & Hayes-Roth, F. Concept learning and the recognition and classification of exemplars. Journal of Verbal Learning and Verbal Behavior, 1977, 16, 321-338.

Hayes-Roth, F., & McDermott, J. Learning structured patterns from examples. Proceedings of the Third International Joint Conference on Pattern Recognition, 1976, 419-423.

Heinemann, E. C., & Chase, S. Stimulus generalization. In W. K. Estes (Ed.), Handbook of learning and cognitive processes (Vol. 2). Hillsdale, N.J.: Erlbaum, 1975.

Hirsch, W. Z. Manufacturing progress functions. Review of Economics and Statistics, 1952, 34, 143-155.

Jurgensen, R. C., Donnelly, A. J., Maier, J. E., & Rising, G. R. Geometry. Boston, Mass.: Houghton Mifflin, 1975.

Kendler, H. H., & Kendler, T. S. From discrimination learning to cognitive development: A neobehavioristic odyssey. In W. K. Estes (Ed.), Handbook of learning and cognitive processes (Vol. 1). Hillsdale, N.J.: Erlbaum, 1975.

Kolers, P. A. Memorial consequences of automatized encoding. Journal of Experimental Psychology: Human Learning and Memory, 1975, 1, 689-701.

Langley, P., & Simon, H. A. The central role of learning in cognition. In J. R. Anderson (Ed.), Cognitive skills and their acquisition. Hillsdale, N.J.: Erlbaum, 1981.

Larkin, J. H., McDermott, J., Simon, D. P., & Simon, H. A. Expert and novice performance in solving physics problems. Science, 1980, 208, 1335-1342.

Larson, J., & Michalski, R. S. Inductive inference of VL decision rules. Proceedings of the Workshop on Pattern-Directed Inference Systems, 1977, 38-44.

Lewis, C. H. Production system models of practice effects. Unpublished doctoral dissertation, University of Michigan, 1978.

Luchins, A. S. Mechanization in problem solving. Psychological Monographs, 1942, 54(6, Whole No. 248).

Luchins, A. S., & Luchins, E. H. Rigidity of behavior; a variational approach to the effect of Einstellung. Eugene: University of Oregon Books, 1959.

MacKintosh, N. J. From classical conditioning to discrimination learning. In W. K. Estes (Ed.), Handbook of learning and cognitive processes (Vol. 1). Hillsdale, N.J.: Erlbaum, 1975.

Maratsos, M. P., & Chalkley, M. A. The internal language of children's syntax: The ontogenesis and representation of syntactic categories. In K. Nelson (Ed.), Children's language (Vol. 1). New York: Gardner Press, 1981.

McNeill, D. On theories of language acquisition. In T. R. Dixon & D. L. Horton (Eds.), Verbal behavior and general behavior theory. Englewood Cliffs, N.J.: Prentice-Hall, 1968.

Medin, D. L. Theories of discrimination learning and learning set. In W. K. Estes (Ed.), Handbook of learning and cognitive processes (Vol. 3). Hillsdale, N.J.: Erlbaum, 1976.

Medin, D. L., & Schaffer, M. M. A context theory of classification learning. Psychological Review, 1978, 85, 207-238.

Miller, G. A., Galanter, E., & Pribram, K. H. Plans and the structure of behavior. New York: Holt, Rinehart & Winston, 1960.

Mowbray, G. H., & Rhoades, M. V. On the reduction of choice reaction times with practice. Quarterly Journal of Experimental Psychology, 1959, 11, 16-23.

Neisser, U., Novick, R., & Lazar, R. Searching for ten targets simultaneously. Perceptual and Motor Skills, 1963, 17, 955-961.

Neves, D. M. Learning procedures from examples. Unpublished doctoral dissertation, Carnegie-Mellon University, 1981.

Neves, D. M., & Anderson, J. R. Knowledge compilation: Mechanisms for the automatization of cognitive skills. In J. R. Anderson (Ed.), Cognitive skills and their acquisition. Hillsdale, N.J.: Erlbaum, 1981.

Newell, A. Reasoning, problem-solving, and decision processes: The problem space as a fundamental category. In R. Nickerson (Ed.), Attention and performance VIII. Hillsdale, N.J.: Erlbaum, 1980.

Newell, A., & Rosenbloom, P. Mechanisms of skill acquisition and the law of practice. In J. R. Anderson (Ed.), Cognitive skills and their acquisition. Hillsdale, N.J.: Erlbaum, 1981.

Norman, D. A. Discussion: Teaching, learning, and the representation of knowledge. In R. E. Snow, P. A. Federico, & W. E. Montague (Eds.), Aptitude, learning, and instruction (Vol. 2). Hillsdale, N.J.: Erlbaum, 1980.

Rudy, J. W., & Wagner, A. R. Stimulus selection in associative learning. In W. K. Estes (Ed.), Handbook of learning and cognitive processes (Vol. 2). Hillsdale, N.J.: Erlbaum, 1975.

Rumelhart, D. E., & Norman, D. A. Accretion, tuning, and restructuring: Three modes of learning. In J. W. Cotton & R. Klatzky (Eds.), Semantic factors in cognition. Hillsdale, N.J.: Erlbaum, 1978.

Rychener, M. D., & Newell, A. An instructible production system: Basic design issues. In D. A. Waterman & F. Hayes-Roth (Eds.), Pattern-directed inference systems. New York: Academic Press, 1978.

Schneider, W., & Shiffrin, R. M. Controlled and automatic human information processing: I. Detection, search, and attention. Psychological Review, 1977, 84, 1-66.

Shiffrin, R. M., & Dumais, S. T. The development of automatism. In J. R. Anderson (Ed.), Cognitive skills and their acquisition. Hillsdale, N.J.: Erlbaum, 1981.


Shiffrin, R. M., & Schneider, W. Controlled and automatic human information processing: II. Perceptual learning, automatic attending, and a general theory. Psychological Review, 1977, 84, 127-190.

Snoddy, G. S. Learning and stability. Journal of Applied Psychology, 1926, 10, 1-36.

Spence, K. W. Continuous versus noncontinuous interpretations of discrimination learning. Psychological Review, 1940, 47, 271-288.

Sternberg, S. Memory scanning: Mental processes revealed by reaction time experiments. American Scientist, 1969, 57, 421-457.

Trabasso, T. R., & Bower, G. H. Attention in learning. New York: Wiley, 1968.

Vere, S. A. Inductive learning of relational productions. In D. A. Waterman & F. Hayes-Roth (Eds.), Pattern-directed inference systems. New York: Academic Press, 1978.

Welford, A. T. Fundamentals of skill. London: Methuen, 1968.

Wickelgren, W. A. Memory storage dynamics. In W. K. Estes (Ed.), Handbook of learning and cognitive processes (Vol. 4). Hillsdale, N.J.: Erlbaum, 1976.

Received July 16, 1981
Revision received December 18, 1981
