
Machine Learning in Games: A Survey

Johannes Fürnkranz
Austrian Research Institute for Artificial Intelligence

Schottengasse 3, A-1010 Wien, Austria

E-mail: [email protected]

Technical Report OEFAI-TR-2000-31*

Abstract

This paper provides a survey of previously published work on machine learning in game playing. The material is organized around a variety of problems that typically arise in game playing and that can be solved with machine learning methods. This approach, we believe, allows both researchers in game playing to find appropriate learning techniques for helping to solve their problems, and machine learning researchers to identify rewarding topics for further research in game-playing domains. The paper covers learning techniques that range from neural networks to decision tree learning, in games that range from poker to chess. However, space constraints prevent us from giving detailed introductions to the learning techniques or games used. Overall, we aimed at striking a fair balance between being exhaustive and being exhausting.

*To appear in: J. Fürnkranz & M. Kubat: Machines that Learn to Play Games, Nova Scientific Publishers, Chapter 2, pp. 11–59, Huntington, NY, 2001.


1 Samuel’s Legacy

In 1947, Arthur L. Samuel, at the time Professor of Electrical Engineering at the University of Illinois, came up with the idea of building a checkers program. Checkers, generally regarded as a simpler game than chess, seemed to be a perfect domain for demonstrating the power of symbolic computing with a quick programming project. The plan was straight-forward: "write a program to play checkers [...] challenge the world champion and beat him". With this tiny project, he hoped to generate enough interest to raise funds for a university computer.1

Little did Samuel know that he would work on this program for the next two decades, thereby not only producing a master-level checkers player—it beat one of America's best players of the time in a game that is published in (Feigenbaum and Feldman 1963)—but also introducing many important ideas in game playing and machine learning. The two main papers describing his research (Samuel 1959; Samuel 1967) became landmark papers in Artificial Intelligence. In these works, Samuel not only pioneered many popular enhancements of modern search-based game-playing programs (like alpha-beta cutoffs and quiescence search), but also invented a wide variety of learning techniques for automatically improving the program's performance over time. In fact, he considered checkers to be a perfect domain for the study of machine learning because in games, many of the complications that come with real-world problems are simplified, allowing researchers to focus on the learning problems themselves (Samuel 1959). As a result, many techniques that contributed to the success of machine learning as a science can be directly traced back to Samuel, and most of Samuel's ideas for learning are still in use in one form or another.

First, his checkers player remembered positions that it frequently encountered during play. This simple form of rote learning allowed it to save time, and to search deeper in subsequent games whenever a stored position was encountered on the board or in some line of calculation. Next, it featured the first successful application of what is now known as reinforcement learning for tuning the weights of its evaluation function. The program trained itself by playing against a stable copy of itself. After each move, the weights of the evaluation function were adjusted in a way that moved the evaluation of the root position after a quiescence search closer to the evaluation of the root position after searching several moves deep. This technique is a variant of what is nowadays known as temporal-difference learning and is commonly used in successful game-playing programs. Samuel's program not only tuned the weights of the evaluation function but also employed on-line feature subset selection for constructing the evaluation function from the terms that seemed most significant for evaluating the current board situation. Later, he changed his evaluation function from a linear combination of terms into a structure that closely resembled a 3-layer neural network. This structure was trained with comparison training on several thousand positions from master games.

Since the days of Samuel's checkers program, the fields of machine learning and game playing have come a long way. Many of the new and powerful techniques that have been developed in these areas can be directly traced back to some of Samuel's ideas. His checkers player is still considered to be among the most influential works in both research areas, and a perfect example of a fruitful symbiosis of the two fields.

1 The quotation and the facts in this paragraph are taken from (McCorduck 1979).

1.1 Machine Learning

Machine learning has since developed into one of the main research areas in Artificial Intelligence, with several journals and conferences devoted to the publication of its main results. Recently, the spin-off field Knowledge Discovery in Databases aka Data Mining (Fayyad et al. 1996; Fayyad et al. 1995; Witten and Frank 2000) has attracted the interest of industry and is considered by many to be one of the fastest-growing commercial application areas for Artificial Intelligence techniques. Machine learning and data mining systems are used for analyzing telecommunications network alarms (Hatonen et al. 1996), supporting medical applications (Lavrac 1999), detecting cellular phone fraud (Fawcett and Provost 1997), assisting basketball trainers (Bhandari et al. 1997), learning to play music expressively (Widmer 1996), efficiently controlling elevators (Crites and Barto 1998), automatically classifying celestial bodies (Fayyad et al. 1996), mining documents on the World-Wide Web (Chakrabarti 2000), and, last but not least, tuning the evaluation function of one of the strongest backgammon players on this planet (Tesauro 1995). Many more applications are described in (Michalski, Bratko, and Kubat 1998). For introductions to Machine Learning see (Mitchell 1997a) or (Mitchell 1997b); for an overview of some interesting research directions cf. (Dietterich 1997).

1.2 Game Playing

Research in game playing has fulfilled one of AI's first dreams: a program that beat the chess world champion in a tournament game (Newborn 1996) and, one year later, in a match (Schaeffer and Plaat 1997; Kasparov 1998). The checkers program CHINOOK

was the first to win a human world championship in any game (Schaeffer et al. 1996; Schaeffer 1997). Some of the simpler popular games, like connect-4 (Allis 1988), Go-moku (Allis 1994), or nine men's morris aka mill (Gasser 1995), have been solved. Programs are currently strong competitors for the best human players in games like backgammon (Tesauro 1994), Othello (Buro 1997), and Scrabble (Sheppard 1999). Research is flourishing in many other games, like poker (Billings et al. 1998a), bridge (Ginsberg 1999), shogi (Matsubara 1998) or Go (Müller 1999), although humans still have the competitive edge in these games. A brief overview of the state-of-the-art in computer game playing can be found in (Ginsberg 1998), a detailed discussion in (Schaeffer 2000), which was written for the occasion of the 40th anniversary of the publication of (Samuel 1960).

1.3 Introduction

In this paper, we will attempt to survey the large amount of literature that deals with machine-learning approaches to game playing. Unfortunately, space and time do not permit us to provide introductory knowledge in either machine learning or game playing (but see the references given above). The main goal of this paper is to enable the interested reader to quickly find previous results that are relevant for her research project, so that she may start her investigations from there.

There are several possible ways of organizing the material in this paper. We could, for example, have grouped it by the different games (chess, Go, backgammon, shogi, Othello, bridge, poker, to name a few of the more popular ones) or by the learning techniques used (as we have previously done for the domain of chess (Fürnkranz 1996)). Instead, we decided to take a problem-oriented approach and grouped the material by the challenges that are posed by different aspects of the game. This, we believe, allows both researchers in game playing to find appropriate learning techniques for helping to solve their problems, and machine learning researchers to identify rewarding topics for further research in game-playing domains.

We will start with a discussion of book learning, i.e., techniques that store pre-calculated moves in a so-called book for rapid access in tournament play (Section 2). Next, we will address the problem of using learning techniques for controlling the search procedures that are commonly used in game-playing programs (Section 3). In Section 4, we will review the most popular learning task, namely the automatic tuning of an evaluation function. We will consider supervised learning, comparison training, reinforcement and temporal-difference learning. In a separate subsection, we will discuss several important issues that are common to these approaches. Thereafter (Section 5), we will survey various approaches for automatically discovering patterns and plans, moving from simple advice-taking over cognitive modeling approaches to the induction of patterns and playing strategies from game databases. Finally, in Section 6, we will briefly discuss opponent modeling, i.e., the task of improving the program's play by learning to exploit the weaknesses of particular opponents.

2 Book Learning

Human game players not only rely on their ability to estimate the value of moves and positions but are often also able to play certain positions "by heart", i.e., without having to think about their next move. This is the result of home preparation, opening study, and rote learning of important lines and variations. As computers do not forget, the use of an opening book provides an easy way of increasing their playing strength. However, the construction of such opening books can be quite laborious, and the task of keeping them up-to-date is even more challenging.

In this section, we will discuss some approaches that support this process with machine learning techniques. We will not only concern ourselves with opening books in the strict sense, but summarize several techniques that aim at improving play by preparing the correct move for important positions through off-line analysis.

2.1 Learning to choose opening variations

The idea of using opening books to improve machine play has been present since the early days of computer game playing. Samuel (1959) already used an opening book in his checkers playing program, as did Greenblatt, Eastlake III, and Crocker (1967) in their chess program. Opening books, i.e., pre-computed replies for a set of positions, can be easily programmed and are a simple way of making human knowledge, which can be found in game-playing books, accessible to the machine. However, the question which of the many book moves a program should choose is far from trivial.

Hyatt (1999) tackles the problem of learning which openings his chess program CRAFTY should play and which it should avoid. He proposes a reinforcement learning technique (cf. Section 4.3) to solve this problem, using the computer's position evaluation after leaving the book as an evaluation of the playability of the chosen line. In order to avoid the problem that some openings (gambits) are typically underestimated by programs, CRAFTY uses the maximum or minimum (depending on the trend) of the evaluations of the ten positions encountered immediately after leaving book. Special provisions are made to take into account that values from deeper searches as well as results against stronger opponents are more trustworthy.
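
A minimal sketch of this idea follows; the trend test and the update rule are our own simplifications for illustration, not CRAFTY's actual code:

```python
def score_book_line(post_book_evals, old_score=0.0, learning_rate=0.1):
    """Update the playability score of an opening line from the evaluations
    of the first positions reached after leaving the book (Hyatt-style).

    post_book_evals: evaluations (in pawns, from the program's view) of the
                     ~10 positions encountered immediately after leaving book.
    """
    # Gambit lines tend to look bad at first and recover later; if the
    # evaluations are trending upwards, trust the best value seen,
    # otherwise trust the worst one.
    trend_up = post_book_evals[-1] >= post_book_evals[0]
    outcome = max(post_book_evals) if trend_up else min(post_book_evals)
    # Simple reinforcement-style update of the line's score towards the outcome.
    return old_score + learning_rate * (outcome - old_score)
```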

2.2 Learning to extend the opening book

Hyatt's work relies on the presence of a (manually constructed) book. However—although there is a lot of book information readily available for popular games like chess—for many games no such information exists, and suitable opening books have to be built from scratch. But even for games like chess, mechanisms for automatically extending or correcting existing opening books are desirable. In the simplest case, a program could extend its opening library off-line, by identifying lines that are frequently played and computing the best move for these positions (Frey 1986).

In its matches against Kasparov, DEEP BLUE employed a technique for using the enormous game databases that are available for chess to construct an extended opening book (Campbell 1999). The extended opening book is used in addition to the regular, manually maintained opening book. It is constructed from all positions up to the 30th move from a database of 700,000 grandmaster games. At each of these positions, one could compute an estimate of the goodness of each subsequent move by using the proportion of games won (draws counting as half a win) with this move. However, such a technique has some principal flaws, the most obvious being that a single game or analysis may refute a variation that has been successfully played for years. Hence, DEEP BLUE does not rely solely on this information, but performs a regular search in which it awards small bonuses for moves that are in the extended opening book. The bonuses are not only based on the win percentages, but also on the number of times a move has been played, the strength of the players that played the move, annotations that were attributed to the move, the recency of the move, and several others. In total, these criteria can add up to increase (or decrease) a move's evaluation by about half a pawn. This amount is small enough to still allow the search to detect and avoid (or exploit) faulty moves in the database, but large enough to bias the move selection towards variations that have been successfully employed in practice. The procedure seems to work quite well, which is exemplified by the second game of DEEP BLUE's first match with Kasparov, where, due to a configuration error, its regular book was disabled and DEEP BLUE relied exclusively on its combination of regular search and extended book to find a reasonable way through the opening jungle (Campbell 1999).
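
The core statistic is easy to state; the sketch below is our own illustration with a single made-up weighting, not DEEP BLUE's actual bonus formula, which combined many more criteria:

```python
def win_proportion(wins, draws, losses):
    """Estimate the goodness of a book move: draws count as half a win."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games if games else 0.0

def book_bonus(wins, draws, losses, max_bonus=0.5):
    """Map the win proportion into a small evaluation bonus (in pawns),
    centered so that a 50% score gives no bonus, and scaled down for
    rarely played moves so that the regular search can still override it."""
    games = wins + draws + losses
    p = win_proportion(wins, draws, losses)
    confidence = min(1.0, games / 100.0)
    bonus = (p - 0.5) * 2 * max_bonus * confidence
    return max(-max_bonus, min(max_bonus, bonus))
```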

Buro (1999b) describes a technique that is used in several of the strongest Othello programs. It incrementally builds an opening book by adding every position of every game encountered by the program (or from external sources) to a move tree. The program evaluates all positions as it goes, and enters not only the move played but also the next-best move according to its own evaluation. Thus, at each node, at least two options are available, the move played and an alternative suggestion. Each leaf node of the book tree is either evaluated by the outcome (in case it was the terminal position of a game) or by a heuristic evaluation (in case it was an alternative entered by the program). During play, the program finds the current position in the book and evaluates it with a negamax search through its subtree. The use of negamax search over these game trees with added computer evaluations is much more informative than straight-forward frequency-based methods. As long as the move tree can be kept in memory, this evaluation can be performed quite efficiently because all nodes in the book tree are already evaluated and thus there is no need for keeping track of the position on the board. It is also quite straight-forward to adjust the behavior of the book by, for example, specifying that draws are considered losses, so that the program will avoid drawing book lines.2 Buro's article is reprinted as Chapter 4 of (Fürnkranz and Kubat 2001).
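
A minimal sketch of such a book tree and its negamax evaluation could look like the following; the class layout, the outcome encoding (+1/0/-1), and the names are ours, chosen for illustration:

```python
class BookNode:
    def __init__(self, leaf_value=None):
        self.children = {}            # move -> BookNode (move played + program's alternative)
        self.leaf_value = leaf_value  # game outcome or heuristic eval, from this node's player's view

def negamax_book(node, avoid_draws=False):
    """Propagate leaf evaluations up the book tree.  No board is needed:
    all leaves already carry an evaluation."""
    if not node.children:
        value = node.leaf_value
        if avoid_draws and value == 0:
            value = -1                # treat drawn outcomes as losses to avoid drawing book lines
        return value
    return max(-negamax_book(child, avoid_draws) for child in node.children.values())

def best_book_move(node, avoid_draws=False):
    """Choose the book move whose subtree has the best negamax value."""
    return max(node.children,
               key=lambda m: -negamax_book(node.children[m], avoid_draws))
```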

2.3 Learning from mistakes

A straight-forward approach to learning not to repeat mistakes is to remember each position in which the program made a mistake, so that it is alert when this position is encountered the next time. The game-playing system HOYLE (Epstein 2001) implements such an approach. After each decisive game, HOYLE looks for the last position in which the loser could have made an alternative move and tries to determine the value of this position through exhaustive search. If the search succeeds, the state is marked as "significant" and the optimal move is recorded for future encounters with this position. If the search does not succeed (and hence the optimal move could not be determined), the state is marked as "dangerous". More details on this technique can be found in (Epstein 2001).

In conventional, search-based game-playing programs such techniques can easily be implemented via their transposition tables (Greenblatt et al. 1967; Slate and Atkin 1983). Originally, transposition tables were only used locally with the aim of avoiding repetitive search efforts (e.g., by avoiding to repeatedly search for the evaluation of a position that can be reached with different move orders). However, the potential of using global transposition tables, which are initialized with a set of permanently stored positions, to improve play over a series of games was soon recognized.

2 Actually, Buro suggests to discern between public draws and private draws. The latter—being a result of the program's own analysis or experience and thus, with some chance, not part of the opponent's book knowledge—could be tried in the hope that the opponent makes a mistake, while the former may lead to boring draws when both programs play their book moves (as is known from many chess grandmaster draws).

Once more, it was Samuel (1959) who made the first contribution in this direction. His checkers player featured a rote learning procedure that simply stored every position encountered together with its evaluation so that it could be reused in subsequent searches. With such techniques, deeper searches are possible because, on the one hand, the program is able to save valuable time because positions encountered in memory do not have to be re-searched. On the other hand, if the search encounters a stored position during the look-ahead, it is effectively able to search the original position to an increased depth because, by re-using the stored evaluation, the search depth at which this evaluation has been obtained is in effect added to the search depth at which the stored position is encountered. A very similar technique was used in the BEBE chess program, where the transposition table was initialized with positions from previous games. It has been experimentally confirmed that this simple learning technique in fact improves its score considerably when playing 100-200 games against the same opponent (Scherzer et al. 1990).
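
The depth bookkeeping can be made concrete with a small sketch; this is a simplified, hypothetical table layout, not BEBE's or CRAFTY's actual data structures:

```python
# Persistent transposition table kept across games: position hash -> (evaluation, depth searched).
learned_table = {}

def probe(pos_hash, remaining_depth):
    """Return a stored evaluation if it was obtained at least as deeply as the
    search we would otherwise perform from this node; re-using it effectively
    adds the stored depth to the depth at which the position is encountered."""
    entry = learned_table.get(pos_hash)
    if entry and entry[1] >= remaining_depth:
        return entry[0]
    return None

def store(pos_hash, value, depth):
    """Keep the deepest evaluation seen for a position across games."""
    entry = learned_table.get(pos_hash)
    if entry is None or depth > entry[1]:
        learned_table[pos_hash] = (value, depth)
```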

In games like chess, such rote learning techniques help in opening or endgame play. In complicated middlegame positions, where most pieces are still on the board, chances are considerably lower that the same position will be encountered in another game. Thus, in order to avoid an explosion of memory costs by saving unnecessary positions, Samuel (1959) also devised a scheme for forgetting positions that are not or only infrequently used. Other authors tried to cope with these problems by being selective about which positions are added to the table. For example, Hsu (1985) tried to identify the faulty moves in lost games by looking for positions in which the value of the evaluation function suddenly drops. Positions near that point were re-investigated with a deeper search. If the program detected that it had made a mistake, the position and the correct move were added to the program's global transposition table. If no single move could be blamed for the loss, a re-investigation of the game moves with a deeper search was started with the first position that was searched after leaving the opening book. Frey (1986) describes two cases where an Othello program (Hsu 1985) successfully learned to avoid a previous mistake. Similar techniques were refined later (Slate 1987) and were incorporated into state-of-the-art game playing programs, such as the chess program CRAFTY (Hyatt 1999).

Baxter, Tridgell, and Weaver (1998b) also adopt this technology but discuss some inconsistencies and propose a few modifications to avoid them. In particular, they propose to insert not only the losing position but also its two successors. Every time a position is inserted, a consistency check is performed to determine whether the position leads to another book position with a contradictory evaluation, in which case both positions are re-evaluated. This technique has the advantage that only moves that have been evaluated by the computer are entered into the book, so that it never stumbles "blindly" into a bad book variation.

2.4 Learning from simulation

The previous techniques were developed for deterministic, perfect information games where evaluating a position is usually synonymous with searching all possible continuations to a fixed depth. Some of them may be hard to adapt to games with imperfect information (e.g., card games like bridge) or a random component (e.g., dice games like backgammon) where deep searches are infeasible and techniques like storing pre-computed evaluations in a transposition table do not necessarily lead to significant changes in playing strength. In these cases, however, conventional search can be replaced by simulation search (Schaeffer 2000), a search technique which evaluates positions by playing a multitude of games from this starting position against itself. In each of these games, the indeterministic parameters are assigned different, concrete values (e.g., by different dice rolls or by dealing the opponents a different set of cards or tiles). Statistics are kept over all these games, which are then used for evaluating the quality of the moves in the current state.
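
In its simplest form, simulation search is a Monte-Carlo average over randomized self-play games. A generic sketch, assuming a hypothetical game interface (state.apply, legal_moves, play_out, randomize), is:

```python
def simulation_search(state, legal_moves, play_out, randomize, n_games=100):
    """Evaluate each candidate move by averaging the outcomes of many
    self-play games in which the hidden or random elements (dice rolls,
    opponents' cards, tiles, ...) are re-sampled for every game.

    play_out(state)  -> final game result from the deciding player's view
    randomize(state) -> a copy of state with the hidden information re-dealt
    """
    scores = {}
    for move in legal_moves(state):
        total = 0.0
        for _ in range(n_games):
            sampled = randomize(state.apply(move))
            total += play_out(sampled)
        scores[move] = total / n_games
    return max(scores, key=scores.get), scores
```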

Tesauro (1995) notes that such roll-outs can produce quite reliable comparisons between moves, even if the program used is not of master strength. In the case of backgammon, such analyses have subsequently led to changes in opening theory (Robertie 1992, 1993). Similar techniques can be (and indeed are) used for position evaluation in games like bridge (Ginsberg 1999), Scrabble (Sheppard 1999), or poker (Billings et al. 1999), and were even tried as an alternative to conventional search in the game of Go (Brügmann 1993). It would also be interesting to explore the respective advantages of such Monte-Carlo search techniques and reinforcement learning (see (Sutton and Barto 1998) for a discussion of this issue in other domains).

3 Learning Search Control

A typical implementation of a search-based game-playing program has a multitude of parameters that can be used to make the search more efficient. Among them are search depth, search extensions, quiescence search parameters, move ordering heuristics and more.3 Donninger (1996, personal communication) considers an automatic optimization of search parameters in his world-class chess program NIMZO as more rewarding than tuning the parameters of NIMZO's evaluation function. Similarly, Baxter, Tridgell, and Weaver (2000) consider the problem of "learning to search selectively" a rewarding topic for further research. Surprisingly, this area is more or less still open research in game playing. While there have been various attempts at using learning to increase the efficiency of search algorithms in other areas of Artificial Intelligence (Mitchell et al. 1983; Laird et al. 1987), this approach has so far been neglected in game playing.

Moriarty and Miikkulainen (1994) train a neural network with a genetic algorithm to prune unpromising branches of a search tree in Othello. Experiments show that their selective search maintains the same playing level as a full-width search. Similarly, Dahl (2001) uses shape-evaluating neural networks to avoid searching unpromising moves in Go.

3 For more information about alpha-beta techniques and related search algorithms see (Schaeffer 2000) for a quick introduction and (Schaeffer 1989) and (Plaat 1996) for details.

Buro (1995a) introduces the PROBCUT selective search extension, whose basic idea is to use the evaluations obtained by a shallow search to estimate the values that will be obtained by a deeper search. The relation between these estimates is computed by means of linear regression from the results of a large number of deep searches. During play, branches that seem unlikely to produce a result within the current alpha-beta search window are pruned. In subsequent work, Buro (1999a) generalized these results to allow pruning at multiple levels and with multiple significance thresholds. Again, the technique proves to be useful in Othello, where the correlation between shallow and deep evaluations seems to be particularly strong. However, this evaluation stability might also be observed in other games where a quiescence search is used to evaluate only stable positions (Buro 2000, personal communication). It is an open question how well these algorithms generalize to other games in comparison to competing successful selective search techniques (like, e.g., null-move pruning).
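
A rough sketch of the PROBCUT idea follows; it is our own simplification of Buro's method, with a single cut depth and a single significance threshold:

```python
def probcut_prune(shallow_value, alpha, beta, a, b, sigma, t=1.5):
    """Decide whether a deep search of this branch can be skipped.

    The deep-search value is modeled from the shallow-search value by linear
    regression, v_deep ~= a * v_shallow + b, with residual standard deviation
    sigma estimated off-line from many (shallow, deep) search pairs; t is the
    significance threshold in standard deviations.
    """
    predicted = a * shallow_value + b
    if predicted - t * sigma >= beta:
        return "fail-high"   # deep search would almost surely exceed beta
    if predicted + t * sigma <= alpha:
        return "fail-low"    # deep search would almost surely stay below alpha
    return None              # cannot prune; perform the deep search
```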

Other researchers took a less radical approach by trying to learn move ordering heuristics for increasing the efficiency of alpha-beta search. More efficient move orderings will not directly change the outcome of a fixed-depth alpha-beta search but may result in significantly faster searches. The saved time can then be used for deeper searches, which may lead to better play. Greer, Ojha, and Bell (1999) train a neural network to learn to identify the important regions of a chess position and order moves according to the influence they exert upon these regions. Inuzuka et al. (1999) suggest the use of inductive logic programming4 for learning a binary preference relation between moves and present some preliminary results with an Othello player. A similar approach was previously suggested by Utgoff and Heitman (1988), who used multivariate decision trees for representing the learned predicates. However, they used the learned preference predicate not for move ordering in a search algorithm, but to directly select the best move in a given position (as in (Nakano et al. 1998)). As the learned predicate need not be transitive or even symmetric (cf. Section 4.2), some heuristic procedure has to be used to turn the pair-wise comparisons of all successor states into a total order. One could, for example, treat them as results between players in a round-robin tournament and compute the winner, thereby relying on some typical tie-breaking methods for tournaments (like the Sonneborn-Berger system that is frequently used in chess), as illustrated in the sketch below.
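
The following small sketch of this ranking step is our own illustration, not taken from any of the cited systems:

```python
def rank_moves(moves, prefers):
    """Turn a (possibly intransitive and asymmetric) pairwise preference
    predicate into a total order by scoring a round-robin tournament.

    prefers(a, b) -> True if the learned predicate prefers move a over move b.
    """
    scores = {m: 0.0 for m in moves}
    for a in moves:
        for b in moves:
            if a is b:
                continue
            if prefers(a, b):
                scores[a] += 1.0          # a "wins" its game against b
            # Ties (neither move preferred) could be counted as half a point;
            # Sonneborn-Berger-style tie-breaking is omitted here.
    return sorted(moves, key=lambda m: scores[m], reverse=True)
```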

However, for the purposes of move ordering in search, conventional techniques such as the history heuristic (Schaeffer 1983) or the killer heuristic (Akl and Newborn 1977) are already quite efficient and hard to beat (Schaeffer 2000), so that it is doubtful that the above-mentioned learning techniques will find their way into competitive game-playing programs. Nevertheless, they are a first step in the right direction and will hopefully spawn new work in this area. For example, the tuning of the various parameters that control search extension techniques would be a worthwhile goal (Donninger 1996, personal communication). It may turn out that this is harder than evaluation function tuning—which we will discuss in the next section—because it is hard to separate these parameters from the search algorithm that is controlled by them.

4 Inductive logic programming (ILP) refers to a family of learning algorithms that are able to induce PROLOG programs and can thus rely on a more expressive concept language than conventional learning algorithms, which operate in propositional logic (Muggleton 1992; Lavrac and Dzeroski 1993; De Raedt 1995; Muggleton and De Raedt 1994). Its strengths become particularly important in domains where a structural description of the training objects is of importance, like, e.g., in describing molecular structures (Bratko and King 1994; Bratko and Muggleton 1995). ILP also seems to be appropriate for many game playing domains, in which a description of the spatial relation between the pieces is often more important than their actual location.

4 Evaluation Function Tuning

The most extensively studied learning problem in game playing is the automatic adjustment of the weights of an evaluation function. Typically, the situation is as follows: the game programmer has provided the program with a library of routines that compute important properties of the current board position (e.g., the number of pieces of each kind on the board, the size of the territory controlled, etc.). What is not known is how to combine these pieces of knowledge and how to quantify their relative importance.

The known approaches to solving this problem can be categorized along several dimensions. In what follows, we will discriminate them by the type of training information they receive. In supervised learning, the evaluation function is trained on information about its correct values, i.e., the learner receives examples of positions or moves along with their correct evaluation values. In comparison training, it is provided with a collection of move pairs and the information which of the two is preferable. Alternatively, it is given a collection of training positions and the moves that have been played in these positions. In reinforcement learning, the learner does not receive any direct information about the absolute or relative value of the training positions or moves. Instead, it receives feedback from the environment about whether its moves were good or bad. In the simplest case, this feedback simply consists of the information whether it has won or lost the game. Temporal-difference learning is a special case of reinforcement learning which can use evaluation function values of later positions to reinforce or correct decisions earlier in the game. This type of algorithm, however, has become so fashionable for evaluation function tuning that it deserves its own subsection. Finally, in Section 4.5, we will discuss a few important issues for evaluation function training.

4.1 Supervised learning

A straight-forward approach to learning the weights of an evaluation function is to provide the program with example positions for which the exact value of the evaluation function is known. The program then tries to adjust the weights in a way that minimizes the error of the evaluation function on these positions. The resulting function, learned by linear optimization or some non-linear optimization technique like back-propagation training for neural networks, can then be used to evaluate new, previously unseen positions.

Mitchell (1984) applied such a technique to learning an evaluation function for the game of Othello (see also (Frey 1986)). He selected 180 positions that occurred in tournament play after the 44th move and computed the exact minimax value for these positions with exhaustive search. These values were then used for computing appropriate weights for the 28 features of a linear evaluation function by means of regression.
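
In its simplest form, this is an ordinary least-squares fit of the feature weights to the known target values. A generic sketch (using numpy, with hypothetical data shapes) is:

```python
import numpy as np

def fit_linear_evaluation(features, targets):
    """Supervised tuning of a linear evaluation V(x) = w . f(x).

    features: (n_positions, n_features) matrix of feature values f(x)
    targets:  (n_positions,) vector of known evaluations, e.g. exact
              minimax values of the training positions
    Returns the weight vector w that minimizes the squared error.
    """
    w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return w

def evaluate(w, position_features):
    """Apply the learned evaluation function to a new position."""
    return float(np.dot(w, position_features))
```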

In the game of Othello, Lee and Mahajan (1988) relied on BILL, a—for the time—very strong program5 that used hand-crafted features, to provide training examples by playing a series of games against itself. Variety was ensured by playing the first 20 plies randomly. Each position was labeled as won or lost, depending on the actual outcome of the game, and represented with four different numerical feature scores (Lee and Mahajan 1990). The covariance matrix of these features was computed from the training data, and this information was used for learning several Bayesian discriminant functions (one for each ply from 24 to 49), which estimated the probability of winning in a given position. The results showed a great performance improvement over the original program. A similar procedure was used by Buro (1995b). He further improved classification accuracy by building a complete position tree of all games. Interior nodes were labelled with the results of a fast negamax search, which he also used for his approach to opening book learning (see Section 2.2 and (Buro 2001)).

5 In 1997, BILL was tested against Buro's LOGISTELLO and appeared comparably weak: running on equal hardware and using 20 minutes per game, BILL was roughly on par with 4-ply LOGISTELLO, which only used a couple of seconds per game (Buro 2000, personal communication).

Tesauro and Sejnowski (1989) trained the first neural-network evaluation function of the program that later developed into TD-GAMMON by providing it with several thousand expert-rated training positions. The training positions came from several sources: textbook games, Tesauro's games, and games of the network. In each position, Tesauro rated several moves, in particular the best and the worst moves in the position. He underlined the importance of having examples of bad moves in addition to good moves (otherwise the program would tend to rate each move as good), as well as the importance of having manually constructed training examples that are suitable for guiding the learner into specific directions. In particular, important but rarely occurring concepts should be trained by providing a number of training positions that illustrate the point. Interesting is also Tesauro's technique for handling unlabelled examples (i.e., legal moves in a training position that have not been rated by the expert). He claims that they are necessary for exposing the network to a wider variety of training positions (thus escaping local minima). Hence, he randomly included some of these positions, even though he could only label them with a random value.

The main problem of supervised approaches is to obtain appropriate labels for the training examples. Automated approaches—computing the game-theoretic value or training on the values of a strong program—are in general only of limited utility.6 The manual approach, on the other hand, is very time-intensive and costly. Besides, experts are not used to evaluating game states in a quantitative way. While one can get a rough, qualitative estimate of the value of a game state (e.g., qualitative annotation symbols such as ± and ∓ that are frequently used for evaluating chess positions), it is hard for an expert to evaluate a position in a quantitative way (e.g., "White's advantage is worth 0.236 pawns.").

In some works, providing exact values for training the evaluation function was not considered as important; instead, the training examples were simply regarded as a means of signaling to the learner in which direction it should adjust its weights. For example, in the above-mentioned work, Tesauro and Sejnowski (1989) did not expect their network to exactly reproduce the given training information. In fact, overfitting the training data did hurt the performance on independent test data, a common phenomenon in machine learning. Likewise, in (Dahl 2001), a neural network is trained to evaluate parts of Go positions, so-called receptive fields. Its training input consists of a number of positive examples, receptive fields in which the expert played into the center, and for each of them a negative example, another receptive field from the same position, which was chosen randomly from the legal moves that the expert did not play.

Obviously, it is quite convenient to be able to use expert moves as training information. While Dahl's above-mentioned approach uses information about expert moves for explicitly selecting positive and negative training examples, comparison training refers to approaches in which this information can be directly used for minimizing the error of the evaluation function.

6 An exception is the game of poker, where the computation of the correct action in hindsight—if the opponents' cards are known—is fairly trivial. This approach was used in Waterman's (1970) poker player (cf. Section 5.3).

4.2 Comparison training

Tesauro (1989a) introduced a new framework for training evaluation functions, which he called comparison training. As in (Utgoff and Heitman 1988, see Section 3), the learner is not given exact evaluations for the possible moves (or resulting positions) but is only informed about their relative order. Typically, it receives examples in the form of move pairs along with a training signal as to which of the two moves is preferable. However, the learner does not learn an explicit preference relation between moves as in (Utgoff and Heitman 1988), but tries to use this kind of training information for tuning the parameters of an evaluation function (Utgoff and Clouse 1991). Thus, the learner receives less training information than in the supervised setting, but more information than in the reinforcement learning setting, which will be discussed in the next two sections.

In the general case, there are two problems with such an architecture: efficiency and consistency. For computing the best among n moves, the program has to compare n(n−1)/2 move pairs. Furthermore, the predicted preferences need not be transitive: the network might prefer move a over b, b over c, and c over a. In fact, it may not even be symmetric: the network might prefer a over b and b over a, depending on the order in which the moves are provided as inputs to the network.

Tesauro (1989a) solved these problems elegantly by enforcing that the network architecture is symmetric (the weights used for processing the first move must be equivalent to the weights used for processing the second move) and separable (the two sub-networks only share the same output unit). In fact, this means that the same sub-network is used in both halves of the entire network, with the difference that the weights that connect the hidden layer of one sub-network to the output unit are multiplied by −1 in the other sub-network. In practice, this means that only one sub-network has to be stored and that this network actually predicts an absolute evaluation for the move that is encoded at its input. Consequently, only n evaluations have to be performed for determining the best move. An experimental comparison between this training procedure and the supervised procedure described in the previous section produced a substantial performance gain, even in cases where a simpler network architecture was used (e.g., in some experiments, the comparison-trained networks did not use a hidden layer). Eventually, this training procedure yielded the NEUROGAMMON program, the first game-playing program based primarily on machine learning technology that won in tournament competition (at the First Computer Olympiad, see Tesauro 1989b, 1990; Levy and Beal 1989).
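
To illustrate the idea of a symmetric, separable comparison architecture, the following sketch uses a linear scorer instead of Tesauro's neural network; all names and the logistic loss are our own choices:

```python
import math

def score(weights, features):
    """Shared sub-network: an absolute evaluation of a single move."""
    return sum(w * f for w, f in zip(weights, features))

def comparison_update(weights, better, worse, lr=0.01):
    """One comparison-training step: both moves of the pair are processed by
    the same scorer, and the weights are nudged so that score(better) exceeds
    score(worse).  Symmetry holds by construction: swapping the two inputs
    only flips the sign of the score difference."""
    diff = score(weights, better) - score(weights, worse)
    p = 1.0 / (1.0 + math.exp(-diff))          # predicted P(better preferred)
    grad = 1.0 - p                             # logistic-loss gradient magnitude
    return [w + lr * grad * (fb - fw) for w, fb, fw in zip(weights, better, worse)]

# At play time only n absolute evaluations are needed:
# best_move = max(moves, key=lambda m: score(weights, features_of(m)))
```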

More recently, a comparison training procedure was used for training a chess evaluation function, where it made a crucial difference in the famous 1997 DEEP BLUE vs. Kasparov match. This work is described in detail in (Tesauro 2001).

Conceptually, the training information in Tesauro's comparison-trained networks consists of a move pair. However, in effect, the learner is only given the move played in a collection of training positions. This information is then used to minimize an error function over all legal moves in that position (see (Tesauro 2001) for details). Likewise, Utgoff and Clouse (1991) show how to encode state (or move) preference predicates as a system of linear inequalities that is no more complex than learning from single states.

Similar techniques for reducing the prediction error on a set of moves have already been used earlier. Once more, it was Samuel (1967) who first realized such a technique, in the second version of his checkers player, where a structure of stacked linear evaluation functions was trained by computing a correlation measure based on the number of times the feature rated an alternative move higher than the expert move.

Nitsche (1982) made this approach clearer by explicitly defining a so-called selection function, which is set to 1 for the move that was actually played in the training position, and to 0 for all other moves. He then computed the least-squares differences between the evaluation function's estimates of all move values and this selection function. Marsland (1985) extends this work by suggesting that the ordering of alternative moves should be included in the learning process as well. He proposes cost functions that give a higher penalty if the expert move has a low ranking than if it is considered as one of the best moves. He also favors a cost function that makes sure that changing the rank of a move near the top has a bigger effect than changing the rank of a move near the bottom.

Similarly, van der Meulen (1989) criticized Nitsche's approach because of its use of a discrete selection function. Instead he proposes to calculate the weights for the evaluation functions using a set of inequalities that define a hyper-rectangle in parameter space, in which the evaluation function would choose the same move that is specified in the training positions. The intersection of several such hyper-rectangles defines a region where the evaluation function finds the correct move for several test positions. Van der Meulen's algorithm greedily identifies several such regions and constructs a suitable evaluation function for each of them. New positions are then evaluated by first identifying an appropriate region and then using the associated evaluation function. However, this algorithm has not been tested, and it is an open question whether the number of optimal regions can be kept low enough for an efficient application in practice.

Hsu et al. (1990a) used an automatic tuning approach for the chess program DEEP THOUGHT, because they were convinced that it is infeasible to correctly reflect the interactions between the more than 100 weights of the evaluation functions. Like Nitsche (1982), they chose to compute the weights by minimizing the squared error between the program's and a grandmaster's choice of moves. Whenever an alternative move received a higher evaluation than the grandmaster's move (after a 5 to 6 ply search), the weights of the dominant position (see Section 4.5) were adjusted to decrease this difference. The authors were convinced that their automatically tuned evaluation function is no worse than the hand-tuned functions of their academic competitors. However, they also remarked that the evaluation functions of top commercial chess programs are still beyond them, but hoped to close the gap soon. Subsequent work on DEEP THOUGHT's successor DEEP BLUE is described in (Tesauro 2001).

Schaeffer et al. (1992) report that the use of similar techniques for tuning a checkers evaluation function on 800 Grandmaster games yielded reasonable results, but was not competitive with their hand-tuned weight settings. The GOLEM Go program (Enderton 1991) used a very similar technique for training a neural network to forward-prune in a one-ply search. However, instead of comparing the master's move to all alternatives (which can be prohibitively many in Go), they required that the master's move has to be better than a random alternative by a certain minimum margin.

Tunstall-Pedoe (1991) tried to optimize the weights of an evaluation function with genetic algorithms,7 where the fitness of a set of weights was evaluated by the percentage of test positions for which the evaluation function chose the move in the training set. A similar technique was also suggested for training a neural network (Tiggelen 1991).

7 A genetic algorithm (Goldberg 1989) is a randomized search algorithm. It maintains a population of individuals that are typically encoded as strings of 0's and 1's. All individuals of a so-called generation are evaluated according to their fitness, and the fittest individuals have the highest chance of surviving into the next generation and of spawning new individuals through the genetic operators cross-over and mutation. For more details, see also (Kojima and Yoshikawa 2001), which discusses the use of genetic algorithms for learning to solve tsume-go problems.

Comparison training worked well in many of the above-mentioned approaches, but it also has some fundamental shortcomings. For example, Schaeffer et al. (1992) observed that a program trained on expert games will imitate a human playing style, which makes it harder for the program to surprise a human being. This point is of particular importance in a game like checkers, where a relatively high percentage of the games end in book draws. On the other hand, this fact could also be used for adapting one's play towards a particular opponent, as has been suggested in (Carmel and Markovitch 1993, see Section 6). Buro (1998) pointed out that the training positions should not only be representative of the positions encountered during play, but representative of the positions encountered during search. This means that a lot of lop-sided and counter-intuitive positions should nevertheless be included in the training set because they will be encountered by the program during an exhaustive search. Last but not least, using expert games for training makes the often unjustified assumption that the move of the master is actually the best, and it also does not give you additional information about the ranking of the alternative moves, some of which might be at least as good as the move actually played. It would be much more convenient (and more elegant) if a player could learn from its own experience, simply by playing numerous games and using the received feedback (won or lost games) for adjusting its evaluation function. This is what reinforcement learning is about.

4.3 Reinforcement learning

Reinforcement learning (Sutton and Barto 1998) is best described by imagining an agent that is able to take several actions and whose task is to learn which actions are most preferable in which states. However, contrary to the supervised learning setting, the agent does not receive training information from a domain expert. Instead, it may explore the different actions and, while doing so, will receive feedback from the environment—the so-called reinforcement or reward—which it can use to rate the success of its own actions. In a game-playing setting, the actions are typically the legal moves in the current state of the game, and the feedback is whether the learner wins or loses the game, or by which margin it does so. We will describe this setting in more detail using MENACE, the Matchbox Educable Noughts And Crosses Engine (Michie 1961, 1963), which learned to play the game of tic-tac-toe by reinforcement.

MENACE has one weight associated with each of the 287 different positions with the first player to move (rotated or mirrored variants of identical positions were mapped to a unique position). In each state, all possible actions (all yet unoccupied squares) are assigned a weight. The next action is selected at random, with probabilities corresponding to the weights of the different choices. Depending on the outcome of the game, the moves played by the machine are rewarded or penalized by increasing or decreasing their weight. Drawing the game was considered a success and was also reinforced (albeit by a smaller amount).

MENACE's first implementation consists of 287 matchboxes, one for each state. The weights for the different actions are represented by a number of beads in different colors, one color for each action. The operator chooses a move by selecting the matchbox that corresponds to the current position, and drawing a single bead from it. Moves that have a higher weight in this position have more beads of their color in the box and therefore a higher chance of being selected. After completing a game, MENACE receives reinforcement through the information whether it has won or lost the game. However, the reward is not received after each move, but at the end of the game. If the program makes a good move, this is not immediately pointed out, but it will receive a delayed reward by winning the game.8 In MENACE's case, positions are rewarded or penalized by increasing/decreasing their associated weights by adding/removing beads to/from each matchbox that was used in the game.
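
A software rendering of MENACE's bead bookkeeping might look like the simplified sketch below; the initial bead counts, the reward amounts, and the data layout are our own choices:

```python
import random

# boxes: position -> {move: bead_count}; one "matchbox" per position.
boxes = {}

def choose_move(position, legal_moves, initial_beads=4):
    """Pick a move with probability proportional to its bead count."""
    box = boxes.setdefault(position, {m: initial_beads for m in legal_moves})
    moves, beads = zip(*box.items())
    return random.choices(moves, weights=beads)[0]

def reinforce(game_trace, outcome):
    """After the game, add or remove beads in every matchbox that was used.

    game_trace: list of (position, move) pairs chosen by the learner.
    outcome: +1 for a win, 0 for a draw, -1 for a loss.
    """
    delta = {1: 3, 0: 1, -1: -1}[outcome]   # a draw is reinforced by a smaller amount
    for position, move in game_trace:
        # Never remove the last bead, so the move stays available.
        boxes[position][move] = max(1, boxes[position][move] + delta)
```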

The main problem that has to be solved by the learner is the so-called credit assignment problem (Minsky 1963), i.e., the problem of distributing the received reward to the actions that were responsible for it. This problem and some domain-specific approaches towards its solution already arose in Section 2.3: in a lost game, only one move might be the culprit that should receive all of the negative reward, while all other moves might have been good moves. However, it is usually not known to the program which move was the mistake. Michie had two proposals for solving the credit assignment problem. The first technique simply gives all moves in the game equal credit, i.e., the same amount of reinforcement (one bead) is added to (removed from) all boxes that have been used in the won (lost) game. The second technique assumes that positions later in the game have a bigger impact on the outcome than positions earlier in the game. This was realized by initializing positions earlier in the game with a higher number of beads, so that adding or removing a single bead results in a smaller change in the weights for the actions of that state. This simple technique does not prevent good positions from receiving negative feedback (if a mistake was made later in the game) or bad positions from receiving positive reward (if the game was won because the opponent did not exploit the mistake). However, the idea is that after many games, good positions will have received more positive than negative reward and vice versa, so that the evaluation function eventually converges to a reasonable value.

8 In card games like Hearts, feedback is often available after each individual move in the form of whether the player made the trick or how many points he made in the trick. Kuvayev (1997) made use of this information for training an evaluation function that predicts the number of points the program will make by playing each card. In this setting, the distinction between supervised learning and reinforcement learning becomes unclear: on the one hand, the learner learns from the reinforcement it receives after each trick, while on the other hand it can use straight-forward supervised learning techniques to do so. However, if we consider the possibility of "shooting the moon", where one player makes all the negative points which are then counted as positive, the reward has to be delayed because it may not be clear immediately after the trick whether the points in it are positive or negative.

In the meantime, convergence theorems for reinforcement learning have confirmed this proposition (Sutton and Barto 1998). Michie, however, had to rely on experimental evidence, first with the physical machine playing against a human tutor, then with a simulation program playing tournaments against a random player and against an independently learning copy of itself. The results showed that MENACE made continuous progress. After a few hundred games, the program produced near-expert play (Michie 1963).

However, MENACE has several obvious shortcomings. The first is the use of a look-up table. Every state has to be encoded as a separate table entry. While this is feasible for tic-tac-toe, more complex games need some sort of generalization over different states of the game because in chess, for example, it is quite unlikely that the same middlegame position will appear twice in a program's life-time. More importantly, however, training on the outcome of the game is very slow, and a large number of games are needed before position evaluations converge to reasonable values. Temporal-difference learning introduced a significant improvement in that regard.

4.4 Temporal-difference learning

A small revolution happened in the field of reinforcement learning when Gerald Tesauro presented his first results on training a backgammon evaluation function by temporal-difference learning (Tesauro 1992a, 1992b). The basic idea of temporal-difference learning is perhaps best understood if one considers the supervised learning approach taken in (Lee and Mahajan 1988), where each encountered position is labelled with the final outcome of the game. Naturally, this procedure is quite error-prone, as a game in which the player is better most of the time can be lost by a single mistake in the endgame. All positions of the game would then be labelled as bad, most of them unjustly.9 If, on the other hand, each position is trained on the position evaluation several moves deeper in the game, the weight adjustments are performed more intelligently: positions close to the final position will be moved towards the evaluation of the final position. However, in the case of a mistake late in the game, the effect of the final outcome will not be visible in the early phases of the game.

9 In the experiments of Lee and Mahajan (1988), the initial phase of random play often yielded lop-sided positions which could easily be evaluated by their training program. However, in other games, one has to rely on human training information, which is naturally error-prone.

Sutton's (1988) TD(λ) learning framework provides an elegant formalization of these ideas. He introduces the parameter λ, which is used to weight the influence of the current evaluation function value on the weight updates for previous moves. In the simple case of a linear evaluation function

$$ V(x) = \sum_i w_i \, f_i(x) $$

the weights are updated after each move according to the formula

$$ w_i^{(t+1)} = w_i^{(t)} + \alpha \, \bigl( V(x_{t+1}) - V(x_t) \bigr) \sum_{k=1}^{t} \lambda^{t-k} \, f_i(x_k) $$

where α is a learning rate between 0 and 1, and V(x_t) is the evaluation function value for the position after the t-th move.

To understand this, consider for a moment the case λ = 0. In this case, the sum at the end reduces to the single value f_i(x_t). The entire formula then specifies that each feature weight is updated by an amount that is proportional to the product of the evaluation term f_i(x_t) and the difference between the successive predictions V(x_t) and V(x_{t+1}). The result of this computation is that the weights are adjusted in the direction of the temporal difference according to the degree to which they contributed to the final evaluation V(x_t). At a setting of λ > 0, the degree to which the feature contributed to the final evaluation of previous positions is also taken into consideration, weighted by λ^{t−k} if the respective position occurred t−k moves ago. A value of λ = 1 assigns the same amount of credit or blame to all previous moves. It can be shown that in this case the sum of these updates over an entire game corresponds to a supervised learning approach where each position is labelled with the final outcome of the game (which is assigned to x_{N+1} if the game lasted for N moves). For more details, in particular for a discussion of TD(λ)'s sound formal basis and its various convergence properties, see (Sutton and Barto 1998).
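
The update rule translates directly into code; the following is a minimal sketch for a linear evaluation function, in plain Python with variable names of our own choosing:

```python
def td_lambda_update(weights, feature_history, v_t, v_t1, alpha=0.1, lam=0.7):
    """One TD(lambda) weight update for a linear evaluation V(x) = sum_i w_i * f_i(x).

    feature_history: feature vectors f(x_1), ..., f(x_t) of the positions
                     reached so far (f(x_t) last).
    v_t, v_t1:       the successive predictions V(x_t) and V(x_{t+1}).
    """
    delta = v_t1 - v_t                     # the temporal difference
    t = len(feature_history)
    new_weights = list(weights)
    for k, features in enumerate(feature_history, start=1):
        decay = lam ** (t - k)             # lambda^(t-k): older positions count less
        for i, f in enumerate(features):
            new_weights[i] += alpha * delta * decay * f
    return new_weights
```

In practice, the inner sum over past feature vectors is usually maintained incrementally as an eligibility trace rather than recomputed from the full history.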

Tesauro has applied this procedure successfully to train a neural-net evaluation function for his backgammon player. After solving several of the practical issues that are involved in training multi-layer neural networks with TD(λ) on a complex task (Tesauro 1992a), the resulting program—TD-GAMMON—clearly surpassed its predecessors, in particular the Computer Olympiad champion NEUROGAMMON, which was trained with comparison training (Tesauro 1989a, 1990), and it did so entirely from playing against itself. In fact, early versions of TD-GAMMON, which only used the raw board information as features, already learned to play as well as NEUROGAMMON, which used a sophisticated set of features. Adding these features to the input representation further improved TD-GAMMON's playing strength (Tesauro 1992b). Over time (Tesauro 1994, 1995), TD-GAMMON not only increased the number of training games that it played against itself, but Tesauro also increased the search depth and changed the network architecture, so that TD-GAMMON reached world-championship strength.

However, despite TD-GAMMON's overwhelming success, its superiority to its predecessors that were trained with supervised learning or comparison training does not allow the conclusion that temporal-difference learning is the best solution for all games. For example, Samuel (1967), in the follow-up paper on his famous checkers player, had arrived at a different result: comparison training from about 150,000 expert moves appeared to be faster and more reliable than temporal-difference learning from self-play. Similarly, Utgoff and Clouse (1991) demonstrated that a variant of comparison training performed better than temporal-difference learning on a simple problem-solving task. They proposed an integration of both methods that uses comparison training only when the temporal-difference error exceeds a user-defined threshold, and showed that this approach outperformed its predecessors. They also showed that the number of queries to the expert (needed for comparison training) decreases over time.

Many authors have tried to copy TD-GAMMON's learning methodology to other games, such as tic-tac-toe (Gherrity 1993), Othello (Isabelle 1993; Leouski and Utgoff 1996), chess (Gherrity 1993; Schmidt 1994), Go (Schraudolph et al. 1994), connect-4 (Gherrity 1993; Sommerlund 1996), checkers (Lynch 1997), or Hearts (Kuvayev 1997), with mixed success. An application using TD-learning to learn to evaluate the safety of groups in Go from self-play is discussed in (Dahl 2001).

None of these successors, however, achieved a performance that was as impressive as TD-GAMMON's. The reason for this seems to be that backgammon has various characteristics that make it perfectly suited for learning from self-play. Foremost among these are the fact that it only requires a very limited amount of search and that the dice rolls guarantee sufficient variability in the games so that all regions of the feature space are explored.

Chess, for example, does not share these properties. Although this problem has been known early on and, in fact, several solution attempts have been known since the days of Samuel, it took several years and many unsuccessful attempts before a successful chess program could be trained by self-play. For example, Gherrity (1993) tried to train a neural network with Q-learning (a temporal-difference learning procedure that is quite similar to TD(0); Watkins and Dayan 1992) against GNUCHESS, but, after several thousand games, his program was only able to achieve 8 draws by perpetual check. The NEUROCHESS program (Thrun 1995) also used TD(0) for training a neural network evaluation function by playing against GNUCHESS. Its innovation was the use of a second neural network that was trained to predict the board position two plies after the current position (using 120,000 expert games for training). The evaluation function was then trained with its own estimates for the predicted positions using a TD(0) algorithm. NEUROCHESS steadily increased its winning percentage from 3% to about 25% after about 2000 training games.

Finally, Baxter, Tridgell, and Weaver (1998a) were able to demonstrate a significant increase in playing strength using TD-learning in a chess program. Initially, their program, KNIGHTCAP, knew only about the value of the pieces and was rated at around 1600 on an Internet chess server. After about 1000 games on the server, its rating had improved to over 2150, which is an improvement from an average amateur to a strong expert. The crucial ingredients to KNIGHTCAP's success were the availability of a wide variety of training partners on the chess server and a sound integration of TD-learning into the program's search procedures. We will briefly discuss these aspects in the next section, but more details can be found in (Baxter, Tridgell, and Weaver 2001).

4.5 Issues for evaluation function learning

There are many interesting issues in learning to automatically tune evaluation functions. We will discuss some of them in a little more detail, namely the choice of linear vs. non-linear evaluation functions, the issue of how to train the learner optimally, the integration of evaluation function learning and search, and automated ways for constructing the features used in an evaluation function.


4.5.1 Linear vs. non-linear evaluation functions

Most conventional game-playing programs depend on fast search algorithms and thus require an evaluation function that can be quickly evaluated. A linear combination of a few features that characterize the current board situation is an obvious choice here. Manual tuning of the weights of a linear evaluation function is comparably simple, but already very cumbersome. Not only may the individual evaluation terms depend on each other, so that small changes in one weight may affect the correctness of the settings of other weights, but all weights also depend on the characteristics of the program in which they are used. For example, the importance of being able to recognize tactical patterns such as fork threats may decrease with the program's search depth or depend on the efficiency of the program's quiescence search.

However, advances in automated tuning techniques have even made the use of non-linear function approximators feasible. Samuel (1967) already suggested the use of signature tables, a non-linear, layered structure of look-up tables. Clearly, non-linear techniques have the advantage that they can approximate a much larger class of functions. However, they are also slower in training and evaluation. Hence the question is whether they are necessary for good performance. In many aspects, this problem is reminiscent of the well-known search/knowledge trade-off (Berliner 1984; Schaeffer 1986; Berliner et al. 1990; Junghanns and Schaeffer 1997).

Lee and Mahajan (1988) interpreted the big improvement that they achieved with Bayesian learning over a hand-crafted, linear evaluation function as evidence that a non-linear evaluation function is better than a linear evaluation function. In particular, they showed that the covariance matrices upon which their Bayesian evaluation function is based exhibit positive correlations for all terms in their evaluation function, which refutes the independence assumptions that some training procedures have to make.

On the other hand, Buro (1995b) has pointed out that a quadratic discriminant function may reduce to a linear form if the covariance matrices for winning and losing are (almost) equal. In fact, in his experiments in Othello, he showed that Fisher's linear discriminant obtains even better results than a quadratic discriminant function because both are of similar quality but the former is faster to evaluate (which allows the program to search deeper).10 Consequently, Buro (1995b) argues that the presence of feature correlations, which motivated the use of non-linear functions in (Lee and Mahajan 1988), is only a necessary but not a sufficient condition for switching to a non-linear function. It only pays off if the covariance matrices are different.

Likewise, Tesauro (1995) remarked that his networks tend to learn elementary concepts first, and that these can typically be expressed with a linear function. In fact, in his previous work (Tesauro 1989a), he noticed that a single-layer perceptron trained with comparison training can outperform a non-linear multi-layer perceptron that learned in a supervised way. Nevertheless, Tesauro (1998) is convinced that eventually non-linear structure is crucial to a successful performance. He claims that the networks of Pollack and Blair (1998) that were trained by a stochastic hill-climbing procedure (cf. next section) are inferior to those trained with TD-learning because they are unable to capture non-linearity.

10 Buro (1995b) also showed that a logistic regression obtains even better results because it does not make any assumptions on the probability distributions of the features.

A popular and comparably simple technique for achieving non-linearity, also originating with Samuel, is to use different evaluation functions for different game phases. Tesauro's early version of TD-GAMMON, for example, ignored important aspects like doubling or the running game (when the two opponents' tiles are already separated) because sufficiently powerful algorithms existed for these game parts. Boyan (1992) took this further and showed that the use of different networks for different subtasks of the game can improve over learning with a single neural network for the entire game. For example, he used 8 networks in tic-tac-toe and trained each of them to evaluate positions with n pieces on the board (n = 1, ..., 8). In backgammon, he used 12 networks for different classes of pip counts.11 Similarly, Lee and Mahajan (1988) used 26 evaluation functions, one for each ply from 24 to 49, for training their Othello player BILL. Training information was smoothed by sharing training examples between the positions that were within a distance of 2 plies.
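The mechanics of such a phase-dependent evaluation are straightforward; the following minimal Python sketch (the function names and the phase criterion are illustrative assumptions, not taken from the systems above) simply dispatches to a separate evaluator per game phase.

```python
def make_phase_evaluator(evaluators, phase_of):
    """Dispatch to a separate evaluation function for each game phase.

    evaluators : dict mapping a phase index to an evaluation function
    phase_of   : function mapping a position to its phase (e.g., piece count)
    """
    def evaluate(position):
        return evaluators[phase_of(position)](position)
    return evaluate

# hypothetical usage for tic-tac-toe: one evaluator per number of pieces on the board
# evaluate = make_phase_evaluator({n: nets[n] for n in range(1, 9)},
#                                 phase_of=lambda pos: sum(sq != 0 for sq in pos))
```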

There are several practical problems that have to be solved with such approaches. One is the so-called blemish effect (Berliner 1979), which refers to the problem of keeping the values returned by the different evaluation functions consistent.

4.5.2 Training strategies

Another important issue is how to provide the learner with training information that is on the one hand focussed enough to guarantee fast convergence towards a good evaluation function, but on the other hand provides enough variation to allow the learning of generally applicable functions. In particular in learning from self-play it is important to ensure that the evaluation function is subject to enough variety during training to prevent it from settling at a local minimum. In reinforcement learning, this problem is known as the trade-off between exploration of new options and exploitation of already acquired knowledge.

Interestingly, Tesauro's TD-GAMMON did not have this problem. The reason is that the roll of the dice before each move ensured a sufficient variety in the training positions so that learning from self-play worked very well. In fact, Pollack and Blair (1998) claimed that TD-GAMMON's self-play training methodology was the only reason for its formidable performance. They back up their claim with experiments in which they self-train a competitive backgammon playing network with a rather simple, straight-forward stochastic hill-climbing search instead of Tesauro's sophisticated combination of temporal-difference learning and back-propagation. The methodology simply performs random mutations on the current weights of the network and lets the new version play against the old version. If the former wins in a series of games, the weight changes are kept, otherwise they are undone. Although Tesauro (1998) does not entirely agree with the conclusions that Pollack and Blair (1998) derive from this experiment, the fact that such a training procedure works at all is remarkable.
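The core of such a hill-climbing self-play procedure fits in a few lines; the sketch below is a simplification (Pollack and Blair's actual acceptance and mutation scheme differed in details), and the helpers mutate and play_match are assumed to be provided by the surrounding program.

```python
def hillclimb_selfplay(weights, mutate, play_match, generations=1000):
    """Stochastic hill-climbing self-play (simplified sketch).

    mutate     : returns a randomly perturbed copy of the weights
    play_match : returns the fraction of games the first player wins against the second
    """
    for _ in range(generations):
        challenger = mutate(weights)            # random mutation of the current network
        if play_match(challenger, weights) > 0.5:
            weights = challenger                # keep the change only if the mutant wins
        # otherwise the mutation is undone by simply discarding the challenger
    return weights
```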

On the other hand, Epstein (1994c) presents some experimental evidence that self-training indeed does not perform well in the case of deterministic games such as tic-tac-toe or achi. Her game-learning system HOYLE (Epstein 1992), which is based on the FORR architecture for learning and problem solving (Epstein 1994a), was trained with self-play, a perfect opponent, a randomly playing opponent, and various fallible opponents that made random moves a certain fraction of the time and played the other moves perfectly. Her results showed that neither the perfect nor the random trainers, nor learning by self-play, produced optimal results. Increasing the percentage of random moves of the fallible trainers, which encourages exploration, usually increased the number of games they lost without increasing the number of games won against an expert benchmark player. As a consequence, Epstein (1994c) proposed lesson-and-practice training, in which phases of training with an expert player are intertwined with phases of self-play, and demonstrated its superiority. More details on HOYLE can be found in Section 5.2 and, in particular, in (Epstein 2001).

11 Pip counts are a commonly used metric for evaluating the progress in a backgammon game.

Potential problems with self-play were already noted earlier, and a variety of alternative strategies were proposed. Samuel (1959) suggested a simple strategy: his checkers player learned against a copy of itself that had its weights frozen. If, after a certain number of games, the learning copy made considerable progress, the current state of the weights was transferred to the frozen copy, and the procedure continued. Occasionally, when no progress could be detected, the weights of the evaluation function were changed radically by setting the largest weight to 0. This procedure reduced the chance of getting stuck in a local optimum. A similar technique was used by Lynch (1997).
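In outline, Samuel's frozen-copy scheme can be sketched as follows; the promotion and stagnation thresholds and the helpers train and play_match are assumptions introduced here for illustration, not values reported by Samuel.

```python
def frozen_copy_training(weights, train, play_match, epochs=100, games_per_epoch=50):
    """Sketch of self-play against a frozen copy in the spirit of Samuel (1959)."""
    frozen = list(weights)                       # opponent with frozen weights
    for _ in range(epochs):
        weights = train(list(weights), frozen, games_per_epoch)   # learner plays and updates
        score = play_match(weights, frozen)      # fraction of games won by the learner
        if score > 0.6:                          # considerable progress: promote the learner
            frozen = list(weights)
        elif score < 0.5:                        # no progress: radical perturbation
            i = max(range(len(weights)), key=lambda j: abs(weights[j]))
            weights[i] = 0.0                     # set the largest weight to zero
    return weights
```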

Lee and Mahajan (1988) and Fogel (1993) ensured exploration of different regions of the parameter space by having their Othello and tic-tac-toe players learn from games in which the first moves were played randomly. Similarly, Schraudolph et al. (1994) used Gibbs sampling for introducing randomness into the move selection. Boyan (1992) inserted randomness into the play of his program's tic-tac-toe training partner. Angeline and Pollack (1994) avoided this problem by evolving several tic-tac-toe players in parallel with a genetic algorithm. The fitness of the players of each generation was determined by having them play a tournament.
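Gibbs (Boltzmann) move selection is a simple way to inject such randomness: instead of always playing the best-rated move, a move is sampled with probability proportional to the exponential of its evaluation. The sketch below is a generic illustration of this idea, not the exact scheme used by Schraudolph et al.; evaluate and the temperature are assumed inputs.

```python
import math
import random

def gibbs_move_choice(moves, evaluate, temperature=1.0):
    """Sample a move with probability proportional to exp(value / temperature)."""
    values = [evaluate(m) for m in moves]
    best = max(values)                                       # subtract max for numerical stability
    weights = [math.exp((v - best) / temperature) for v in values]
    return random.choices(moves, weights=weights, k=1)[0]
```

A high temperature approaches uniformly random play (maximal exploration), while a temperature near zero approaches greedy play (pure exploitation).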

Finally, the success of the chess program KNIGHTCAP (Baxter et al. 1998b) that employed TD-learning was not only due to the integration of TD-learning into search (cf. next section) but also to the fact that the program learned by playing on an Internet chess server where it could learn from a wide variety of opponents of all playing strengths. As human players tend to match with opponents of approximately their own strength (i.e., players with a similar rating), the strength of KNIGHTCAP's opposition increased with its own playing strength, which intuitively seems to be a good training procedure (Baxter, Tridgell, and Weaver 2001).

4.5.3 Evaluation function learning and search

In backgammon, deep searches are practically infeasible because of the large branching factor that is due to the chance element introduced by the use of dice. However, deep searches are also beyond the capabilities of human players, whose strength lies in estimating the positional value of the current state of the board. Contrary to the successful chess programs, which can easily out-search their human opponents but still trail their ability to estimate the positional merits of the current board configuration, TD-GAMMON was able to excel in backgammon for the same reasons that humans play well: its grasp of the positional strengths and weaknesses was excellent.

However, in games like chess or checkers, deep searches are necessary for expert performance. A problem that has to be solved for these games is how to integrate learning into the search techniques. In particular in chess, one has the problem that the static evaluation of the position at the root of the search often has completely different characteristics than the value returned by the search. Consider the situation where one is in the middle of a queen trade. The current board situation will evaluate as "being one queen behind", while a little bit of search will show that the position is actually even because the queen can easily be recaptured within the next few moves. Straight-forward application of an evaluation function tuning algorithm would then simply try to adjust the evaluation of the current position towards being even. Clearly, this is not the right thing to do because simple tactical patterns like piece trades are typically handled by the search and need not be recognized by the evaluation function.

The solution for this problem is to base the evaluation on the dominant position of the search. The dominant position is the leaf position in the search tree whose evaluation has been propagated back to the root of the search tree. Most conventional search-based programs employ some form of quiescence search to ensure that this evaluation is fairly stable. Using the dominant position instead of the root position makes sure that the estimation of the weight adjustments is based on the position that was responsible for the evaluation of the current board position. Not surprisingly, this problem had already been recognized and solved by Samuel (1959) but seemed to have been forgotten later on. For example, Gherrity (1993) published a thesis on a system architecture that integrates temporal-difference learning and search for a variety of games (tic-tac-toe, Connect-4, and chess), but this problem does not seem to be mentioned.
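For a linear evaluation function, basing the update on the dominant position amounts to computing both the temporal-difference error and the gradient at the principal-variation leaf rather than at the root. The following is a minimal sketch only; the helper search (which is assumed to return the dominant leaf of a quiescent search) and the feature mapping phi are not part of the cited systems.

```python
import numpy as np

def dominant_position_update(w, phi, root, next_root, search, alpha=0.001):
    """One TD step for a linear evaluation, taken at the dominant (leaf) positions."""
    leaf, next_leaf = search(root), search(next_root)
    y = float(np.dot(w, phi(leaf)))
    y_next = float(np.dot(w, phi(next_leaf)))
    # the error and the gradient are both taken at the leaf, so tactical exchanges
    # that the search already resolves are not "unlearned" by the evaluation function
    return w + alpha * (y_next - y) * phi(leaf)
```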

Hsu et al. (1990b) implemented the above solution into their comparison training framework. They compared the dominant position of the subtree starting with the grandmaster's move to the dominant position of any alternative move at a shallow 5 to 6 ply search. If an alternative move's dominant position gets a higher evaluation, an appropriate adjustment direction is computed in the parameter space and the weight vector is adjusted a little in that direction. For a more recent technique integrating comparison training with deeper searches see (Tesauro 2001).

As far as temporal-difference learning is concerned, it was only recently that Beal and Smith (1997) and Baxter, Tridgell, and Weaver (1998b) independently re-discovered this technique for a proper integration of learning and search. Beal and Smith (1998) subsequently applied this technique for learning piece values in shogi. A proper formalization of the algorithm can be found in (Baxter et al. 1998a) and in (Baxter, Tridgell, and Weaver 2001).

4.5.4 Feature construction

The crucial point for all approaches that tune evaluation functions is the presence of carefully selected features that capture important information about the current state of the game which goes beyond the location of the pieces. In chess, concepts like king safety, center control or mobility are commonly used for evaluating positions, and similar abstractions are used in other games as well (Lee and Mahajan 1988; Enderton 1991). Tesauro and Sejnowski (1989) report an increase in playing strength of 15 to 20% when adding hand-crafted features that capture important concepts typically used by backgammon experts (e.g., pip counts) to their neural network backgammon evaluation function. Although Tesauro later demonstrated that his TD(λ)-trained network could surpass this playing level without these features, re-inserting them brought yet another significant increase in playing strength (Tesauro 1992b). Samuel (1959) already concluded his famous study by making the point that the most promising road towards further improvements of his approach might be ". . . to get the program to generate its own parameters for the evaluation polynomial" instead of learning only weights for manually constructed features. However, in the follow-up paper, he had to concede that the goal of ". . . getting the program to generate its own parameters remains as far in the future as it seemed to be in 1959" (Samuel 1967).

Since then, the most important new step into that direction has been the development of automated training procedures for multi-layer neural networks. While the classical single-layer neural network, the so-called perceptron (Rosenblatt 1958; Minsky and Papert 1969), combines the input features in a linear fashion to a single output unit, multi-layer perceptrons contain at least one so-called hidden layer, in which some or all of the input features are combined to intermediate concepts (Rumelhart and McClelland 1986; Bishop 1995). In the most common case, the three-layer network, the outputs of these hidden layer units are directly connected to the output unit. The weights to and from the hidden layer are typically initialized with random values. During training of the network, the weights are pushed into directions that facilitate the learning of the training signal, and useful concepts emerge in the hidden layer. Tesauro (1992a) shows examples for two hidden units of TD-GAMMON that he interpreted as a race-oriented feature detector and an attack-oriented feature detector.
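The forward pass of such a three-layer network is very simple; the sketch below uses arbitrary layer sizes and random initialization purely for illustration and is not a reproduction of any of the networks discussed here.

```python
import numpy as np

def mlp_evaluate(x, W1, b1, w2, b2):
    """Three-layer perceptron: inputs -> hidden concepts -> single evaluation output."""
    hidden = np.tanh(W1 @ x + b1)       # intermediate concepts emerge in this layer
    return float(np.tanh(w2 @ hidden + b2))

# illustrative random initialization: 64 input features, 20 hidden units
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0.0, 0.1, (20, 64)), np.zeros(20)
w2, b2 = rng.normal(0.0, 0.1, 20), 0.0
```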

Typically, the learning of the hidden layers is completely unconstrained, and it is unclear which concepts will evolve. In fact, different runs on the same data may produce different concepts in the hidden layers (depending on the random initialization). Sometimes, however, it can be advantageous to impose some constraints on the network. For example, Schraudolph, Dayan, and Sejnowski (1994) report that they managed to increase the performance of their Go-playing network by choosing an appropriate network architecture that reflects various symmetries and translation-invariant properties of the Go board. Similarly, Leouski and Utgoff (1996) tried to exploit symmetries in the game of Othello by sharing the weights between eight sectors of the board.

The disadvantage of the features constructed in the hidden layers of neural networks is that they are not immediately interpretable. Several authors have worked on alternative approaches that attempt to create symbolic descriptions of new features. Fawcett and Utgoff (1992) discuss the ZENITH system, which automatically constructs features for a linear evaluation function for Othello. Each feature is represented as a formula in first-order predicate calculus. New features can be derived by decomposition, abstraction, regression, and specialization of old features. Originally, the program has only knowledge of the goal concept and the definitions of the move operators, but it is able to derive several interesting features from them, including a feature for predicting future piece mobility, which seems to be new in the literature (Fawcett and Utgoff 1992).


ELF (Utgoff and Precup 1998) constructs new features as conjunctions of a set of basic, Boolean evaluation terms. ELF starts with the most general feature that has don't-care values for all basic evaluation terms and therefore covers all game situations. It continuously adjusts the weight for each feature so that it minimizes its error on the incoming labelled training examples. It also keeps track of which of the evaluation terms with don't-care values are most frequently present when such adjustments happen. ELF can remove these sources of error by spawning a new feature that is identical to its parent but also specifies that the evaluation term that contributed most to the error must not be present. Newly generated features start with a weight of 0, and features whose weight stays near 0 for a certain amount of time will be removed. The approach was used to learn to play checkers from self-play with limited success.

A similar system, the generalized linear evaluation model GLEM (Buro 1998), constructs a large number of simple Boolean combinations of basic evaluation terms. GLEM simply generates all feature combinations that occur with a user-specified minimum frequency, which is used to avoid overfitting of the training positions. For this purpose, GLEM uses an efficient algorithm that is quite similar to the well-known APRIORI data mining algorithm for generating frequent item sets in basket analysis problems (Agrawal et al. 1995). Contrary to ELF, where both the weights and the feature set are updated on-line and simultaneously, the features are constructed in batch and their weights are determined in a second pass by linear regression. The approach worked quite well for the Othello program LOGISTELLO, but Buro (1998) points out that it is particularly suited for Othello, where important concepts can be expressed as Boolean combinations of the location of the pieces on the board. For other games, where one has to use more complicated atomic functions, this approach has not yet been tested.
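The feature-construction step can be illustrated with the following sketch, which enumerates conjunctions of Boolean atoms and keeps those exceeding a minimum frequency. GLEM itself uses an Apriori-style level-wise algorithm with pruning; the brute-force enumeration below is a deliberate simplification, and the atoms and positions are assumed inputs.

```python
from itertools import combinations

def frequent_conjunctions(positions, atoms, min_frequency=0.05, max_size=3):
    """Keep conjunctions of Boolean evaluation terms that hold in enough positions."""
    kept = []
    for size in range(1, max_size + 1):
        for combo in combinations(atoms, size):
            support = sum(all(atom(p) for atom in combo) for p in positions)
            if support >= min_frequency * len(positions):
                kept.append(combo)
    return kept
# the weights of the kept conjunctions would then be fit in a second pass, e.g. by regression
```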

Utgoff (2001) provides an in-depth description of the problem of automatic feature construction. Some of the techniques for pattern acquisition, which are discussed in more detail in the next section, can also be viewed in this context.

5 Learning Patterns and Plans

It has often been pointed out that game-playing systems can be classified along two dimensions, search and knowledge, which form a trade-off: perfect knowledge needs no search while exhaustive search needs no knowledge. For example, one can build a perfect tic-tac-toe player with a look-up table for all positions (no search) or with a fast alpha-beta search that searches each position until the outcome is determined (no knowledge). Clearly, both extremes are infeasible for complex games, and practical solutions have to use a little bit of both.

However, the region on this continuum which represents the knowledge-search combination that humans use, the so-called human window, is different from the region in which game playing programs excel (Clarke 1977; Michie 1982; Michie 1983; van den Herik 1988; Winkler and Fürnkranz 1998). Strong human players rely on deep knowledge, while game-playing programs typically use a lot of search. Even TD-GAMMON, whose strength is mostly due to the (learned) positional knowledge encoded in its evaluation function, investigates more moves than a human backgammon player.

The main problem seems to be the appropriate definition of background knowledge, i.e., the definition of the set of basic concepts or patterns that the program can use to formulate new concepts. Although many games have a rich vocabulary that describes important aspects of the game, these are often hard to define and even harder to formalize. In chess, for example, concepts like passed pawn, skewer, or kings' opposition are more or less common knowledge. However, even simple patterns like a knight fork are non-trivial to formalize.12 Other games share similar problems. In Go, for example, it is very hard to describe long-distance influences between different groups of stones that are quite obvious to human players (e.g., a ladder). Obviously, if you cannot describe a concept in a given hypothesis language you cannot learn it. Nevertheless, there have been several attempts to build game-playing programs that rely mainly on pattern knowledge. We will review a few of them in this section.

5.1 Advice-taking

The learning technique that requires the least initiative by the learner is learning by taking advice. In this framework, the user is able to communicate abstract concepts and goals to the program. In the simplest case, the provided advice can be directly mapped onto the program's internal concept representation formalism. One such example is Waterman's poker player (Waterman 1970). Among other learning techniques, it provides the user the facility to directly add production rules to the game-playing program. Another classic example of such an approach is the work by Zobrist and Carlson (1973), in which a chess tutor could provide the program with a library of useful patterns using a chess programming language that looked a lot like assembly language. Many formalisms have since been developed in the same spirit (Bratko and Michie 1980; George and Schaeffer 1990; Michie and Bratko 1991), most of them limited to endgames, but some also addressing the full game (Levinson and Snyder 1993; Donninger 1996). Other authors have investigated similar techniques for incorporating long-term planning goals into search procedures (Kaindl 1982; Opdahl and Tessem 1994) or for specifying tactical patterns to focus search (Wilkins 1982).

While in the above-mentioned approaches the tutoring process is often more or less equivalent to programming in a high-level game programming language, advice-taking programs typically have to devote considerable effort to compiling the provided advice into their own pattern language. Thus they enable the user to communicate with the program in a very intuitive way that does not require any knowledge about the implementation of the program nor about programming in general.

The most prominent example of such an approach is the work by Mostow (1981). He has developed a system that is able to translate abstract pieces of advice in the card game Hearts into operational knowledge that can be understood and directly accessed by the machine. For example, the user can specify the hint "avoid taking points" and the program is able to translate this piece of advice into a simple, heuristic search procedure that determines the card that is likely to take the least number of points (Mostow 1983). However, his system is not actually able to play a game of Hearts. In particular, his architecture lacks a technique for evaluating and combining the different pieces of advice that might be applicable to a given game situation. In recent work, Fürnkranz, Pfahringer, Kaindl, and Kramer (2000) made a first step into that direction by identifying two different types of advice: state abstraction advice (advice that points the system to important features of the game, e.g., determine whether the queen is already out) and move selection advice (advice that suggests potentially good moves, e.g., flush the queen). They also proposed a straight-forward reinforcement learning approach for learning a mapping of state abstractions to move selections. Their architecture is quite similar to Waterman's poker player, which learns to prioritize action rules that operate on interpretation rules (Section 5.3). The use of weights for combining overlapping and possibly contradictory pieces of advice is also similar to HOYLE (Epstein 1992), where several Advisors are able to comment positively or negatively on the available set of moves. Initially, all of the Advisors are domain-independent and hand-coded, but HOYLE is able to autonomously acquire appropriate weights for the Advisors for the current games and to learn certain types of novel pattern-based Advisors (cf. Section 5.2 and, in particular, (Epstein 2001)).

12 The basic pattern for a knight fork is a knight threatening two pieces, thereby winning one of them. In the endgame, these might simply be two unprotected pawns (unless one of them protects the other). In the middlegame, these are typically higher-valued pieces (protected or not). However, this definition might not work if the forking knight is attacked but not protected, or even pinned. But then again, perhaps the attacking piece is pinned as well. Or the pinned knight can give a discovered check. . .

Similar in spirit to Mostow's work described above, Donninger (1996) introduces a very efficient interpreter of an extensible language for expressing chess patterns. Its goal is to allow the user to formulate simple pieces of advice for NIMZO, one of the most competitive chess programs available for the PC. The major difference from Mostow's work is the fact that people are not using language to express their advice, but can rely on an intuitive graphical user interface. The patterns and associated advice are then compiled into a byte code that can be interpreted at run-time (once for every root position to award bonuses for moves that follow some advice). One of the shortcomings of the system is that it is mostly limited to expressing patterns in a propositional language, which does not allow one to express relationships between pieces on the board (e.g., "put the rook behind the passed pawn").

A third approach that has similar goals is the use of so-called structured induction (Shapiro 1987). The basic idea is to induce patterns that are not only reliable but also simple and comprehensible. This is achieved by enabling the interaction of a domain expert with a learning algorithm: the expert focuses the learner on relevant portions of the search space by defining new patterns that break up the problem into smaller, understandable sub-problems, and automated induction is then used for each of them. We will discuss this approach in more detail in Section 5.5.

5.2 Cognitive models

Psychological studies have shown that the differences in playing strengths between chess experts and novices are not so much due to differences in the ability to calculate long move sequences, but to which moves they start to calculate (de Groot 1965; Chase and Simon 1973; Holding 1985; de Groot and Gobet 1996; Gobet and Simon 2001). For this pre-selection of moves chess players make use of patterns and accompanying promising moves and plans. Simon and Gilmartin (1973) estimate the number of a chess expert's patterns to be of the order of 10,000 to 100,000. Similar results have been found for other games (Reitman 1976; Engle and Bukstel 1978; Wolff et al. 1984). On the other hand, most strong computer chess playing systems rely on a few hundred hand-coded patterns, typically the evaluation function terms, and attempt to counterbalance this deficiency by searching orders of magnitude more positions than a human chess player does. In games like Go, where the search space is too big for such an approach, computer players make only slow progress.

Naturally, several researchers have tried to base their game-playing programs on findings in cognitive science. In particular, early AI research has concentrated on the simulation of aspects of the problem-solving process that are closely related to memory, like perception (Simon and Barenfeld 1969) or retrieval (Simon and Gilmartin 1973). Several authors have more or less explicitly relied on the chunking hypothesis to design game-playing programs by providing them with access to a manually constructed database of patterns and chunks (Wilkins 1980; Wilkins 1982; Berliner and Campbell 1984; George and Schaeffer 1990). Naturally, attention has also focussed on automatically acquiring such chunk libraries.

Some of the early ideas on perception and chunk retrieval were expanded and integrated into CHREST (Gobet 1993), arguably the most advanced computational model of a chess player's memory organization. While CHREST mainly served the purpose of providing a computational model that is able to explain and reproduce the psychological findings, CHUMP (Gobet and Jansen 1994) is a variant of this program that is actually able to play a game. CHUMP plays by retrieving moves that it has previously associated with certain chunks in the program's pattern memory. It uses an eye-movement simulator similar to PERCEIVER (Simon and Barenfeld 1969) to scan the board into a fixed number (20) of meaningful chunks. New chunks will be added to a discrimination net (Simon and Gilmartin 1973) which is based on the EPAM model of memory and perception (Feigenbaum 1961). During the learning phase the move that has been played in the position is added into a different discrimination net and is linked to the chunk that contains the moved piece at its original location. In the playing phase the retrieved patterns are checked for associated moves (only about 10% of the chunks have associated moves). The move that is suggested by the most chunks will be played. Experimental evaluation (on a collection of games by Mikhail Tal and in the KQKR ending) was able to demonstrate that the correct move was often in the set of moves that were retrieved by CHUMP, although only rarely with the highest weight. One problem seems to be that the number of learned chunks increases almost linearly with the number of seen positions, while the percentage of chunks that have an associated move remains constant over time. This seems to indicate that CHUMP re-uses only few patterns, while it continuously generates new patterns.
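The move-selection step of such a chunk-based player reduces to a simple vote among the chunks recognized in the current position; the sketch below illustrates only this voting step (the chunk recognition and the discrimination nets themselves are assumed to exist elsewhere).

```python
from collections import Counter

def chunk_vote_move(recognized_chunks, chunk_to_move):
    """Play the move suggested by the most recognized chunks (None if no chunk votes)."""
    votes = Counter(chunk_to_move[c] for c in recognized_chunks if c in chunk_to_move)
    return votes.most_common(1)[0][0] if votes else None
```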

TAL (Flinter and Keane 1995) is a similar system which also uses a library of chunks, which has also been acquired from a selection of Tal's games, for restricting the number of moves considered. It differs in the details of the representation of the chunks and the implementation of their retrieval. Here, the authors observed a seemingly logarithmic relationship between the frequency of a chunk and the number of occurrences of chunks with that frequency. This seems to confirm the above hypothesis. Walczak (1991) suggests the use of a chunk library for predicting the moves of a (human) opponent. We will discuss his approach in Section 6.

CASTLE (Krulwich 1993) is based on Roger Schank's cognitive theory of explanation patterns (Schank 1986) and its computational model, case-based reasoning (Riesbeck and Schank 1989). In particular, it is based on the case-based planning model (Hammond 1989), in which plans that have been generalized from past experiences are re-used and, whenever they fail, continuously debugged. CASTLE consists of several modules (Krulwich et al. 1995). The threat detection component is a set of condition-action rules, each rule being specialized for recognizing a particular type of threat in the current focus of attention, which is determined by another rule-based component. A counter-planning module analyzes the discovered threats and attempts to find counter-measures. CASTLE invokes learning whenever it encounters an explanation failure, e.g. when it turns out that the threat detection and the counter-planning components have failed to detect or prevent a threat. In such a case CASTLE uses a self-model to analyze which of its components is responsible for the failure, tries to find an explanation of the failure, and then employs explanation-based learning to generalize this explanation (Collins et al. 1993). CASTLE has demonstrated that it can successfully learn plans for interposition, discovered attacks, forks, pins and other short-term tactical threats.

Kerner (1995) also describes a case-based reasoning program based on Schank's explanation patterns (Schank 1986). The basic knowledge consists of a hierarchy of strategic chess concepts (like backward pawns etc.) that has been compiled by an expert. Indexed with each concept is an explanation pattern (XP), roughly speaking a variabilized plan that can be instantiated with the parameters of the current position. When the system encounters a new position it retrieves XPs that involve the same concept and adapts the plan for the previous case to the new position using generalization, specialization, replacement, insertion and deletion operators. If the resulting XP is approved by a chess expert it will be stored in the case base.

HOYLE (Epstein 1992) is a game-playing system based on FORR, a general architecture for problem solving and learning (Epstein 1994a). It is able to learn to play a variety of simple two-person, perfect-information games. HOYLE relies on several hierarchically organized Advisors which are able to vote for or against some of the available moves. The Advisors are hand-coded, but game-independent. The importance of the individual Advisors is learned with an efficient, supervised weight-tuning algorithm (Epstein 1994b; Epstein 2001). A special Advisor, PATSY, is able to make use of an automatically acquired chunk library, and comments in favor of patterns that are associated with wins and against patterns that are associated with losses. These chunks are acquired using a collection of so-called spatial templates, a meta-language that allows one to specify which subparts of the current board configuration are interesting to be considered as pattern candidates. Patterns that occur frequently during play are retained and associated with the outcome of the game (Epstein et al. 1996). HOYLE is also able to generalize these patterns into separate, pattern-oriented Advisors. Similar to PATSY, another Advisor, ZONE RANGER, may support moves to a position whose zones have positive associations, where a zone is defined as a set of locations that can be reached in a fixed number of moves. The patterns and zones used by PATSY and ZONE RANGER are attempts to capture and model visual perception. There is also some empirical evidence that HOYLE exhibits similar playing and learning behavior to human game players (Rattermann and Epstein 1995). A detailed description of HOYLE can be found in (Epstein 2001).

5.3 Pattern-based learning systems

There are a variety of other pattern-based playing systems that are not explicitly based on results in cognitive science. Quite well-known is PARADISE (Wilkins 1980, 1982), a pattern-based expert system that is able to solve combinations in chess with a minimum amount of search. It does not use learning, but there are a few other systems that rely upon the automatic formation of understandable pattern and rule databases that capture important knowledge for game playing.

The classic example is Waterman's poker player (Waterman 1970). It has two types of rules, interpretation rules which map game states to abstract descriptors (by discretizing numerical values) and action rules which map state descriptors to actions. The program can learn these rules by accepting advice from the user or from a hand-coded program (cf. Section 5.1) or it can automatically construct rules while playing. Each hand played is rated by an oracle that provides the action that can be proved to be best in retrospect (when all cards are known). After the correct action has been determined, interpretation rules are adjusted so that they map the current state to a state descriptor that matches a user-provided so-called training rule, which is associated with the correct action. The program then examines its action rule base to discover the rule that made the incorrect decision. The mistake is corrected by specializing the incorrect rule, by generalizing one of the rules that are ordered before it, or by inserting the training rule before it. During play, the program simply runs through the ordered list of action rules and performs the action recommended by the first rule matching the state description that has been derived with the current set of interpretation rules. In subsequent work, Smith (1983) used the same representation formalism and the same knowledge base for evolving poker players with a genetic algorithm.
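The decision cycle of such a two-level rule system is easy to make concrete; the following sketch only illustrates the interpretation/action split and the first-matching-rule policy, with the rule contents and the default action being hypothetical placeholders.

```python
def interpret_state(state, interpretation_rules):
    """Map a raw game state to abstract descriptors (e.g., discretized hand strength)."""
    return {name: rule(state) for name, rule in interpretation_rules.items()}

def choose_action(state, interpretation_rules, action_rules, default="fold"):
    """Scan the ordered action rules and perform the action of the first matching rule."""
    descriptors = interpret_state(state, interpretation_rules)
    for condition, action in action_rules:        # action_rules is an ordered list
        if condition(descriptors):
            return action
    return default                                # hypothetical fallback when nothing matches
```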

Another interesting system is the chess program MORPH (Levinson and Snyder 1991; Gould and Levinson 1994). Its basic representational entities are pattern-weight pairs (pws). Patterns are represented as conceptual position graphs, the edges being attack relationships and the nodes being pieces and important squares. Learned patterns can be generalized by extracting the common subtree of two patterns with similar weights or specialized by adding nodes and edges to the graph. Different patterns can be coupled using reverse engineering, so that they result in a chain of actions. The pattern weights are adjusted by a variant of temporal-difference learning, in which each pattern has its own learning rate, which is set by simulated annealing (i.e., the more frequently a pattern is updated, the slower its learning rate becomes). MORPH evaluates a move by combining the weights of all matched patterns into a single evaluation function value and selecting the move with the best value, i.e., it only performs a 1-ply look-ahead. The system has been tested against GNUCHESS and was able to beat it occasionally. The same architecture has also been applied to games other than chess (Levinson et al. 1992).
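The 1-ply, pattern-weight style of move selection can be sketched as follows; note that MORPH's actual weight-combination function is more elaborate, so the simple mean used here is an assumption for illustration, as are the helper functions.

```python
def pattern_weight_choice(position, legal_moves, apply_move, patterns):
    """Pick the move whose successor position gets the best combined pattern weight."""
    def evaluate(pos):
        matched = [weight for matches, weight in patterns if matches(pos)]
        return sum(matched) / len(matched) if matched else 0.0   # simplistic combination
    return max(legal_moves, key=lambda m: evaluate(apply_move(position, m)))
```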

Kojima, Ueda, and Nagano (1997) discuss an approach for learning a set of rules for solving tsume-go problems. These are induced with an evolutionary algorithm, which was set up to reinforce rules that predict a move in a database. New rules are constructed for situations in which no other rule matches the move played in the current position. Rules with high reliability can spawn children that specialize the rule by adding one condition and inherit some of the parent's activation score. Kojima and Yoshikawa (2001) showed that this approach is able to outperform patterns with fixed receptive fields, similar to those that are often used in Go programs to reduce the complexity of the input (see e.g., Stoutamire 1991; Schraudolph et al. 1994; or Dahl 2001).

5.4 Explanation-based learning

Explanation-based learning (EBL) (Mitchell et al. 1986) has been devised as a procedure that is able to learn patterns in domains with a rich domain theory. The basic idea is that the domain theory can be used to find an explanation for a given example move. Generalizing this explanation yields a pattern, to which the learner typically attaches the move that has been played. In the future, the program will play this (type of) move in situations where the pattern applies, i.e., in situations that share those features that are necessary for the applicability of this particular explanation. In game-playing domains, it is typically easy to represent the rules of the game as a formal domain theory. Hence, several authors have investigated EBL approaches that learn useful patterns from a formalization of the rules of the game.

A very early explanation-based learning approach to chess has been demonstrated by Pitrat (1976a, 1976b). His program is able to learn definitions for simple chess concepts like skewer or fork. The concepts are represented as generalized move trees, so-called procedures. To understand a move the program tries to construct a move tree using already learned procedures. This tree is then simplified (unnecessary moves are removed) and generalized (squares and pieces are replaced with variables according to some predefined rules), in order to produce a new procedure that is applicable to a variety of similar situations.

Pitrat’sprogramsuccessfullydiscovereda varietyof usefultacticalpatterns.How-ever, it turnedout that the size of the move treesgrows too fast oncethe programhaslearneda significantnumberof patternsof varying degreesof utility. This util-ity problemis a widely acknowledgeddifficulty for explanation-basedlearningalgo-rithms(Minton 1990)andpattern-learningalgorithmsin general.Interestingly, it hasled Pitrat to the conclusionthat the learningmoduleworks well, but a moresophis-ticatedmethodfor dealingwith the many proceduresdiscoveredduring learningisneeded(Pitrat1977).

Minton (1984) reports the same problem of learning too many too specialized rules with explanation-based learning for the game of Go-moku. His program performs a backward analysis after the opponent has achieved its goal (to have five stones in a row). It then identifies the pattern that has triggered the rule which has established that the opponent has achieved a goal. From this pattern all features that have been added by the opponent's last move are removed and all conditions that have been necessary to enable the opponent's last move are added. Thus the program derives a pattern that allows it to recognize the danger before the opponent has made the devastating move. If the program's own last move has been forced, the pattern is again changed by deleting the effects of this move and adding the prerequisites for this move. This process of pattern regression is continued until one of the program's moves has not been forced. The remaining pattern will then form the condition part of a rule that recognizes the danger before the forced sequence of moves that has achieved the opponent's goal.

However, the limitation to tactical combinations, where each of the opponent's moves is forced, is too restrictive to discover interesting patterns. Consequently, several authors have tried to generalize these techniques. Puget (1987), e.g., simply made the assumption that all the opponent's moves are optimal (he has reached a goal after all), but presents a technique that is able to include alternative moves when regressing patterns over the program's own moves. However, his technique had to sacrifice sufficiency, i.e., it could no longer be guaranteed that the move associated with the pattern will actually lead to a forced win. Lazy explanation-based learning (Tadepalli 1989) does this even more radically by learning over-general, overly optimistic plans, which are generated by simply not checking whether the moves that led to the achievement of a goal were forced. If one of these optimistic plans predicts that its application should achieve a certain goal, but it is not achieved in the current game, the opponent's refutation is generalized to a counter-plan that is indexed with the original, optimistic plan. Thus, the next time this plan is invoked, the program is able to foresee this refutation. Technically, this is quite similar to the case-based planning approach discussed in Section 5.2.

Another problem with such an approach is that in games like chess the ultimate goal (checkmate) is typically too distant to be of practical relevance. Consequently, Flann (1989) describes PLACE, a program that reasons with automatically acquired abstractions in the form of new subgoals and operators that maintain, destroy or achieve such a goal. Complex expressions for the achievement of multiple goals are analyzed by the program and compiled by performing an exhaustive case analysis (Flann 1990, 1992), which is able to generate geometrical constraints that can be used as a recognition pattern for the abstract concept.

People have also looked for ways of combining EBL with other learning techniques. Flann and Dietterich (1989) demonstrate methods for augmenting EBL with a similarity-based induction module that is able to inductively learn concepts from multiple explanations of several examples. In a first step, the common subtree of the explanations for each of the examples is computed and generalized with EBL. In a second phase this general concept is inductively specialized by replacing some variables with constants that are common in all examples. Dietterich and Flann (1997) remarked that explanation-based and reinforcement learning are actually quite similar in the way both propagate information from the goal states backward. Consequently, they propose an algorithm that integrates the generalization ability of explanation-based goal regression into a conventional reinforcement learning algorithm. Among others, they demonstrate a successful application of the technique to the KRK chess endgame. Flann (1992) shows similar results on other endings in chess and checkers.

However, because of the above-mentioned problems, most importantly the utility problem and the need for a domain theory that contains more information than a mere representation of the rules of the game, interest in explanation-based learning techniques has faded. Recently, it has been rejuvenated for the game of Go, when Kojima (1995) proposed to use EBL techniques for recognizing forced moves. This technique is described in more detail in (Kojima and Yoshikawa 2001). A related technique is used in the Go program GOGOL for acquiring rules from tactical goals that are passed to the program in the form of meta-programs (Cazenave 1998).

5.5 Pattern induction

In addition to databases of common openings and huge game collections, which have, so far, been mostly used for the tuning of evaluation functions (Section 4) or the automatic generation of opening books (Section 2), for many games complete subgames have already been solved, and databases are available in which the game-theoretic value of positions of these subgames can be looked up. In chess, for example, all endgames with up to five pieces and some six-piece endgames have been solved (Thompson 1996). Similar endgame databases have been built for other games like checkers (Lake et al. 1994). Some games are even solved completely, i.e., all possible positions have been evaluated and the game-theoretic value of the starting position has been determined (Allis 1994; Gasser 1995). Many of these databases are readily available, and some of them (in the domains of chess, connect-4 and tic-tac-toe) are commonly used as benchmark problems for evaluating machine-learning algorithms.13 For some of them, not only the raw position information but also derived features, which capture valuable information (similar to terms in evaluation functions), are made available in order to facilitate the formulation of useful concepts.

The earliest work on learning from a chess database is reported by Michalski and Negri (1977), who applied the inductive rule learning algorithm AQ (Michalski 1969) to the KPK database described by Clarke (1977). The positions in the database were coded into 17 attributes describing important relations between the pieces. The goal was to learn a set of rules that are able to classify a KPK position as win or draw. After learning from only 250 training positions, a set of rules was found that achieved 80% predictive accuracy on this task, i.e., it was able to correctly classify 80% of the remaining positions.

Quinlan (1979) describes several experiments for learning rules for the KRKN endgame. In (Quinlan 1983), he used his decision tree learning algorithm ID3 to discover recognition rules for positions of the KRKN endgame that are lost in 2 ply as well as for those that are lost in 3 ply. From less than 10% of the possible KRKN positions, ID3 was able to derive a tree that committed only 2 errors on a test set of 10,000 randomly chosen positions (these errors were later corrected by Verhoef and Wesselius (1987)). Quinlan noted that this achievement was only possible with a careful choice of the attributes that were used to represent the positions. Finding the right set of attributes for the lost-in-2-ply task required three weeks. Adapting this set to the slightly different task of learning lost-in-3-ply positions took almost two months. Thus for the lost-in-4-ply task, which he intended to tackle next, Quinlan experimented with methods for automating the discovery of new useful attributes. However, no results from this endeavor have been published.

A severe problem with this and similar experiments is that, although the learned decision trees can be shown to be correct and faster in classification than exhaustive search algorithms, they are typically also incomprehensible to chess experts. Shapiro and Niblett (1982) tried to alleviate this problem by decomposing the learning task into a hierarchy of smaller sub-problems that could be tackled independently. A set of rules is induced for each of the sub-problems, which together yield a more understandable result. This process of structured induction has been employed to learn correct classification procedures for the KPK and the KPa7KR endgames (Shapiro 1987). An endgame expert helped to structure the search space and to design the relevant attributes. The rules for the KPa7KR endgames were generated without the use of a database as an oracle. The rules were interactively refined by the expert, who could specify new training examples and suggest new attributes if the available attributes were not able to discriminate between some of the positions. This rule-debugging process was aided by a self-commenting facility that displayed traces of the classification rules in plain English (Shapiro and Michie 1986). A similar semi-autonomous process for refining the attribute set was used by Weill (1994) to generate decision trees for the KQKQ endgame.

13 They are available from the UCI repository of machine learning databases.

However, the problem of decomposing the search space into easily manageable subproblems is a task that also required extensive collaboration with a human expert. Consequently, there have been several attempts to automate this process. Paterson (1983) tried to automatically structure the KPK endgame using a clustering algorithm. The results were negative, as the found hierarchy had no meaning to human experts. Muggleton (1990) applied the rule learning algorithm DUCE to the KPa7KR task studied by Shapiro (1987). DUCE is able to autonomously suggest high-level concepts to the user, similar to some of the approaches discussed in Section 4.5.4. It looks for common patterns in the rule base (which initially consists of a set of rules each describing one example board position) and tries to reduce its size by replacing the found patterns with new concepts. In machine learning the autonomous introduction of new concepts during the learning phase is commonly known as constructive induction (Matheus 1989). Using constructive induction, DUCE reduces the role of the chess expert to a mere evaluator of the suggested concepts instead of an inventor of new concepts. DUCE structured the KPa7KR task into a hierarchy of 13 concepts defined by a total of 553 rules. Shapiro's solution, however, consisted of 9 concepts with only 225 rules. Nevertheless, DUCE's solution was found to be meaningful for a chess expert.

Most of these systems are evaluated with respect to their predictive accuracy, i.e., their ability to generalize from a few examples to new examples. Although these results are quite interesting from a machine-learning point of view, their utility for game-playing is less obvious. In game playing, one is typically not interested in generalizations from a few examples when a complete database is already available. However, it would be a valuable research project to simply compress the available databases into a few (hundred) powerful classification rules that are able to reproduce the classification of the positions stored in the database. Currently, I/O costs are often prohibitive, so that during regular searches an endgame-database lookup is only feasible if the entire database can be kept in main memory. Access to external data (e.g., a database stored on a CD-ROM) is only feasible at the root of the search. Various efficient compression techniques have been proposed for this purpose (Heinz 1999a, 1999b). However, if one only has to store a set of classification rules, perhaps with a few exceptions, the memory requirements could be significantly reduced. Again, the main obstacle to this endeavor is not the lack of suitable machine-learning algorithms (nowadays ML algorithms are fit for addressing hundreds of thousands of examples and can employ powerful subsampling techniques for larger databases; see (Bain and Srinivasan 1995) for some preliminary experiments on compressing distance-to-mate information in the KRK endgame), but the lack of availability of appropriate background knowledge for these databases. If the learning algorithms are not given the right features that are useful for formulating the concepts, they cannot be expected to generalize well.

5.6 Learning playing strategies

We define a playing strategy as a simple, interpretable procedure for (successfully) playing a game or subgame. Learning such a strategy is significantly harder than learning to classify positions as won or lost. Consider the case of a simple chess endgame, like KRK. It is quite trivial to learn a function that recognizes won positions (basically these are all positions where Black cannot capture White's rook and is not stalemate). However, avoiding drawing variations by keeping one's rook safe is not equivalent to making progress towards mating one's opponent. This problem of having to choose between moves that all look equally good has been termed the mesa effect (Minsky 1963).

Bain, for example, has tried to tackle this problem by learning an optimal player from a perfect database for the KRK endgame (Bain 1994; Bain and Srinivasan 1995). He tried to learn rules that predict the number of plies to a win with optimal play on both sides in the KRK endgame. Such a predictor could be used for a simple playing strategy that simply picks the move that promises the shortest distance to win. He found that it was quite easy to learn rules for discriminating drawn positions from won positions, but that the task got harder the longer the distance to win was. Discriminating positions that are won in 12 plies from positions that are won in 13 plies proved to be considerably harder, at least with the simple geometric concepts that he used as background knowledge for the learner.
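The resulting playing strategy itself is simple; the sketch below illustrates it under the assumption that a distance-to-win predictor (or a perfect database lookup) and the usual move-generation helpers are available.

```python
def shortest_win_policy(position, legal_moves, apply_move, distance_to_win):
    """Among the moves that preserve the win, pick the successor with the
    smallest predicted distance to win; None is taken to mean 'not won'."""
    winning = [(m, distance_to_win(apply_move(position, m))) for m in legal_moves]
    winning = [(m, d) for m, d in winning if d is not None]
    if not winning:
        return None                      # no winning move available
    return min(winning, key=lambda md: md[1])[0]
```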

On the other hand, human players are able to follow a very simple, albeit suboptimal strategy to win the KRK chess endgame. Every chess novice quickly learns to win this endgame. An automaton that could play this endgame (albeit only from a fixed starting position) was already constructed at the beginning of the twentieth century by the Spanish engineer Leonardo Torres y Quevedo (Vigneron 1914). Thus, it seems to be worthwhile to look for simple patterns that allow a player to reliably win the endgame, though not necessarily in the optimal way (i.e., with the least number of moves).

PAL (Morales 1991; Morales 1997) uses inductive logic programming techniques to induce complex patterns in first-order logic. It requires a human tutor to provide a few training examples for the concepts to learn, but assists her by generating additional examples with a perturbation algorithm. This algorithm takes the current concept description and generates new positions by changing certain aspects like the side to move, the location of some pieces, etc. These examples have to be evaluated by the user before they are used for refining the learned concepts. PAL's architecture is quite similar to Shapiro's structured induction (Section 5.1), but it is able to induce concepts in a more powerful description language. Morales (1994) demonstrated this flexibility by using PAL to learn a playing strategy for the KRK endgame. For this task it was trained by a chess expert who provided examples for a set of useful intermediate concepts. Another reference system is due to Muggleton (1988). His program uses sequence induction to learn strategies that enable it to play a part of the difficult KBBKN endgame, namely the task of freeing a cornered bishop.
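
Returning to PAL's example-generation step: the following is a rough sketch of such a perturbation algorithm, assuming a toy position representation (a dict of piece squares plus the side to move); the concrete attributes that are perturbed are only illustrative, and the generated positions would still have to be labelled by the human tutor.

```python
import random

def perturb(position, n_examples=10, seed=0):
    """Generate candidate training positions by randomly changing single
    aspects of the current one (side to move, or the square of one piece)."""
    rng = random.Random(seed)
    squares = [(f, r) for f in range(8) for r in range(8)]
    candidates = []
    for _ in range(n_examples):
        new = {"pieces": dict(position["pieces"]),
               "to_move": position["to_move"]}
        if rng.random() < 0.5:
            new["to_move"] = "black" if new["to_move"] == "white" else "white"
        else:
            piece = rng.choice(list(new["pieces"]))
            new["pieces"][piece] = rng.choice(squares)
        candidates.append(new)
    return candidates

example = {"pieces": {"wK": (4, 0), "wR": (0, 0), "bK": (4, 7)}, "to_move": "white"}
print(perturb(example, n_examples=3))
```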

It is no coincidence that all of the above examples are from the domain of chess endgames. The reason for this is that, on the one hand, many perfect chess endgame databases are available, while, on the other hand, these are not well understood by human players. The most famous examples are the attempts of a former Correspondence Chess World Champion and a Canadian Chess Champion to defeat Ken Thompson's perfect KQKR database within 50 moves (Michie 1983; Levy and Newborn 1991), or the attempt of an endgame specialist to defeat a perfect database in the "almost undocumented and very difficult" KBBKN endgame (Roycroft 1988). GM John Nunn's effort to manually extract some of the knowledge that is implicitly contained in these databases resulted in a series of widely acknowledged endgame books (Nunn 1992, 1994b, 1995), but Nunn (1994a) readily admitted that he does not yet understand all aspects of the databases he analyzed. We believe that the discovery of playing strategies for endgame databases is a particularly rewarding topic for further research (Fürnkranz 1997).

6 Opponent Modeling

Opponent modeling is an important, albeit somewhat neglected, research area in game playing. The goal is to improve the capabilities of the machine player by allowing it to adapt to its opponent and exploit his weaknesses. Even if a game-theoretically optimal solution to a game is known, a system that has the capability to model its opponent's behavior may obtain a higher reward. Consider, for example, the game of rock-paper-scissors aka RoShamBo, where either player can expect to win one third of the games (with one third of draws) if both players play their optimal strategies (i.e., randomly select one of their three moves). However, against a player that always plays rock, a player that is able to adapt his strategy to always playing paper can maximize his reward, while a player that sticks with the "optimal" random strategy will still win only one third of the games (Billings 2000b). In fact, there have been recent tournaments on this game in which the game-theoretically optimal player (the random player) only finished in the middle of the pack because it was unable to exploit the weaker players (Billings 2000a; Egnor 2000).
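
The rock-paper-scissors argument can be illustrated with a few lines of code: a simple frequency-counting player exploits an always-rock opponent, while a player following the game-theoretically optimal random strategy does not. The opponent and the bookkeeping are, of course, toy placeholders.

```python
import random
from collections import Counter

BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def score(a, b):
    """+1 if move a beats move b, -1 if it loses, 0 on a draw."""
    if a == b:
        return 0
    return 1 if BEATS[b] == a else -1

def play(n=1000, seed=0):
    rng = random.Random(seed)
    counts = Counter()                # frequency model of the opponent
    adaptive = optimal = 0
    for _ in range(n):
        opp = "rock"                  # a maximally exploitable opponent
        # adaptive player: counter the opponent's most frequent move so far
        guess = counts.most_common(1)[0][0] if counts else rng.choice(list(BEATS))
        adaptive += score(BEATS[guess], opp)
        # "optimal" player: uniformly random, ignores the opponent entirely
        optimal += score(rng.choice(list(BEATS)), opp)
        counts[opp] += 1
    return adaptive, optimal

print(play())   # adaptive scores close to +1 per game, random close to 0
```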

For the above reason, opponent modeling has recently become a hot topic in game theory. Work in this area is beyond the scope of this paper (and the author's expertise); we have to refer the reader to (Fudenberg and Levine 1998) or to Al Roth's on-line bibliography. For some AI contributions to game-theoretic problems see, e.g., (Littman 1994; Monderer and Tennenholtz 1997; Brafman and Tennenholtz 1999; Koller and Pfeffer 1997).

Surprisingly, however, opponent modeling has not yet received much attention in the computer-games community. Billings et al. (1998b) see the reason for that in the fact that "in games like chess, opponent modeling is not critical to achieving high performance." While this may be true for some games, it is certainly not true for all games. "In poker, however," they continue, "opponent modeling is essential to success."

Although various other authors have noticed before that minimax search always strives for optimal play against all players, while it could be more rewarding to try to maximize the reward against a fallible opponent, Jansen (1990) was probably the first who explicitly tried to exploit this by investigating the connection of search depth and move selection. His basic idea was to accumulate various statistics about the search (like the number of fail-highs or fail-lows in an alpha-beta search) in order to extract some meta-information about how the search progresses. In particular, he suggested to look for trap and swindle positions, i.e., positions in which one of the opponent's moves looks much better (much worse) than its alternatives, assuming that the opponent's search depth is limited. Trap positions have the potential that the opponent might make a bad move because its refutation is too deep in the search tree, while in a swindle position the player can hope that the opponent overlooks one of his best replies because he does not see the crucial continuation.15 In subsequent work, Jansen (1992b, 1992a, 1993) has investigated the use of heuristics that occasionally suggest suboptimal moves with swindle potential for the weaker side of the KQKR chess endgame. Other researchers proposed techniques to incorporate such opponent models into the system's own search algorithm (Carmel and Markovitch 1993, 1998b; Iida et al. 1994).
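
One possible operational reading of the swindle idea, sketched below under the assumption of a hypothetical search routine evaluate(position, depth): opponent replies that look attractive to a shallow (depth-limited) search but are refuted by a deeper one are flagged as swindle candidates.

```python
def swindle_candidates(position, opponent_moves, apply_move,
                       evaluate, shallow_depth=2, deep_depth=6, margin=1.0):
    """Flag opponent replies that a depth-limited opponent might prefer
    (they look good at shallow_depth) although deeper search refutes them.
    evaluate(pos, depth) is a hypothetical search routine returning a score
    from the opponent's point of view; all other helpers are hypothetical too."""
    candidates = []
    for move in opponent_moves(position):
        successor = apply_move(position, move)
        shallow = evaluate(successor, shallow_depth)
        deep = evaluate(successor, deep_depth)
        if shallow - deep > margin:          # attractive shallow, bad deep
            candidates.append((move, shallow, deep))
    # largest shallow/deep discrepancy first
    return sorted(candidates, key=lambda c: c[1] - c[2], reverse=True)
```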

To our knowledge, Carmel and Markovitch (1993) were the first to try to automatically learn a model of the opponent's move selection process. First, they proposed a simple algorithm for estimating the depth of the opponent's minimax search. To this end, they employed a comparison training technique (see Section 4.2). For each training position, all moves are evaluated by a limited-depth search. The number of moves that are deemed better than the move actually played is subtracted from the score of the current search depth, while the number of moves that are considered equal or worse is added to the score. This procedure is repeated for all search depths up to a certain maximum. The depth achieving the maximum score is returned. Subsequently, the authors also proposed a way for learning the opponent's (linear) evaluation function by supervised learning: for each possible move in the training positions, the algorithm generates a linear constraint that specifies that the (opponent's) evaluation of the played move has to be at least as good as the alternative under consideration (which the opponent chose not to play). A set of weights satisfying these constraints is then found by linear programming. This procedure is repeated until the program makes no significant progress in predicting the opponent's moves on an independent test set. Again, this is applied to a variety of search depths, where for depths greater than 1 the dominant positions of the minimax trees of the two moves are used in the linear constraints.16 Carmel and Markovitch (1993) tested their approach in a checkers player. The opponent model was trained off-line on 800 training positions, and the learned models were evaluated by measuring the accuracy with which they were able to predict the opponent's moves on an independent test set of another 800 examples. In two out of three runs, the learned model achieved perfect accuracy. Subsequently, the authors integrated their learning algorithm into an opponent-model search algorithm, a predecessor of the algorithm described in (Carmel and Markovitch 1998b), where the underlying opponent model was updated after every other game. The learner was soon able to outperform its non-learning opponents.

15 Uiterwijk and van den Herik (1994) suggest to also give a bonus for continuations where the proportion of high-quality moves is relatively low. However, it is questionable whether such an approach will produce a high gain. For example, this approach would also award a bonus for piece trades, where moves that do not complete the trade will have a much lower evaluation.

16 The learning of the evaluation function for each individual search depth is quite similar to many of the approaches described in Section 4. The novel aspect here is the learning of several functions, one for each level of search depth, and their use for opponent modeling.
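
The depth-estimation procedure described above can be summarized in a few lines; all game-specific helpers (minimax_value, legal_moves, apply_move) are hypothetical, and the scoring follows the verbal description rather than the authors' actual code.

```python
def estimate_opponent_depth(training_positions, minimax_value,
                            legal_moves, apply_move, max_depth=6):
    """Score each candidate search depth: for every training position, moves
    that a depth-d search ranks above the move the opponent actually played
    count against depth d, moves ranked equal or worse count in its favour;
    return the best-scoring depth. All helpers are hypothetical."""
    scores = {d: 0 for d in range(1, max_depth + 1)}
    for position, played_move in training_positions:
        for depth in scores:
            played_value = minimax_value(apply_move(position, played_move), depth)
            for move in legal_moves(position):
                value = minimax_value(apply_move(position, move), depth)
                if value > played_value:
                    scores[depth] -= 1
                else:
                    scores[depth] += 1
    return max(scores, key=scores.get)
```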

Later, Carmel and Markovitch (1998a) investigated approaches for learning regular automata that simulate the behavior of their opponents' strategies and applied them to several game-theoretical problems. They were able to show that an agent using their method of inferring a model of the opponent's strategy from its interaction behavior was superior to several non-adaptive and reinforcement-learning agents.

The Inductive Adversary Modeler (IAM) (Walczak 1991; Walczak and Dankel II 1993) makes the assumption that a human player strives for positions that are cognitively simple, i.e., they can be expressed with a few so-called chunks. Whereas a system like CHUMP (Section 5) bases its own move choices upon a chunk library, IAM tries to model its opponent's chunk library. To that end, it learns a pattern database from a collection of its opponent's games. Whenever it is IAM's turn to move, it looks for nearly completed chunks, i.e. for chunks that the opponent could complete with one move, and predicts that the opponent will play this move. Multiple move predictions are resolved heuristically by preferring bigger and more reliable chunks. In the domain of chess, IAM was able to correctly predict more than 10% of the moves in games of several grandmasters including Botwinnik, Karpov, and Kasparov. As IAM does not always predict a move, the percentage of correctly predicted moves when it actually made a prediction was even higher. The author also discussed these approaches for the games of Go (Walczak and Dankel II 1993) and Hex (Walczak 1992). A variant of the algorithm, which is only concerned with the modeling of the opponent's opening book choices, is discussed in (Walczak 1992) and (Walczak 1996).
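
A rough sketch of IAM-style move prediction, under the simplifying assumption that a chunk is just a set of (piece, square) pairs and that a prediction is triggered when a move completes such a chunk; the tie-breaking by chunk size and observed frequency follows the description above, and all helpers are hypothetical.

```python
def predict_opponent_move(position, chunk_library, opponent_moves, apply_move):
    """Predict the opponent's move as the one that completes a chunk from the
    learned library; ties are broken in favour of larger and more frequently
    observed chunks. chunk_library is a list of (frozenset_of_piece_squares,
    frequency) pairs; apply_move returns the successor as a set of
    (piece, square) pairs. Everything here is a hypothetical stand-in."""
    best = None
    for move in opponent_moves(position):
        successor = apply_move(position, move)
        for pattern, frequency in chunk_library:
            if pattern <= successor:              # chunk completed by this move
                key = (len(pattern), frequency)   # prefer bigger, more reliable
                if best is None or key > best[0]:
                    best = (key, move)
    return None if best is None else best[1]      # may decline to predict
```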

As mentioned above, Billings et al. (1998b) have noted that opponent modeling is more crucial to success in poker than it is in other games. There are many different types of poker players, and a good player adapts his strategy to his opponent. For example, trying to call a bluff has a lower chance of winning against a conservative player than against a risky player. Billings et al. (1998b) distinguish between two different opponent modeling techniques: generic models capture general playing patterns (like a betting action should increase the probability that a player holds a stronger hand), while specific models reflect the individual behavior of the opponents. Both techniques are important for updating the weights upon which LOKI's decisions are based.

At the heart of LOKI's architecture is an evaluation function called the Triple Generator, which computes a probability distribution for the next action of a player (fold, check/call, or bet/raise) given his cards and the current state of the game. The other central piece of information is the Weight Table, which, for each possible two-card combination, contains the current estimates that each of the opponents holds this particular hand. Generic modeling is performed after each of the opponent's actions by calling the Triple Generator for each possible hand and multiplying the corresponding entry in the weight table with the probability that corresponds to the action the opponent has taken. Thus, LOKI's beliefs about the probabilities of its opponents' hands are held consistent with their actions. Specific modeling is based on statistics that are kept about the frequency of the opponents' actions in various game contexts. In particular, LOKI computes the median hand value for which an opponent would call a bet and its variance. Hands with estimated values above the sum of mean and variance are not re-weighted (factor = 1.0), while hands below the mean minus the variance are practically ignored (factor ≈ 0.0). Values in between are interpolated. Thus, LOKI uses information gathered from the individual opponents' past behavior to re-adjust their entries in the Weight Table. More details of the computation of the generic and specific opponent models can be found in (Billings et al. 1998b; Billings et al. 2001).
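
The two weight-table updates can be sketched as follows; triple_generator and hand_value are hypothetical stand-ins for LOKI's actual components, and the concrete interpolation constants are illustrative rather than taken from the system.

```python
def generic_update(weight_table, observed_action, triple_generator, game_state):
    """Generic modeling sketch: after an opponent action, scale every candidate
    hand's weight by the probability that the (hypothetical) triple_generator
    assigns to that action for that hand, keeping beliefs consistent with it."""
    for hand in weight_table:
        probs = triple_generator(hand, game_state)  # {'fold': .., 'call': .., 'raise': ..}
        weight_table[hand] *= probs[observed_action]

def specific_reweight(weight_table, hand_value, call_mean, call_var):
    """Specific modeling sketch: hands valued above mean+variance keep their
    weight (factor 1.0), hands below mean-variance are practically ignored
    (factor close to 0), and values in between are linearly interpolated.
    hand_value(hand) is a hypothetical hand-strength evaluator."""
    hi, lo = call_mean + call_var, call_mean - call_var
    for hand in weight_table:
        v = hand_value(hand)
        if v >= hi:
            factor = 1.0
        elif v <= lo:
            factor = 0.01
        else:
            factor = 0.01 + (1.0 - 0.01) * (v - lo) / (hi - lo)
        weight_table[hand] *= factor
```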

7 Conclusion

In this paper, we have surveyed research in machine learning for computer game playing. It is unavoidable that such an overview is somewhat biased by the author's knowledge and interests, and our sincere apologies go to all authors whose work had to be ignored due to our space constraints or ignorance. Nevertheless, we hope that we have provided the reader with a good starting point that is helpful for identifying the relevant works to start one's own investigations.

If there is a conclusion to be drawn from this survey, then it should be that research in game playing poses serious and difficult problems which need to be solved with existing or yet-to-be-developed machine learning techniques. Plus, of course, it's fun...

Acknowledgements

I would like to thank Michael Buro, Susan Epstein, Fernand Gobet, Gerry Tesauro, and Paul Utgoff for thoughtful comments that helped to improve this paper. Special thanks to Miroslav Kubat.

Pointers to on-line versions of many of the cited papers can be found in the on-line Bibliography on Machine Learning in Strategic Game Playing.

The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry of Education, Science and Culture.

References

AGRAWAL, R., H. MANNILA, R. SRIKANT, H. TOIVONEN, & A. I. VERKAMO (1995). Fast discovery of association rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, pp. 307–328. AAAI Press.

AKL, S. G. & M. NEWBORN (1977). The principal continuation and the killer heuristic. In Proceedings of the 5th Annual ACM Computer Science Conference, Seattle, WA, pp. 466–473. ACM Press.


ALLIS, V. (1988, October). A knowledge-based approach of Connect-Four — the game is solved: White wins. Master's thesis, Department of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, The Netherlands.

ALLIS, V. (1994). Searching for Solutions in Games and Artificial Intelligence. Ph.D. thesis, University of Limburg, The Netherlands.

ANGELINE, P. J. & J. B. POLLACK (1994). Competitive environments evolve better solutions for complex tasks. In Proceedings of the 5th International Conference on Genetic Algorithms (GA-93), pp. 264–270.

BAIN, M. (1994). Learning Logical Exceptions in Chess. Ph.D. thesis, Department of Statistics and Modelling Science, University of Strathclyde, Scotland.

BAIN, M. & A. SRINIVASAN (1995). Inductive logic programming with large-scale unstructured data. In K. Furukawa, D. Michie, and S. H. Muggleton (Eds.), Machine Intelligence 14, pp. 233–267. Oxford University Press.

BAXTER, J., A. TRIDGELL, & L. WEAVER (1998a). A chess program that learns by combining TD(lambda) with game-tree search. In Proceedings of the 15th International Conference on Machine Learning (ICML-98), Madison, WI, pp. 28–36. Morgan Kaufmann.

BAXTER, J., A. TRIDGELL, & L. WEAVER (1998b). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal 21(2), 84–99.

BAXTER, J., A. TRIDGELL, & L. WEAVER (2000, September). Learning to play chess using temporal differences. Machine Learning 40(3), 243–263.

BAXTER, J., A. TRIDGELL, & L. WEAVER (2001). Reinforcement learning and chess. In J. Fürnkranz and M. Kubat (Eds.), Machines that Learn to Play Games, Chapter 5, pp. 91–116. Huntington, NY: Nova Science Publishers. To appear.

BEAL, D. F. & M. C. SMITH (1997, September). Learning piece values using temporal difference learning. International Computer Chess Association Journal 20(3), 147–151.

BEAL, D. F. & M. C. SMITH (1998). First results from using temporal difference learning in Shogi. In H. J. van den Herik and H. Iida (Eds.), Proceedings of the First International Conference on Computers and Games (CG-98), Volume 1558 of Lecture Notes in Computer Science, Tsukuba, Japan, pp. 113. Springer-Verlag.

BERLINER, H. (1979). On the construction of evaluation functions for large domains. In Proceedings of the 6th International Joint Conference on Artificial Intelligence (IJCAI-79), pp. 53–55.

BERLINER, H. (1984). Search vs. knowledge: An analysis from the domain of games. In A. Elithorn and R. Banerji (Eds.), Artificial and Human Intelligence. New York, NY: Elsevier.

BERLINER, H. & M. S. CAMPBELL (1984). Using chunking to solve chess pawn endgames. Artificial Intelligence 23, 97–120.


BERLINER, H., G. GOETSCH, M. S. CAMPBELL, & C. EBELING (1990). Measuring the performance potential of chess programs. Artificial Intelligence 43, 7–21.

BHANDARI, I., E. COLET, J. PARKER, Z. PINES, R. PRATAP, & K. RAMANUJAM (1997). Advanced Scout: Data mining and knowledge discovery in NBA data. Data Mining and Knowledge Discovery 1, 121–125.

BILLINGS, D. (2000a, March). The first international RoShamBo programming competition. International Computer Games Association Journal 23(1), 42–50.

BILLINGS, D. (2000b, March). Thoughts on RoShamBo. International Computer Games Association Journal 23(1), 3–8.

BILLINGS, D., D. PAPP, J. SCHAEFFER, & D. SZAFRON (1998b). Opponent modeling in poker. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), Madison, WI, pp. 493–498. AAAI Press.

BILLINGS, D., D. PAPP, J. SCHAEFFER, & D. SZAFRON (1998a). Poker as a testbed for machine intelligence research. In R. E. Mercer and E. Neufeld (Eds.), Advances in Artificial Intelligence (AI-98), Vancouver, Canada, pp. 228–238. Springer-Verlag.

BILLINGS, D., L. PENA, J. SCHAEFFER, & D. SZAFRON (1999). Using probabilistic knowledge and simulation to play poker. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99), pp. 697–703.

BILLINGS, D., L. PENA, J. SCHAEFFER, & D. SZAFRON (2001). Learning to play strong poker. In J. Fürnkranz and M. Kubat (Eds.), Machines that Learn to Play Games, Chapter 11, pp. 225–242. Huntington, NY: Nova Science Publishers. To appear.

BISHOP, C. M. (1995). Neural Networks for Pattern Recognition. Oxford, UK: Clarendon Press.

BOYAN, J. A. (1992). Modular neural networks for learning context-dependent game strategies. Master's thesis, University of Cambridge, Department of Engineering and Computer Laboratory.

BRAFMAN, R. I. & M. TENNENHOLTZ (1999). A near-optimal polynomial time algorithm for learning in stochastic games. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99), pp. 734–739.

BRATKO, I. & R. KING (1994). Applications of inductive logic programming. SIGART Bulletin 5(1), 43–49.

BRATKO, I. & D. MICHIE (1980). A representation of pattern-knowledge in chess endgames. In M. Clarke (Ed.), Advances in Computer Chess 2, pp. 31–54. Edinburgh University Press.

BRATKO, I. & S. H. MUGGLETON (1995, November). Applications of inductive logic programming. Communications of the ACM 38(11), 65–70.

BRÜGMANN, B. (1993, March). Monte Carlo Go. Available from ftp://ftp.cse.cuhk.edu.hk/pub/neuro/GO/mcgo.tex. Unpublished manuscript.


BURO, M. (1995a). ProbCut: An effective selective extension of the alpha-beta algorithm. International Computer Chess Association Journal 18(2), 71–76.

BURO, M. (1995b). Statistical feature combination for the evaluation of game positions. Journal of Artificial Intelligence Research 3, 373–382.

BURO, M. (1997). The Othello match of the year: Takeshi Murakami vs. Logistello. International Computer Chess Association Journal 20(3), 189–193.

BURO, M. (1998). From simple features to sophisticated evaluation functions. In H. J. van den Herik and H. Iida (Eds.), Proceedings of the First International Conference on Computers and Games (CG-98), Volume 1558 of Lecture Notes in Computer Science, Tsukuba, Japan, pp. 126–145. Springer-Verlag.

BURO, M. (1999a). Experiments with Multi-ProbCut and a new high-quality evaluation function for Othello. In H. J. van den Herik and H. Iida (Eds.), Games in AI Research. Maastricht, The Netherlands.

BURO, M. (1999b). Toward opening book learning. International Computer Chess Association Journal 22(2), 98–102. Research Note.

BURO, M. (2001). Toward opening book learning. In J. Fürnkranz and M. Kubat (Eds.), Machines that Learn to Play Games, Chapter 4, pp. 81–89. Huntington, NY: Nova Science Publishers. To appear.

CAMPBELL, M. S. (1999, November). Knowledge discovery in Deep Blue. Communications of the ACM 42(11), 65–67.

CARMEL, D. & S. MARKOVITCH (1993). Learning models of opponent's strategy in game playing. Number FS-93-02, Menlo Park, CA, pp. 140–147. The AAAI Press.

CARMEL, D. & S. MARKOVITCH (1998a, July). Model-based learning of interaction strategies in multiagent systems. Journal of Experimental and Theoretical Artificial Intelligence 10(3), 309–332.

CARMEL, D. & S. MARKOVITCH (1998b, March). Pruning algorithms for multi-model adversary search. Artificial Intelligence 99(2), 325–355.

CAZENAVE, T. (1998). Metaprogramming forced moves. In H. Prade (Ed.), Proceedings of the 13th European Conference on Artificial Intelligence (ECAI-98), Brighton, U.K., pp. 645–649. Wiley.

CHAKRABARTI, S. (2000, January). Data mining for hypertext: A tutorial survey. SIGKDD Explorations 1(2), 1–11.

CHASE, W. G. & H. A. SIMON (1973). The mind's eye in chess. In W. G. Chase (Ed.), Visual Information Processing: Proceedings of the 8th Annual Carnegie Psychology Symposium. New York: Academic Press. Reprinted in (Collins and Smith 1988).

CLARKE, M. R. B. (1977). A quantitative study of king and pawn against king. In M. Clarke (Ed.), Advances in Computer Chess, pp. 108–118. Edinburgh University Press.


COLLINS, A. & E. E. SMITH (Eds.) (1988). Readings in Cognitive Science. Morgan Kaufmann.

COLLINS, G., L. BIRNBAUM, B. KRULWICH, & M. FREED (1993). The role of self-models in learning to plan. In A. L. Meyrowitz and S. Chipman (Eds.), Foundations of Knowledge Acquisition: Machine Learning, pp. 83–116. Boston: Kluwer Academic Publishers.

CRITES, R. & A. BARTO (1998). Elevator group control using multiple reinforcement learning agents. Machine Learning 33, 235–262.

DAHL, F. A. (2001). Honte, a Go-playing program using neural nets. In J. Fürnkranz and M. Kubat (Eds.), Machines that Learn to Play Games, Chapter 10, pp. 205–223. Huntington, NY: Nova Science Publishers. To appear.

DE GROOT, A. D. (1965). Thought and Choice in Chess. The Hague, The Netherlands: Mouton.

DE GROOT, A. D. & F. GOBET (1996). Perception and Memory in Chess: Heuristics of the Professional Eye. Assen, The Netherlands: Van Gorcum.

DE RAEDT, L. (Ed.) (1995). Advances in Inductive Logic Programming, Volume 32 of Frontiers in Artificial Intelligence and Applications. IOS Press.

DIETTERICH, T. G. (1997, Winter). Machine learning research: Four current directions. AI Magazine 18(4), 97–136.

DIETTERICH, T. G. & N. S. FLANN (1997). Explanation-based and reinforcement learning: A unified view. Machine Learning 28(2/3), 169–210.

DONNINGER, C. (1996). CHE: A graphical language for expressing chess knowledge. International Computer Chess Association Journal 19(4), 234–241.

EGNOR, D. (2000, March). Iocaine Powder. International Computer Games Association Journal 23(1), 33–35. Research Note.

ENDERTON, H. D. (1991, December). The Golem Go program. Technical Report CMU-CS-92-101, School of Computer Science, Carnegie-Mellon University.

ENGLE, R. W. & L. BUKSTEL (1978). Memory processes among Bridge players of differing expertise. American Journal of Psychology 91, 673–689.

EPSTEIN, S. L. (1992). Prior knowledge strengthens learning to control search in weak theory domains. International Journal of Intelligent Systems 7, 547–586.

EPSTEIN, S. L. (1994a). For the right reasons: The FORR architecture for learning in a skill domain. Cognitive Science 18(3), 479–511.

EPSTEIN, S. L. (1994b). Identifying the right reasons: Learning to filter decision makers. In R. Greiner and D. Subramanian (Eds.), Proceedings of the AAAI Fall Symposium on Relevance, pp. 68–71. AAAI Press. Technical Report FS-94-02.

EPSTEIN, S. L. (1994c). Toward an ideal trainer. Machine Learning 15, 251–277.

EPSTEIN, S. L. (2001). Learning to play expertly: A tutorial on Hoyle. In J. Fürnkranz and M. Kubat (Eds.), Machines that Learn to Play Games, Chapter 8, pp. 153–178. Huntington, NY: Nova Science Publishers. To appear.


EPSTEIN, S. L., J. J. GELFAND, & J. LESNIAK (1996). Pattern-based learning and spatially-oriented concept formation in a multi-agent, decision-making expert. Computational Intelligence 12(1), 199–221.

FAWCETT, T. E. & F. PROVOST (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery 1(3), 291–316.

FAWCETT, T. E. & P. E. UTGOFF (1992). Automatic feature generation for problem solving systems. In D. Sleeman and P. Edwards (Eds.), Proceedings of the 9th International Conference on Machine Learning, pp. 144–153. Morgan Kaufmann.

FAYYAD, U. M., S. G. DJORGOVSKI, & N. WEIR (1996, Summer). From digitized images to online catalogs — data mining a sky survey. AI Magazine 17(2), 51–66.

FAYYAD, U. M., G. PIATETSKY-SHAPIRO, & P. SMYTH (1996, Fall). From data mining to knowledge discovery in databases. AI Magazine 17(3), 37–54.

FAYYAD, U. M., G. PIATETSKY-SHAPIRO, P. SMYTH, & R. UTHURUSAMY (Eds.) (1995). Advances in Knowledge Discovery and Data Mining. Menlo Park: AAAI Press.

FEIGENBAUM, E. A. (1961). The simulation of verbal learning behavior. In Proceedings of the Western Joint Computer Conference, pp. 121–132. Reprinted in (Shavlik and Dietterich 1990).

FEIGENBAUM, E. A. & J. FELDMAN (Eds.) (1963). Computers and Thought. New York: McGraw-Hill. Republished by AAAI Press/MIT Press, 1995.

FLANN, N. S. (1989). Learning appropriate abstractions for planning in formation problems. In A. M. Segre (Ed.), Proceedings of the 6th International Workshop on Machine Learning, pp. 235–239. Morgan Kaufmann.

FLANN, N. S. (1990). Applying abstraction and simplification to learn in intractable domains. In B. W. Porter and R. Mooney (Eds.), Proceedings of the 7th International Conference on Machine Learning, pp. 277–285. Morgan Kaufmann.

FLANN, N. S. (1992). Correct Abstraction in Counter-Planning: A Knowledge-Compilation Approach. Ph.D. thesis, Oregon State University.

FLANN, N. S. & T. G. DIETTERICH (1989). A study of explanation-based methods for inductive learning. Machine Learning 4, 187–226.

FLINTER, S. & M. T. KEANE (1995). On the automatic generation of case libraries by chunking chess games. In M. Veloso and A. Aamodt (Eds.), Proceedings of the 1st International Conference on Case-Based Reasoning (ICCBR-95), pp. 421–430. Springer-Verlag.

FOGEL, D. B. (1993). Using evolutionary programming to construct neural networks that are capable of playing tic-tac-toe. In Proceedings of the IEEE International Conference on Neural Networks (ICNN-93), San Francisco, pp. 875–879.


FREY, P. W. (1986). Algorithmic strategies for improving the performance of game playing programs. Physica D 22, 355–365. Also published in D. Farmer, A. Lapedes, N. Packard, and B. Wendroff (Eds.), Evolution, Games and Learning: Models for Adaptation in Machine and Nature, North-Holland, 1986.

FUDENBERG, D. & D. K. LEVINE (1998). The Theory of Learning in Games. Series on Economic Learning and Social Evolution. Cambridge, MA: MIT Press.

FÜRNKRANZ, J. (1996, September). Machine learning in computer chess: The next generation. International Computer Chess Association Journal 19(3), 147–160.

FÜRNKRANZ, J. (1997). Knowledge discovery in chess databases: A research proposal. Technical Report OEFAI-TR-97-33, Austrian Research Institute for Artificial Intelligence.

FÜRNKRANZ, J. & M. KUBAT (Eds.) (1999). Workshop Notes: Machine Learning in Game Playing, Bled, Slovenia. 16th International Conference on Machine Learning (ICML-99).

FÜRNKRANZ, J. & M. KUBAT (Eds.) (2001). Machines that Learn to Play Games. Huntington, NY: Nova Science Publishers. To appear.

FÜRNKRANZ, J., B. PFAHRINGER, H. KAINDL, & S. KRAMER (2000). Learning to use operational advice. In W. Horn (Ed.), Proceedings of the 14th European Conference on Artificial Intelligence (ECAI-00), Berlin. In press.

GASSER, R. (1995). Efficiently Harnessing Computational Resources for Exhaustive Search. Ph.D. thesis, ETH Zürich, Switzerland.

GEORGE, M. & J. SCHAEFFER (1990). Chunking for experience. International Computer Chess Association Journal 13(3), 123–132.

GHERRITY, M. (1993). A Game-Learning Machine. Ph.D. thesis, University of California, San Diego, CA.

GINSBERG, M. L. (1998, Winter). Computers, games and the real world. Scientific American Presents 9(4). Special Issue on Exploring Intelligence.

GINSBERG, M. L. (1999). GIB: Steps toward an expert-level bridge-playing program. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden, pp. 584–589.

GOBET, F. (1993). A computer model of chess memory. In Proceedings of the 15th Annual Meeting of the Cognitive Science Society, pp. 463–468.

GOBET, F. & P. J. JANSEN (1994). Towards a chess program based on a model of human memory. In H. J. van den Herik, I. S. Herschberg, and J. W. H. M. Uiterwijk (Eds.), Advances in Computer Chess 7, pp. 35–60. University of Limburg.

GOBET, F. & H. A. SIMON (2001). Human learning in game playing. In J. Fürnkranz and M. Kubat (Eds.), Machines that Learn to Play Games, Chapter 3, pp. 61–80. Huntington, NY: Nova Science Publishers. To appear.

GOLDBERG, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley.


GOULD, J. & R. A. LEVINSON (1994). Experience-based adaptive search. In R. S. Michalski and G. Tecuci (Eds.), Machine Learning: A Multi-Strategy Approach, pp. 579–604. Morgan Kaufmann.

GREENBLATT, R. D., D. E. EASTLAKE III, & S. D. CROCKER (1967). The Greenblatt chess program. In Proceedings of the Fall Joint Computer Conference, pp. 801–810. American Federation of Information Processing Societies. Reprinted in (Levy 1988).

GREER, K. R. C., P. C. OJHA, & D. A. BELL (1999). A pattern-oriented approach to move ordering: the chessmaps heuristic. International Computer Chess Association Journal 22, 13–21.

HAMMOND, K. J. (1989). Case-Based Planning: Viewing Planning as a Memory Task. Boston, MA: Academic Press.

HATONEN, K., M. KLEMETTINEN, H. MANNILA, P. RONKAINEN, & H. TOIVONEN (1996). Knowledge discovery from telecommunication network alarm databases. In Proceedings of the 12th International Conference on Data Engineering (ICDE-96).

HEINZ, E. A. (1999a, March). Endgame databases and efficient index schemes for chess. International Computer Chess Association Journal 22(1), 22–32. Research Note.

HEINZ, E. A. (1999b, June). Knowledgeable encoding and querying of endgame databases. International Computer Chess Association Journal 22(2), 81–97.

HOLDING, D. H. (1985). The Psychology of Chess Skill. Hillsdale, NJ: Lawrence Erlbaum Associates.

HSU, F.-H., T. S. ANANTHARAMAN, M. S. CAMPBELL, & A. NOWATZYK (1990a). Deep Thought. In T. A. Marsland and J. Schaeffer (Eds.), Computers, Chess, and Cognition, Chapter 5, pp. 55–78. Springer-Verlag.

HSU, F.-H., T. S. ANANTHARAMAN, M. S. CAMPBELL, & A. NOWATZYK (1990b, October). A grandmaster chess machine. Scientific American 263(4), 44–50.

HSU, J.-M. (1985). A strategic game program that learns from mistakes. Master's thesis, Northwestern University, Evanston, IL.

HYATT, R. M. (1999, March). Book learning — a methodology to tune an opening book automatically. International Computer Chess Association Journal 22(1), 3–12.

IIDA, H., J. W. H. M. UITERWIJK, H. J. VAN DEN HERIK, & I. S. HERSCHBERG (1994). Thoughts on the application of opponent-model search. In H. J. van den Herik and J. W. H. M. Uiterwijk (Eds.), Advances in Computer Chess 7, pp. 61–78. Maastricht, Holland: Rijksuniversiteit Limburg.

INUZUKA, N., H. FUJIMOTO, T. NAKANO, & H. ITOH (1999). Pruning nodes in the alpha-beta method using inductive logic programming. See Fürnkranz and Kubat (1999).


ISABELLE, J.-F. (1993). Auto-apprentissage, à l'aide de réseaux de neurones, de fonctions heuristiques utilisées dans les jeux stratégiques. Master's thesis, University of Montreal. In French.

JANSEN, P. J. (1990). Problematic positions and speculative play. In T. A. Marsland and J. Schaeffer (Eds.), Computers, Chess, and Cognition, pp. 169–181. New York: Springer-Verlag.

JANSEN, P. J. (1992a). KQKR: Assessing the utility of heuristics. International Computer Chess Association Journal 15(4), 179–191.

JANSEN, P. J. (1992b). KQKR: Awareness of a fallible opponent. International Computer Chess Association Journal 15(3), 111–131.

JANSEN, P. J. (1993, March). KQKR: Speculatively thwarting a human opponent. International Computer Chess Association Journal 16(1), 3–17.

JUNGHANNS, A. & J. SCHAEFFER (1997). Search versus knowledge in game-playing programs revisited. In Proceedings of the 15th International Joint Conference on Artificial Intelligence, Nagoya, Japan, pp. 692–697.

KAINDL, H. (1982). Positional long-range planning in computer chess. In M. R. B. Clarke (Ed.), Advances in Computer Chess 3, pp. 145–167. Pergamon Press.

KASPAROV, G. (1998, March). My 1997 experience with Deep Blue. International Computer Chess Association Journal 21(1), 45–51.

KERNER, Y. (1995). Learning strategies for explanation patterns: Basic game patterns with application to chess. In M. Veloso and A. Aamodt (Eds.), Proceedings of the 1st International Conference on Case-Based Reasoning (ICCBR-95), Volume 1010 of Lecture Notes in Artificial Intelligence, Berlin, pp. 491–500. Springer-Verlag.

KOJIMA, T. (1995). A model of acquisition and refinement of deductive rules in the game of Go. Master's thesis, University of Tokyo. In Japanese.

KOJIMA, T., K. UEDA, & S. NAGANO (1997). An evolutionary algorithm extended by ecological analogy and its application to the game of Go. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI-97), Nagoya, Japan, pp. 684–689.

KOJIMA, T. & A. YOSHIKAWA (2001). Acquisition of Go knowledge from game records. In J. Fürnkranz and M. Kubat (Eds.), Machines that Learn to Play Games, Chapter 9, pp. 179–204. Huntington, NY: Nova Science Publishers. To appear.

KOLLER, D. & A. J. PFEFFER (1997). Representations and solutions for game-theoretic problems. Artificial Intelligence 94(1-2), 167–215.

KRULWICH, B. (1993). Flexible Learning in a Multi-Component Planning System. Ph.D. thesis, The Institute for the Learning Sciences, Northwestern University, Evanston, IL. Technical Report #46.

KRULWICH, B., L. BIRNBAUM, & G. COLLINS (1995). Determining what to learn through component-task modeling. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), pp. 439–445.


KUVAYEV, L. (1997). Learning to play Hearts. In Proceedings of the 14th National Conference on Artificial Intelligence (AAAI-97), Providence, RI, pp. 837. AAAI Press. Extended Abstract.

LAIRD, J. E., A. NEWELL, & P. S. ROSENBLOOM (1987). SOAR: An architecture for general intelligence. Artificial Intelligence 33, 1–64.

LAKE, R., J. SCHAEFFER, & P. LU (1994). Solving large retrograde analysis problems using a network of workstations. In H. J. van den Herik and J. W. H. M. Uiterwijk (Eds.), Advances in Computer Chess 7, pp. 135–162. Maastricht, Holland: Rijksuniversiteit Limburg.

LAVRAC, N. (1999). Machine learning for data mining in medicine. In W. Horn, Y. Shahar, G. Lindberg, S. Andreassen, and J. Wyatt (Eds.), Artificial Intelligence in Medicine, pp. 47–62. Berlin: Springer-Verlag.

LAVRAC, N. & S. DZEROSKI (1993). Inductive Logic Programming: Techniques and Applications. Ellis Horwood.

LEE, K.-F. & S. MAHAJAN (1988). A pattern classification approach to evaluation function learning. Artificial Intelligence 36, 1–25.

LEE, K.-F. & S. MAHAJAN (1990). The development of a world class Othello program. Artificial Intelligence 43, 21–36.

LEOUSKI, A. & P. E. UTGOFF (1996, March). What a neural network can learn about Othello. Technical Report UM-CS-1996-010, Computer Science Department, Lederle Graduate Research Center, University of Massachusetts, Amherst, MA.

LEVINSON, R. A., B. BEACH, R. SNYDER, T. DAYAN, & K. SOHN (1992). Adaptive-predictive game-playing programs. Journal of Experimental and Theoretical Artificial Intelligence 4(4).

LEVINSON, R. A. & R. SNYDER (1991). Adaptive pattern-oriented chess. In L. Birnbaum and G. Collins (Eds.), Proceedings of the 8th International Workshop on Machine Learning (ML-91), pp. 85–89. Morgan Kaufmann.

LEVINSON, R. A. & R. SNYDER (1993, September). DISTANCE: Toward the unification of chess knowledge. International Computer Chess Association Journal 16(3), 123–136.

LEVY, D. N. (1986). Chess and Computers. London: Batsford.

LEVY, D. N. (Ed.) (1988). Computer Chess Compendium. London: Batsford.

LEVY, D. N. & D. F. BEAL (Eds.) (1989). Heuristic Programming in Artificial Intelligence — The First Computer Olympiad. Chichester, England: Ellis Horwood.

LEVY, D. N. & M. NEWBORN (1991). How Computers Play Chess. Computer Science Press.

LITTMAN, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning (ML-94), New Brunswick, NJ, pp. 157–163. Morgan Kaufmann.


LYNCH, M. (1997, June). Neurodraughts: An application of temporal difference learning to Draughts. Final Year Project Report, Department of Computer Science and Information Systems, University of Limerick, Ireland.

MARSLAND, T. A. (1985). Evaluation-function factors. International Computer Chess Association Journal 8(2), 47–57.

MATHEUS, C. J. (1989). A constructive induction framework. In Proceedings of the 6th International Workshop on Machine Learning, pp. 474–475.

MATSUBARA, H. (Ed.) (1998). Advances in Computer Shogi 2. Kyouritsu Shuppan Corp. ISBN 4-320-02892-9. In Japanese.

MCCORDUCK, P. (1979). Machines Who Think — A Personal Inquiry into the History and Prospects of Artificial Intelligence. San Francisco, CA: W. H. Freeman and Company.

MEULEN, M. V. D. (1989). Weight assessment in evaluation functions. In D. F. Beal (Ed.), Advances in Computer Chess 5, pp. 81–89. Amsterdam: Elsevier.

MICHALSKI, R. S. (1969). On the quasi-minimal solution of the covering problem. In Proceedings of the 5th International Symposium on Information Processing (FCIP-69), Volume A3 (Switching Circuits), Bled, Yugoslavia, pp. 125–128.

MICHALSKI, R. S., I. BRATKO, & M. KUBAT (Eds.) (1998). Machine Learning and Data Mining: Methods and Applications. John Wiley & Sons.

MICHALSKI, R. S. & P. NEGRI (1977). An experiment on inductive learning in chess endgames. In E. W. Elcock and D. Michie (Eds.), Machine Intelligence 8, pp. 175–192. Chichester, U.K.: Ellis Horwood.

MICHIE, D. (1961). Trial and error. In S. A. Barnett and A. McLaren (Eds.), Science Survey, Part 2, pp. 129–145. Harmondsworth, U.K.: Penguin. Reprinted in (Michie 1986).

MICHIE, D. (1963). Experiments on the mechanization of game-learning – Part I. Characterization of the model and its parameters. The Computer Journal 6, 232–236.

MICHIE, D. (1982). Experiments on the mechanization of game-learning 2 – rule-based learning and the human window. The Computer Journal 25(1), 105–113.

MICHIE, D. (1983). Game-playing programs and the conceptual interface. In M. Bramer (Ed.), Computer Game-Playing: Theory and Practice, Chapter 1, pp. 11–25. Chichester, England: Ellis Horwood.

MICHIE, D. (1986). On Machine Intelligence (2nd Edition ed.). Chichester, UK: Ellis Horwood Limited.

MICHIE, D. & I. BRATKO (1991, March). Comments to 'Chunking for experience'. International Computer Chess Association Journal 18(1), 18.

MINSKY, M. (1963). Steps toward artificial intelligence. In E. A. Feigenbaum and J. Feldman (Eds.), Computers and Thought. New York: McGraw-Hill.

MINSKY, M. & S. A. PAPERT (1969). Perceptrons: Introduction to Computational Geometry. MIT Press. Expanded Edition 1990.


MINTON, S. (1984). Constraint-based generalization: Learning game playing plans from single examples. In Proceedings of the 2nd National Conference on Artificial Intelligence (AAAI-84), Austin, TX, pp. 251–254.

MINTON, S. (1990). Quantitative results concerning the utility of explanation-based learning. Artificial Intelligence 42, 363–392.

MITCHELL, D. H. (1984). Using features to evaluate positions in experts' and novices' Othello games. Master's thesis, Northwestern University, Evanston, IL.

MITCHELL, T. M. (1997a, Fall). Does machine learning really work? AI Magazine 18(3), 11–20.

MITCHELL, T. M. (1997b). Machine Learning. McGraw-Hill.

MITCHELL, T. M., R. M. KELLER, & S. KEDAR-CABELLI (1986). Explanation-based generalization: A unifying view. Machine Learning 1(1), 47–80.

MITCHELL, T. M., P. E. UTGOFF, & R. BANERJI (1983). Learning by experimentation: Acquiring and refining problem-solving heuristics. In R. Michalski, J. Carbonell, and T. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Chapter 6. San Mateo, CA: Morgan Kaufmann.

MONDERER, D. & M. TENNENHOLTZ (1997). Dynamic non-Bayesian decision making. Journal of Artificial Intelligence Research 7, 231–248.

MORALES, E. (1991). Learning features by experimentation in chess. In Y. Kodratoff (Ed.), Proceedings of the 5th European Working Session on Learning (EWSL-91), pp. 494–511. Springer-Verlag.

MORALES, E. (1994). Learning patterns for playing strategies. International Computer Chess Association Journal 17(1), 15–26.

MORALES, E. (1997). PAL: A pattern-based first-order inductive system. Machine Learning 26(2-3), 227–252. Special Issue on Inductive Logic Programming.

MORIARTY, D. & R. MIIKKULAINEN (1994). Evolving neural networks to focus minimax search. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), pp. 1371–1377.

MOSTOW, D. J. (1981). Mechanical Transformation of Task Heuristics into Operational Procedures. Ph.D. thesis, Carnegie Mellon University, Department of Computer Science.

MOSTOW, D. J. (1983). Machine transformation of advice into a heuristic search procedure. In Machine Learning: An Artificial Intelligence Approach, pp. 367–403. Morgan Kaufmann.

MUGGLETON, S. H. (1988). Inductive acquisition of chess strategies. In J. E. Hayes, D. Michie, and J. Richards (Eds.), Machine Intelligence 11, Chapter 17, pp. 375–387. Clarendon Press.

MUGGLETON, S. H. (1990). Inductive Acquisition of Expert Knowledge. Turing Institute Press. Addison-Wesley.

MUGGLETON, S. H. (Ed.) (1992). Inductive Logic Programming. London: Academic Press Ltd.


MUGGLETON, S. H. & L. DE RAEDT (1994). Inductive Logic Programming: Theory and methods. Journal of Logic Programming 19–20, 629–679.

MÜLLER, M. (1999). Computer Go: A research agenda. International Computer Chess Association Journal 22(2), 104–112.

NAKANO, T., N. INUZUKA, H. SEKI, & H. ITOH (1998). Inducing Shogi heuristics using inductive logic programming. In D. Page (Ed.), Proceedings of the 8th International Conference on Inductive Logic Programming (ILP-98), Madison, WI, pp. 155–164. Springer.

NEWBORN, M. (1996). Kasparov versus Deep Blue: Computer Chess Comes of Age. New York: Springer-Verlag.

NITSCHE, T. (1982). A learning chess program. In M. R. B. Clarke (Ed.), Advances in Computer Chess 3, pp. 113–120. Pergamon Press.

NUNN, J. (1992). Secrets of Rook Endings. Batsford.

NUNN, J. (1994a). Extracting information from endgame databases. In H. J. van den Herik, I. S. Herschberg, and J. W. H. M. Uiterwijk (Eds.), Advances in Computer Chess 7, pp. 19–34. Maastricht, The Netherlands: University of Limburg.

NUNN, J. (1994b). Secrets of Pawnless Endings. Batsford.

NUNN, J. (1995). Secrets of Minor-Piece Endings. Batsford.

OPDAHL, A. L. & B. TESSEM (1994). Long-term planning in computer chess. In H. J. van den Herik, I. S. Herschberg, and J. W. H. M. Uiterwijk (Eds.), Advances in Computer Chess 7. University of Limburg.

PATERSON, A. (1983). An attempt to use CLUSTER to synthesise humanly intelligible subproblems for the KPK chess endgame. Technical Report UIUCDCS-R-83-1156, University of Illinois, Urbana, IL.

PITRAT, J. (1976a). A program to learn to play chess. In Chen (Ed.), Pattern Recognition and Artificial Intelligence, pp. 399–419. New York: Academic Press.

PITRAT, J. (1976b). Realization of a program learning to find combinations at chess. In J. C. Simon (Ed.), Computer Oriented Learning Processes, Volume 14 of NATO Advanced Study Institute Series, Series E: Applied Science. Leyden: Noordhoff.

PITRAT, J. (1977). A chess combination program which uses plans. Artificial Intelligence 8, 275–321.

PLAAT, A. (1996). Research Re: Search and Re-Search. Ph.D. thesis, Erasmus University, Rotterdam, The Netherlands.

POLLACK, J. B. & A. D. BLAIR (1998). Co-evolution in the successful learning of backgammon strategy. Machine Learning 32(1), 225–240.

PUGET, J.-F. (1987). Goal regression with opponent. In Progress in Machine Learning, pp. 121–137. Sigma Press.

QUINLAN, J. R. (1979). Discovering rules by induction from large collections of examples. In D. Michie (Ed.), Expert Systems in the Micro Electronic Age, pp. 168–201. Edinburgh University Press.


QUINLAN, J. R. (1983). Learning efficient classification procedures and their application to chess endgames. In R. S. Michalski, J. G. Carbonell, and T. M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, pp. 463–482. Palo Alto: Tioga.

RATTERMANN, M. J. & S. L. EPSTEIN (1995). Skilled like a person: A comparison of human and computer game playing. In Proceedings of the 17th Annual Conference of the Cognitive Science Society, Pittsburgh, PA, pp. 709–714. Lawrence Erlbaum Associates.

REITMAN, J. S. (1976). Skilled perception in Go: Deducing memory structures from inter-response times. Cognitive Psychology 3, 336–356.

RIESBECK, C. K. & R. C. SCHANK (1989). Inside Case-Based Reasoning. Hillsdale, NJ: Lawrence Erlbaum Associates.

ROBERTIE, B. (1992). Carbon versus silicon: Matching wits with TD-Gammon. Inside Backgammon 2(2), 14–22.

ROBERTIE, B. (1993). Learning from the Machine: Robertie vs. TD-Gammon. Arlington, MA: Gammon Press.

ROSENBLATT, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65, 386–408.

ROYCROFT, A. J. (1988). Expert against oracle. In J. E. Hayes, D. Michie, and J. Richards (Eds.), Machine Intelligence 11, pp. 347–373. Oxford, UK: Oxford University Press.

RUMELHART, D. E. & J. L. MCCLELLAND (Eds.) (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. Cambridge, MA: MIT Press.

SAMUEL, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 3(3), 211–229. Reprinted in (Feigenbaum and Feldman 1963) (with an additional appendix) and in (Shavlik and Dietterich 1990).

SAMUEL, A. L. (1960). Programming computers to play games. In Advances in Computers 1, pp. 165–192. Academic Press.

SAMUEL, A. L. (1967). Some studies in machine learning using the game of checkers. II – Recent progress. IBM Journal of Research and Development 11(6), 601–617.

SCHAEFFER, J. (1983). The history heuristic. International Computer Chess Association Journal 6(3), 16–19.

SCHAEFFER, J. (1986). Experiments in Knowledge and Search. Ph.D. thesis, University of Waterloo, Canada.

SCHAEFFER, J. (1989). The history heuristic and alpha-beta search enhancements in practice. IEEE Pattern Analysis and Machine Intelligence 11(11), 1203–1212.

SCHAEFFER, J. (1997). One Jump Ahead — Challenging Human Supremacy in Checkers. New York: Springer-Verlag.


SCHAEFFER, J. (2000). The games computers (and people) play. In M. V. Zelkowitz (Ed.), Advances in Computers, Volume 50, pp. 189–266. Academic Press.

SCHAEFFER, J., J. CULBERSON, N. TRELOAR, B. KNIGHT, P. LU, & D. SZAFRON (1992). A world championship caliber checkers program. Artificial Intelligence 53(2-3), 273–289.

SCHAEFFER, J., R. LAKE, P. LU, & M. BRYANT (1996). Chinook: The world man-machine checkers champion. AI Magazine 17(1), 21–29.

SCHAEFFER, J. & A. PLAAT (1997, June). Kasparov versus Deep Blue: The rematch. International Computer Chess Association Journal 20(2), 95–101.

SCHANK, R. C. (1986). Explanation Patterns: Understanding Mechanically and Creatively. Hillsdale, NJ: Lawrence Erlbaum.

SCHERZER, T., L. SCHERZER, & D. TJADEN (1990). Learning in Bebe. In T. A. Marsland and J. Schaeffer (Eds.), Computers, Chess, and Cognition, Chapter 12, pp. 197–216. Springer-Verlag.

SCHMIDT, M. (1994). Temporal-difference learning and chess. Technical report, Computer Science Department, University of Aarhus, Aarhus, Denmark.

SCHRAUDOLPH, N. N., P. DAYAN, & T. J. SEJNOWSKI (1994). Temporal difference learning of position evaluation in the game of Go. In J. D. Cowan, G. Tesauro, and J. Alspector (Eds.), Advances in Neural Information Processing 6, pp. 817–824. San Francisco: Morgan Kaufmann.

SHAPIRO, A. D. (1987). Structured Induction in Expert Systems. Turing Institute Press. Addison-Wesley.

SHAPIRO, A. D. & D. MICHIE (1986). A self commenting facility for inductively synthesized endgame expertise. In D. F. Beal (Ed.), Advances in Computer Chess 4, pp. 147–165. Oxford, U.K.: Pergamon Press.

SHAPIRO, A. D. & T. NIBLETT (1982). Automatic induction of classification rules for a chess endgame. In M. R. B. Clarke (Ed.), Advances in Computer Chess 3, pp. 73–92. Oxford, U.K.: Pergamon Press.

SHAVLIK, J. W. & T. G. DIETTERICH (Eds.) (1990). Readings in Machine Learning. Morgan Kaufmann.

SHEPPARD, B. (1999, November/December). Mastering Scrabble. IEEE Intelligent Systems 14(6), 15–16. Research Note.

SIMON, H. A. & M. BARENFELD (1969). Information-processing analysis of perceptual processes in problem solving. Psychological Review 76(5), 473–483.

SIMON, H. A. & K. GILMARTIN (1973). A simulation of memory for chess positions. Cognitive Psychology 5, 29–46.

SLATE, D. J. (1987). A chess program that uses its transposition table to learn from experience. International Computer Chess Association Journal 10(2), 59–71.

SLATE, D. J. & L. R. ATKIN (1983). Chess 4.5 — the Northwestern University chess program. In Chess Skill in Man and Machine (2 ed.), Chapter 4, pp. 82–118. Springer-Verlag.


SMITH, S. F. (1983). Flexible learning of problem solving heuristics through adaptive search. In Proceedings of the 8th International Joint Conference on Artificial Intelligence (IJCAI-83), Los Altos, CA, pp. 421–425. Morgan Kaufmann.

SOMMERLUND, P. (1996, May). Artificial neural nets applied to strategic games.

STOUTAMIRE, D. (1991). Machine learning, game play, and Go. Technical Report TR-91-128, Center for Automation and Intelligent Systems Research, Case Western Reserve University.

SUTTON, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3, 9–44.

SUTTON, R. S. & A. BARTO (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

TADEPALLI, P. V. (1989). Lazy explanation-based learning: A solution to the intractable theory problem. In Proceedings of the 11th International Joint Conference on Artificial Intelligence (IJCAI-89), pp. 694–700. Morgan Kaufmann.

TESAURO, G. (1989a). Connectionist learning of expert preferences by comparison training. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems 1 (NIPS-88), pp. 99–106. Morgan Kaufmann.

TESAURO, G. (1989b). Neurogammon: a neural-network backgammon learning program. In D. N. L. Levy and D. F. Beal (Eds.), Heuristic Programming in Artificial Intelligence: The First Computer Olympiad, pp. 78–80. Ellis Horwood.

TESAURO, G. (1990). Neurogammon: A neural network backgammon program. In Proceedings of the International Joint Conference on Neural Networks (IJCNN-90), Volume III, San Diego, CA, pp. 33–39. IEEE.

TESAURO, G. (1992a). Practical issues in temporal difference learning. Machine Learning 8, 257–278.

TESAURO, G. (1992b). Temporal difference learning of backgammon strategy. Proceedings of the 9th International Conference on Machine Learning, 451–457.

TESAURO, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6(2), 215–219.

TESAURO, G. (1995, March). Temporal difference learning and TD-Gammon. Communications of the ACM 38(3), 58–68.

TESAURO, G. (1998). Comments on 'Co-evolution in the successful learning of backgammon strategy'. Machine Learning 32(3), 241–243.

TESAURO, G. (2001). Comparison training of chess evaluation functions. In J. Fürnkranz and M. Kubat (Eds.), Machines that Learn to Play Games, Chapter 6, pp. 117–130. Huntington, NY: Nova Science Publishers. To appear.

TESAURO, G. & T. J. SEJNOWSKI (1989). A parallel network that learns to play backgammon. Artificial Intelligence 39, 357–390.

THOMPSON, K. (1996). 6-piece endgames. International Computer Chess Association Journal 19(4), 215–226.


THRUN, S. (1995). Learning to play the game of chess. In G. Tesauro, D. Touretzky, and T. Leen (Eds.), Advances in Neural Information Processing Systems 7, pp. 1069–1076. Cambridge, MA: The MIT Press.

TIGGELEN, A. V. (1991). Neural networks as a guide to optimization. The chess middle game explored. International Computer Chess Association Journal 14(3), 115–118.

TUNSTALL-PEDOE, W. (1991). Genetic algorithms optimizing evaluation functions. International Computer Chess Association Journal 14(3), 119–128.

UITERWIJK, J. W. H. M. & H. J. VAN DEN HERIK (1994). Speculative play in computer chess. In H. J. van den Herik and J. W. H. M. Uiterwijk (Eds.), Advances in Computer Chess 7, pp. 79–90. Maastricht, Holland: Rijksuniversiteit Limburg.

UTGOFF, P. E. (2001). Feature construction for game playing. In J. Fürnkranz and M. Kubat (Eds.), Machines that Learn to Play Games, Chapter 7, pp. 131–152. Huntington, NY: Nova Science Publishers. To appear.

UTGOFF, P. E. & J. CLOUSE (1991). Two kinds of training information for evaluation function learning. In Proceedings of the 9th National Conference on Artificial Intelligence (AAAI-91), Anaheim, CA, pp. 596–600. AAAI Press.

UTGOFF, P. E. & P. S. HEITMAN (1988). Learning and generalizing move selection preferences. In H. Berliner (Ed.), Proceedings of the AAAI Spring Symposium on Computer Game Playing, Stanford University, pp. 36–40.

UTGOFF, P. E. & D. PRECUP (1998). Constructive function approximation. In H. Liu and H. Motoda (Eds.), Feature Extraction, Construction and Selection: A Data Mining Perspective, Volume 453 of The Kluwer International Series in Engineering and Computer Science, Chapter 14. Kluwer Academic Publishers.

VAN DEN HERIK, H. J. (1988). Informatica en het Menselijk Blikveld. Maastricht, The Netherlands: University of Limburg. In Dutch.

VERHOEF, T. F. & J. H. WESSELIUS (1987). Two-ply KRKN: Safely overtaking Quinlan. International Computer Chess Association Journal 10(4), 181–190.

VIGNERON, H. (1914). Robots. La Nature, 56–61. In French. Reprinted in English in (Levy 1986) and (Levy 1988).

WALCZAK, S. (1991). Predicting actions from induction on past performance. In L. Birnbaum and G. Collins (Eds.), Proceedings of the 8th International Workshop on Machine Learning (ML-91), pp. 275–279. Morgan Kaufmann.

WALCZAK, S. (1992). Pattern-based tactical planning. International Journal of Pattern Recognition and Artificial Intelligence 6(5), 955–988.

WALCZAK, S. (1996). Improving opening book performance through modeling of chess opponents. In Proceedings of the 24th ACM Annual Computer Science Conference, pp. 53–57.

WALCZAK, S. & D. D. DANKEL II (1993). Acquiring tactical and strategic knowledge with a generalized method for chunking of game pieces. International Journal of Intelligent Systems 8(2), 249–270. Reprinted in K. Ford and J. Bradshaw (Eds.), Knowledge Acquisition As Modeling, New York: Wiley.

WATERMAN, D. A. (1970). A generalization learning technique for automating the learning of heuristics. Artificial Intelligence 1, 121–170.

WATKINS, C. J. & P. DAYAN (1992). Q-learning. Machine Learning 8, 279–292.

WEILL, J.-C. (1994). How hard is the correct coding of an easy endgame. In H. J. van den Herik, I. S. Herschberg, and J. W. H. M. Uiterwijk (Eds.), Advances in Computer Chess 7, pp. 163–176. University of Limburg.

WIDMER, G. (1996). Learning expressive performance: The structure-level approach. Journal of New Music Research 25(2), 179–205.

WILKINS, D. E. (1980). Using patterns and plans in chess. Artificial Intelligence 14(3), 165–203.

WILKINS, D. E. (1982). Using knowledge to control tree searching. Artificial Intelligence 18(1), 1–51.

WINKLER, F.-G. & J. FÜRNKRANZ (1998, March). A hypothesis on the divergence of AI research. International Computer Chess Association Journal 21(1), 3–13.

WITTEN, I. H. & E. FRANK (2000). Data Mining — Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers.

WOLFF, A. S., D. H. MITCHELL, & P. W. FREY (1984). Perceptual skill in the game of Othello. Journal of Psychology 118, 7–16.

ZOBRIST, A. L. & F. R. CARLSON (1973, June). An advice-taking chess computer. Scientific American 228(6), 92–105.
