
Reinforcement Learning with DNNs: AlphaGo to AlphaZero

CS 760: Machine Learning, Spring 2018

Mark Craven and David Page

www.biostat.wisc.edu/~craven/cs760


Goals for the Lecture

• You should understand the following concepts:
  • Monte Carlo tree search (MCTS)
  • Self-play
  • Residual neural networks
  • AlphaZero algorithm


A Brief History of Game-Playing as a CS/AI Test of Progress

• 1944: Alan Turing and Donald Michie simulate by hand their chess algorithms during lunches at Bletchley Park
• 1959: Arthur Samuel’s checkers algorithm (machine learning)
• 1961: Michie’s Matchbox Educable Noughts And Crosses Engine (MENACE)
• 1991: Computer solves a chess endgame thought to be a draw: KRB beats KNN (223 moves)
• 1992: TD-Gammon trains for backgammon by self-play reinforcement learning
• 1997: Computers best in the world at chess (Deep Blue beats Kasparov)
• 2007: Checkers “solved” by computer (guaranteed optimal play)
• 2016: Computers best at Go (AlphaGo beats Lee Sedol)
• 2017 (4 months ago): AlphaZero extends AlphaGo to best at chess, shogi

Only Some of These Involved Learning

• 1944: Alan Turing and Donald Michie simulate by hand their chess algorithms during lunches at Bletchley Park
• 1959: Arthur Samuel’s checkers algorithm (machine learning)
• 1961: Michie’s Matchbox Educable Noughts And Crosses Engine (MENACE)
• 1991: Computer solves a chess endgame thought to be a draw: KRB beats KNN (223 moves)
• 1992: TD-Gammon trains for backgammon by self-play reinforcement learning
• 1997: Computers best in the world at chess (Deep Blue beats Kasparov)
• 2007: Checkers “solved” by computer (guaranteed optimal play)
• 2016: Computers best at Go (AlphaGo beats Lee Sedol)
• 2017 (4 months ago): AlphaZero extends AlphaGo to best at chess, shogi


Background: Game Playing

• Until last year, the state of the art for many games, including chess, was minimax search with alpha-beta pruning (recall Intro to AI)
• Most top-performing game-playing programs didn’t do learning
• The game of Go was one of the few games where humans still outperformed computers

Minimax in a Picture (thanks, Wikipedia)
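To make the picture concrete, here is a minimal negamax-style sketch of minimax search with alpha-beta pruning; the toy nested-list game tree and the helper callables are illustrative assumptions, not anything from the lecture.

```python
# Minimax with alpha-beta pruning, written in negamax form.
# The "tree" is just nested Python lists whose leaves are static evaluation
# scores (from the perspective of the player to move at that leaf).
def alphabeta(state, depth, alpha, beta, evaluate, children):
    """Value of `state` for the player to move, searching `depth` plies."""
    kids = children(state)
    if depth == 0 or not kids:
        return evaluate(state)
    best = float("-inf")
    for child in kids:
        # Negamax: the child's value for the opponent, negated for us.
        value = -alphabeta(child, depth - 1, -beta, -alpha, evaluate, children)
        best = max(best, value)
        alpha = max(alpha, value)
        if alpha >= beta:        # cut-off: the opponent will never allow this line
            break
    return best

# Toy depth-2 tree: the root's three moves, each leading to the opponent's replies.
tree = [[3, 5, 6], [2, 9], [1, 2, 0]]
print(alphabeta(tree, depth=2,
                alpha=float("-inf"), beta=float("inf"),
                evaluate=lambda s: s,                                  # leaves are scores
                children=lambda s: s if isinstance(s, list) else []))  # -> 3
```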

Monte Carlo Tree Search (MCTS) in a Picture (thanks, Wikipedia)

Rollout (Random Search)
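Putting the four MCTS phases in the picture together (selection, expansion, rollout, backpropagation), here is a small self-contained Python sketch. The UCT selection rule and the toy take-away game are illustrative choices, not part of the lecture, and this is plain MCTS with random rollouts rather than the AlphaZero variant described later.

```python
import math, random

class Nim:
    """Toy game for illustration: players alternately take 1-3 stones from a
    pile; whoever takes the last stone wins."""
    def __init__(self, stones=15):
        self.stones = stones
    def legal_moves(self):
        return [m for m in (1, 2, 3) if m <= self.stones]
    def play(self, move):
        return Nim(self.stones - move)
    def is_over(self):
        return self.stones == 0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                 # move -> child Node
        self.visits, self.value = 0, 0.0   # value is from the perspective of
                                           # the player to move at this node

def uct_select(node, c=1.4):
    # Selection: pick the child maximizing the UCT upper confidence bound.
    # A child's value is stored from its own player's view, so the parent negates it.
    def score(child):
        return -child.value / child.visits + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def rollout(state):
    # Simulation: play uniformly random moves to the end of the game; return
    # +1/-1 from the perspective of the player to move at `state`.
    sign = 1.0
    while not state.is_over():
        state = state.play(random.choice(state.legal_moves()))
        sign = -sign
    return -sign    # the player to move at the terminal state has lost

def mcts(root_state, n_simulations=2000):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: walk down while the node is fully expanded.
        while node.children and len(node.children) == len(node.state.legal_moves()):
            _, node = uct_select(node)
        # 2. Expansion: add one unexplored child (unless the node is terminal).
        if not node.state.is_over():
            move = random.choice([m for m in node.state.legal_moves()
                                  if m not in node.children])
            node.children[move] = Node(node.state.play(move), parent=node)
            node = node.children[move]
        # 3. Simulation (random rollout).
        result = rollout(node.state)
        # 4. Backpropagation: alternate the sign at each level up the tree.
        while node is not None:
            node.visits += 1
            node.value += result
            result, node = -result, node.parent
    # Play the most-visited move at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

if __name__ == "__main__":
    print("MCTS move from a 15-stone pile:", mcts(Nim(15)))
```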

Reinforcement Learning by AlphaGo, AlphaGo Zero, and AlphaZero: Key Insights

• MCTS with self-play
  • Don’t have to guess what the opponent might do, so…
  • If no exploration, a big-branching game tree becomes one path
  • You get an automatically improving, evenly matched opponent who is accurately learning your strategy
• “We have met the enemy, and he is us” (famous variant from Pogo, 1954)
• No need for human expert scoring rules for boards from unfinished games
• Treat the board as an image: use a residual convolutional neural network
• AlphaGo Zero: one deep neural network learns both the value function and policy in parallel
• AlphaZero: removed rollout altogether from MCTS and just used current neural net estimates instead

AlphaZero (Dec 2017): Minimized Required Game Knowledge, Extended from Go to Chess and Shogi

AlphaZero’s Version of Q-Learning

• No discount on future rewards
• Rewards of 0 until the end of the game; then a reward of -1 or +1
• Therefore the Q-value for an action a or policy π from a state S is exactly the value function: Q(S, π) = V(S, π) (see the short derivation after this list)
• AlphaZero uses one DNN (details in a bit) to model both π and V
• Updates to the DNN are made (training examples provided) after each game
• During the game, need to balance exploitation and exploration
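A one-line way to see the Q = V identity claimed above (my reconstruction of the reasoning, not text from the slides): with no discounting and rewards only at the end of the game, the return from any position is simply the final outcome z.

```latex
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k = r_T = z
\qquad (\gamma = 1,\ r_k = 0 \text{ for } k < T,\ z \in \{-1, +1\})
```

So both the action-value and the state-value under a policy π reduce to the expected game outcome:

```latex
Q^{\pi}(S, a) = \mathbb{E}\left[\, z \mid S, a, \pi \,\right],
\qquad
V^{\pi}(S) = \mathbb{E}\left[\, z \mid S, \pi \,\right]
```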

AlphaZero Algorithm

Initialize DNN f_θ
Repeat forever:
    Play Game
    Update θ

Play Game:
    Repeat until win or lose:
        From current state S, perform MCTS
        Estimate move probabilities π by MCTS
        Record (S, π) as an example
        Randomly draw the next move from π

Update θ:
    Let z be the previous game outcome (+1 or -1)
    Sample from the last game’s examples (S, π, z)
    Train DNN f_θ on the sample to get new θ
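As a sketch of the control flow above in Python: the game interface (is_over / play / outcome) and the helpers mcts_move_probs(...) and net.train_on(...) are hypothetical placeholders standing in for the MCTS and DNN machinery, so this illustrates the loop rather than reproducing DeepMind's implementation.

```python
import random

def play_game(net, state, mcts_move_probs):
    """Self-play one game; record an (S, pi) training example at every position."""
    examples = []
    while not state.is_over():
        pi = mcts_move_probs(net, state)        # dict: move -> MCTS visit probability
        examples.append((state, pi))
        move, = random.choices(list(pi.keys()), weights=list(pi.values()))
        state = state.play(move)
    return examples, state.outcome()            # z = +1 or -1, first player's view

def update_theta(net, examples, z, batch_size=32):
    """Attach the final outcome z to every recorded (S, pi) and train f_theta."""
    labeled, sign = [], +1
    for s, pi in examples:                      # z as seen by the player to move
        labeled.append((s, pi, sign * z))
        sign = -sign
    batch = random.sample(labeled, min(batch_size, len(labeled)))
    net.train_on(batch)                         # minimize combined policy + value loss

def alphazero_loop(net, new_game, mcts_move_probs, n_games):
    # "Initialize DNN f_theta; repeat: Play Game, then Update theta."
    for _ in range(n_games):
        examples, z = play_game(net, new_game(), mcts_move_probs)
        update_theta(net, examples, z)
```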

AlphaZero Play-Game

AlphaZero Train DNN

AlphaZero Monte Carlo Tree Search (MCTS)

Why Need MCTS At All?

• Could always make the move the DNN says has the highest Q: no exploration
• Could just draw a move from the DNN’s policy output
• The papers say the MCTS output probability vector π selects stronger moves than just directly using the neural network’s policy output itself (is there a possible lesson here for self-driving cars too??)
• Still need to decide how many times to repeat the MCTS search (game-specific) and how to trade off exploration and exploitation in MCTS… The AlphaZero paper just says to choose the move with “low count, high move probability, and high value”; the AlphaGo paper is more specific: maximize an upper confidence bound
• Where τ is the temperature [1, 2], and N(s,b) is the count of times action b has been taken from state s, raised to the power 1/τ, choose:
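The selection rule the last bullet points at did not survive extraction; reconstructing it from the AlphaGo Zero paper, the move played at the root state s is drawn in proportion to exponentiated visit counts:

```latex
\pi(a \mid s) \;=\; \frac{N(s, a)^{1/\tau}}{\sum_{b} N(s, b)^{1/\tau}}
```

Small τ (approaching 0) plays the most-visited move nearly deterministically, while τ = 1 samples in proportion to visit counts. For the in-tree exploration/exploitation trade-off, the AlphaGo-style rule selects the action maximizing Q(s, a) + U(s, a), with U(s, a) proportional to P(s, a) · sqrt(Σ_b N(s, b)) / (1 + N(s, a)), where P is the network’s prior move probability.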

AlphaZero DNN Architecture: Input Nodes Represent Current Game State, Including Any Needed History

AlphaZero DNN Architecture: Output Nodes Represent Policy and Value Function

• A policy is a probability distribution over all possible moves from a state, so we need units to represent all possible moves
• Chess is the most complicated game for describing moves (though Go and shogi have higher numbers of moves to consider), so here is the encoding for chess moves:
  • 8 x 8 = 64 possible starting positions for a move
  • 56 possible destinations for queen moves: 8 compass directions {N, NE, E, SE, S, SW, W, NW} times 7 possible move lengths
  • Another 17 possible destinations for irregular moves such as knight moves
• Some moves are impossible, depending on the particular piece at a position (e.g., a pawn can’t make all queen moves) and the location of other pieces (a queen can’t move through 2 other pieces to attack a third)
• Weights for impossible moves are set to 0 and not allowed to change
• Another layer normalizes the results into a probability distribution (a small sketch of this masking and normalization follows this list)
• One deep neural network learns both the value function and the policy in parallel: one additional output node for the value function, which estimates the expected outcome in the range [-1, 1] for following the current policy from the present (input) state
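Below is a small numerical sketch of the move-slot counting and of one common way to keep impossible moves at probability zero while normalizing the rest (output masking plus a softmax). The slides describe zeroing the corresponding weights, so treat the masking here as an illustrative stand-in rather than the lecture's exact mechanism; all names are my own.

```python
import numpy as np

N_SQUARES, N_MOVE_TYPES = 8 * 8, 56 + 17        # 56 queen-like + 17 irregular destinations
N_MOVE_SLOTS = N_SQUARES * N_MOVE_TYPES          # = 4672 policy outputs for chess

def masked_policy(logits, legal_mask):
    """Softmax over legal moves only; impossible moves get probability 0."""
    logits = np.where(legal_mask, logits, -np.inf)   # effectively "switch off" illegal slots
    z = logits - logits[legal_mask].max()            # stabilize the exponentials
    probs = np.where(legal_mask, np.exp(z), 0.0)
    return probs / probs.sum()

# Example: random raw scores for all slots, with 20 (hypothetical) legal moves.
rng = np.random.default_rng(0)
logits = rng.normal(size=N_MOVE_SLOTS)
legal = np.zeros(N_MOVE_SLOTS, dtype=bool)
legal[rng.choice(N_MOVE_SLOTS, size=20, replace=False)] = True

pi = masked_policy(logits, legal)
print(N_MOVE_SLOTS, round(pi.sum(), 6), pi[~legal].max())   # 4672 1.0 0.0
```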

Deep Neural Networks Trick #9: ResNets (Residual Networks)

• What if your neural network is too deep?
• In theory, that’s no problem, given sufficient nodes and connectivity: early (or late) layers can just learn the identity function (autoencoder)
• In practice, deep neural networks fail to learn the identity when needed
• A solution: make the identity easy, or even the default; the network has to work hard to actually learn a non-zero residual (and hence a non-identity); see the small numeric sketch after this list
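A tiny numeric sketch of that point (my own illustration, not from the slides): with the residual branch’s weights at zero, a block with a skip connection just passes its input through (up to the final ReLU), whereas the same layers without the skip would output all zeros. "Doing nothing" is therefore the default, and only a non-zero residual changes the input.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    # Two-layer branch F(x) = W2·relu(W1·x), plus the identity skip connection.
    return relu(W2 @ relu(W1 @ x) + x)

def plain_block(x, W1, W2):
    # The same two layers without the skip connection.
    return relu(W2 @ relu(W1 @ x))

x = np.array([0.5, -1.2, 2.0, 0.1])
W_zero = np.zeros((4, 4))                       # an "untrained"/lazy branch

print(residual_block(x, W_zero, W_zero))        # relu(x): the input passes through
print(plain_block(x, W_zero, W_zero))           # all zeros: the input is lost
```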

Residual Network in a Picture (He, Zhang, Ren, Sun, 2015): Identity Skip Connection

Note: the output and input dimensionality need to be the same.

Why called “residual”?

Deep Residual Networks (ResNets): Start of a 35-Layer ResNet (He, Zhang, Ren, Sun, 2015)

Dotted line denotes an increase in dimension (2 more such increases)

A Brief Aside: Leaky ReLUs

• Rectifiers used could be ReLU or “Leaky ReLU”
• Leaky ReLU addresses the “dying ReLU” problem: when the input sum is below some value, the output is 0, so there is no gradient for training
• ReLU: f(x) = max(0, x)
• Leaky ReLU: f(x) = x for x > 0, and αx otherwise, for a small constant α (e.g., 0.01)

[Plots: ReLU vs. Leaky ReLU]

AlphaZero DNN Architecture: Hidden Units Arranged in a Residual Network (a CNN with Residual Layers)

[Diagram] Input → Conv Block (3x3, 256 filters, stride 1) → Res Block (3x3, 256, stride 1) → Res Block (3x3, 256, stride 1) → … repeated for 39 Res Blocks → Policy Head and Value Head

AlphaZero DNN Architecture: Convolution Block

AlphaZero DNN Architecture: Residual Blocks
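A PyTorch-style sketch of the "3x3, 256, stride 1" block pattern listed in the tower above, following the convolutional-block and residual-block descriptions in the AlphaGo Zero paper (conv → batch norm → ReLU → conv → batch norm → add skip → ReLU). The layer names, the 17 input planes in the demo, and the small demo tower are my own illustrative choices, not DeepMind's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """3x3, 256-filter, stride-1 residual block: conv-BN-ReLU-conv-BN, skip, ReLU."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)            # identity skip connection

class Tower(nn.Module):
    """Convolution block ("stem") followed by a stack of residual blocks."""
    def __init__(self, in_planes: int, channels: int = 256, n_blocks: int = 39):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(n_blocks)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(self.stem(x))

# Tiny demo: a batch of 8x8 boards with 17 (illustrative) input feature planes;
# the full tower described above would use n_blocks=39.
net = Tower(in_planes=17, n_blocks=2)
out = net(torch.zeros(1, 17, 8, 8))
print(out.shape)                          # torch.Size([1, 256, 8, 8])
```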

AlphaZero DNN Architecture: Policy Head (for Go)

AlphaZero DNN Architecture: Value Head

AlphaZero Compared to Recent World Champions
