Reinforcement Learning with DNNs: AlphaGo to AlphaZero
CS 760: Machine Learning, Spring 2018
Mark Craven and David Page
www.biostat.wisc.edu/~craven/cs760
Goals for the Lecture
• You should understand the following concepts:
• Monte Carlo tree search (MCTS)
• Self-play
• Residual neural networks
• AlphaZero algorithm
A Brief History of Game-Playing as a CS/AI Test of Progress
• 1944: Alan Turing and Donald Michie simulate their chess algorithms by hand during lunches at Bletchley Park
• 1959: Arthur Samuel's checkers algorithm (machine learning)
• 1961: Michie's Matchbox Educable Noughts And Crosses Engine (MENACE)
• 1991: Computer solves a chess endgame thought to be a draw: KRB beats KNN (223 moves)
• 1992: TD-Gammon trains for Backgammon by self-play reinforcement learning
• 1997: Computers best in the world at chess (Deep Blue beats Kasparov)
• 2007: Checkers "solved" by computer (guaranteed optimal play)
• 2016: Computers best at Go (AlphaGo beats Lee Sedol)
• 2017 (4 months ago): AlphaZero extends AlphaGo to be best at chess and shogi
Only Some of these Involved Learning
• 1944: Alan Turing and Donald Michie simulate their chess algorithms by hand during lunches at Bletchley Park
• 1959: Arthur Samuel's checkers algorithm (machine learning)
• 1961: Michie's Matchbox Educable Noughts And Crosses Engine (MENACE)
• 1991: Computer solves a chess endgame thought to be a draw: KRB beats KNN (223 moves)
• 1992: TD-Gammon trains for Backgammon by self-play reinforcement learning
• 1997: Computers best in the world at chess (Deep Blue beats Kasparov)
• 2007: Checkers "solved" by computer (guaranteed optimal play)
• 2016: Computers best at Go (AlphaGo beats Lee Sedol)
• 2017 (4 months ago): AlphaZero extends AlphaGo to be best at chess and shogi
Background: Game Playing
• Until last year, the state of the art for many games, including chess, was minimax search with alpha-beta pruning (recall Intro to AI)
• Most top-performing game-playing programs didn't do learning
• The game of Go was one of the few games where humans still outperformed computers
Minimax in a Picture (thanks Wikipedia)
Monte Carlo Tree Search (MCTS) in a Picture (thanks Wikipedia)
Rollout (Random Search)
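To make the picture concrete, here is a self-contained sketch of MCTS with random rollouts on a toy game chosen only for brevity (Nim with 21 stones, take 1-3 per turn, last stone wins; the game and all names are assumptions for illustration, not code from the lecture). It shows the four phases from the picture: selection by an upper confidence bound, expansion, a random rollout, and backpropagation.

```python
import math
import random

# Toy game (an assumption): Nim with 21 stones; players alternate taking
# 1-3 stones, and whoever takes the last stone wins.
def moves(state):
    return [m for m in (1, 2, 3) if m <= state]

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                 # move -> child Node
        self.visits, self.value_sum = 0, 0.0

def ucb(parent, child, c=1.4):
    """Upper confidence bound (UCT) used for selection."""
    if child.visits == 0:
        return float("inf")
    return (child.value_sum / child.visits
            + c * math.sqrt(math.log(parent.visits) / child.visits))

def rollout(state):
    """Random play to the end; +1 if the player to move at `state` wins."""
    turn = 0
    while state > 0:
        state -= random.choice(moves(state))
        turn ^= 1
    return 1 if turn == 1 else -1          # the player who just moved won

def mcts(root_state, n_sims=5000):
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        # 1. Selection: descend by UCB while the node is fully expanded
        while node.state > 0 and len(node.children) == len(moves(node.state)):
            node = max(node.children.values(), key=lambda ch: ucb(node, ch))
        # 2. Expansion: add one untried child, if any remain
        untried = [m for m in moves(node.state) if m not in node.children]
        if untried:
            m = random.choice(untried)
            node.children[m] = Node(node.state - m, parent=node)
            node = node.children[m]
        # 3. Rollout (random search) from the new node
        result = rollout(node.state)
        # 4. Backpropagation: players alternate, so flip the sign per level
        value = -result                    # value for the player who moved here
        while node is not None:
            node.visits += 1
            node.value_sum += value
            value, node = -value, node.parent
    # Final move choice: the most-visited child of the root
    return max(root.children, key=lambda m: root.children[m].visits)

print(mcts(21))   # optimal play takes 1, leaving a multiple of 4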
Reinforcement Learning by AlphaGo, AlphaGo Zero, and AlphaZero: Key Insights
• MCTS with self-play
• Don't have to guess what the opponent might do, so...
• With no exploration, a big-branching game tree becomes one path
• You get an automatically improving, evenly matched opponent who is accurately learning your strategy
• "We have met the enemy, and he is us" (famous variant, from Pogo, 1954)
• No need for human expert scoring rules for boards from unfinished games
• Treat the board as an image: use a residual convolutional neural network
• AlphaGo Zero: one deep neural network learns both the value function and the policy in parallel
• AlphaZero: removed rollout altogether from MCTS and just used the current neural net's estimates instead
AlphaZero (Dec 2017): Minimized Required Game Knowledge, Extended from Go to Chess and Shogi
AlphaZero's Version of Q-Learning
• No discount on future rewards
• Rewards of 0 until the end of the game; then a reward of -1 or +1
• Therefore the Q-value for an action a or policy π from a state S is exactly the value function: Q(S, π) = V(S, π)
• AlphaZero uses one DNN (details in a bit) to model both π and V
• Updates to the DNN are made (training examples provided) after each game
• During a game, need to balance exploitation and exploration
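A one-line check of the Q = V claim (a sketch; z denotes the final game outcome, as on the algorithm slide below):

```latex
% With discount \gamma = 1 and all rewards 0 except the terminal outcome
% z \in \{-1, +1\}, every return is just z, so for any action a followed
% by policy \pi:
\[
Q^{\pi}(S, a)
  = \mathbb{E}\Big[\textstyle\sum_{t} \gamma^{t} r_{t} \,\Big|\, S, a, \pi\Big]
  = \mathbb{E}\big[\, z \mid S, a, \pi \,\big]
\quad\Longrightarrow\quad
Q(S, \pi) = V(S, \pi).
\]
```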
AlphaZero Algorithm

Initialize DNN f_θ
Repeat forever:
    Play Game
    Update θ

Play Game:
    Repeat until win or lose:
        From current state S, perform MCTS
        Estimate move probabilities π by MCTS
        Record (S, π) as an example
        Randomly draw the next move from π

Update θ:
    Let z be the previous game's outcome (+1 or -1)
    Sample from the last game's examples (S, π, z)
    Train DNN f_θ on the sample to get new θ
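A minimal Python sketch of this outer loop, under loudly stated assumptions: `mcts_policy`, `legal_moves`, `apply_move`, `game_outcome`, and `train_step` are hypothetical stand-ins for the real MCTS, game rules, and gradient step, not AlphaZero's actual components.

```python
import random

def play_game(net, start, mcts_policy, legal_moves, apply_move, game_outcome):
    """Self-play one game; return recorded (S, pi) examples and the outcome z."""
    examples, state = [], start
    while game_outcome(state) is None:        # repeat until win or lose
        pi = mcts_policy(net, state)          # move -> probability, via MCTS
        examples.append((state, pi))          # record (S, pi) as an example
        ms = legal_moves(state)
        move = random.choices(ms, weights=[pi[m] for m in ms])[0]
        state = apply_move(state, move)       # randomly drawn next move
    return examples, game_outcome(state)      # z is +1 or -1

def alphazero(net, start, mcts_policy, legal_moves, apply_move,
              game_outcome, train_step, n_games=10_000):
    for _ in range(n_games):                  # "repeat forever"
        examples, z = play_game(net, start, mcts_policy, legal_moves,
                                apply_move, game_outcome)
        # label every recorded position with the final outcome z (the full
        # algorithm credits z from the perspective of the player to move)
        net = train_step(net, [(S, pi, z) for (S, pi) in examples])
    return net
```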
AlphaZero Play-Game
AlphaZero Train-DNN
AlphaZero Monte Carlo Tree Search (MCTS)
Why Need MCTS At All?
• Could always make the move the DNN says has the highest Q: no exploration
• Could just draw a move from the DNN's policy output
• The papers say the MCTS output probability vector π selects stronger moves than just directly using the neural network's policy output itself (is there a possible lesson here for self-driving cars too??)
• Still need to decide how many times to repeat the MCTS search (game-specific) and how to trade off exploration and exploitation in MCTS... The AlphaZero paper just says to choose the move with "low count, high move probability, and high value"; the AlphaGo paper is more specific: maximize an upper confidence bound
• Where τ is a temperature [1, 2], and N(s,b) is the count of times action b has been taken from state s, raised to the power 1/τ, choose:

π(a | s) = N(s,a)^(1/τ) / Σ_b N(s,b)^(1/τ)
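A small sketch of that temperature rule applied to hypothetical visit counts (the counts and the function name are illustrative, not from the paper):

```python
import numpy as np

def move_probs(visit_counts, tau=1.0):
    """pi(a|s) = N(s,a)^(1/tau) / sum_b N(s,b)^(1/tau) over legal moves."""
    counts = np.asarray(visit_counts, dtype=float)
    if tau == 0:                        # tau -> 0 limit: play most-visited move
        probs = np.zeros_like(counts)
        probs[np.argmax(counts)] = 1.0
        return probs
    scaled = counts ** (1.0 / tau)      # raise counts to the power 1/tau
    return scaled / scaled.sum()        # normalize into a distribution

counts = [120, 40, 830, 10]             # hypothetical N(s, b) visit counts
print(move_probs(counts, tau=1.0))      # proportional to the counts
print(move_probs(counts, tau=0.5))      # sharper: favors the best move more
```

A higher τ keeps more exploration in the draw; a lower τ concentrates probability on the most-visited move.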
AlphaZero DNN Architecture: Input Nodes Represent Current Game State, Including Any Needed History
AlphaZero DNN Architecture: Output Nodes Represent Policy and Value Function
• A policy is a probability distribution over all possible moves from a state, so we need units to represent all possible moves
• Chess is the most complicated game for describing moves (though Go and Shogi have higher numbers of moves to consider), so here is the encoding for chess moves (a counting check appears below):
• 8 x 8 = 64 possible starting positions for a move
• 56 possible destinations for queen moves: 8 compass directions {N, NE, E, SE, S, SW, W, NW} times 7 possible move lengths
• Another 17 possible destinations for irregular moves such as knight moves
• Some moves are impossible, depending on the particular piece at a position (e.g., a pawn can't make all queen moves) and the location of other pieces (a queen can't move through 2 other pieces to attack a third)
• Weights for impossible moves are set to 0 and not allowed to change
• Another layer normalizes the results into a probability distribution
• One deep neural network learns both the value function and the policy in parallel: one additional output node for the value function, which estimates the expected outcome in the range [-1, 1] for following the current policy from the present (input) state
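A back-of-the-envelope check of the move-encoding arithmetic above, plus a sketch of the masking-and-normalizing step (the legal-move indices are hypothetical):

```python
import numpy as np

# Counting the policy outputs from the slide's chess encoding.
n_squares = 8 * 8                    # 64 possible starting positions
queen_moves = 8 * 7                  # 8 compass directions x 7 lengths = 56
irregular = 17                       # knight moves and other irregular moves
n_outputs = n_squares * (queen_moves + irregular)
print(n_outputs)                     # 64 * 73 = 4672 policy output units

# Zeroing out impossible moves, then normalizing into a distribution.
weights = np.exp(np.random.randn(n_outputs))   # raw (positive) policy scores
legal = np.zeros(n_outputs, dtype=bool)
legal[[0, 100, 4000]] = True                   # hypothetical legal-move indices
masked = np.where(legal, weights, 0.0)         # impossible moves get weight 0
probs = masked / masked.sum()                  # final probability distribution
print(probs.sum())                             # 1.0
```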
Deep Neural Networks Trick #9: ResNets (Residual Networks)
• What if your neural network is too deep?
• In theory, that's no problem, given sufficient nodes and connectivity: early (or late) layers can just learn the identity function (autoencoder)
• In practice, deep neural networks fail to learn the identity when needed
• A solution: make the identity easy, or even the default; the network has to work hard to actually learn a non-zero residual (and hence a non-identity)
Residual Network in a Picture (He, Zhang, Ren, Sun, 2015): Identity Skip Connection
Note: output and input dimensionality need to be the same.
Why called "residual"?
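The name comes from what the stacked layers learn: rather than the full mapping H(x), they learn only the residual F(x) = H(x) - x, and the skip connection adds x back. A minimal PyTorch sketch of one such block (an illustration in the spirit of He et al., not the AlphaZero code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two conv layers learn the residual F(x); the skip adds x back."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        h = F.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        # identity skip connection: input and output shapes must match
        return F.relu(h + x)

x = torch.randn(2, 256, 8, 8)        # e.g., 256 feature planes on an 8x8 board
print(ResidualBlock(256)(x).shape)   # torch.Size([2, 256, 8, 8])
```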
Deep Residual Networks (ResNets): Start of a 35-layer ResNet (He, Zhang, Ren, Sun, 2015)
Dotted line denotes an increase in dimension (2 more such increases)
A Brief Aside: Leaky ReLUs
• The rectifiers used could be ReLU or "Leaky ReLU"
• Leaky ReLU addresses the "dying ReLU" problem: when the input sum is below some value, the output is 0, so there is no gradient for training
• ReLU: f(x) = max(0, x)
• Leaky ReLU: f(x) = x for x > 0, and f(x) = αx otherwise, for a small slope α
Plots: ReLU vs. Leaky ReLU
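A tiny NumPy sketch of the two rectifiers (the 0.01 slope is a common default, an assumption rather than a value from the slides):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # zero output, zero gradient for x < 0

def leaky_relu(x, alpha=0.01):           # small slope keeps a nonzero gradient
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))                           # [ 0.     0.     0.     1.5 ]
print(leaky_relu(x))                     # [-0.02  -0.005  0.     1.5 ]
```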
AlphaZero DNN Architecture: Hidden Units Arranged in a Residual Network (a CNN with Residual Layers)
Conv Block: 3x3, 256 filters, stride 1
Res Block: 3x3, 256, stride 1
Res Block: 3x3, 256, stride 1
... repeat for 39 Res Blocks
The residual tower then feeds two heads: a Policy Head and a Value Head
AlphaZero DNN Architecture: Convolution Block
AlphaZero DNN Architecture: Residual Blocks
AlphaZero DNN Architecture: Policy Head (for Go)
AlphaZero DNN Architecture: Value Head
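Putting the preceding slides together, here is a compact PyTorch sketch of the overall shape: a convolution block, a tower of 39 residual blocks, and separate policy and value heads. The Go-style sizes (17 input feature planes, a 19x19 board, 361 moves plus pass) and the head layer widths are assumptions for illustration, not the published specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.c1, self.b1 = nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c)
        self.c2, self.b2 = nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c)

    def forward(self, x):
        h = F.relu(self.b1(self.c1(x)))
        return F.relu(self.b2(self.c2(h)) + x)   # identity skip connection

class AlphaZeroNet(nn.Module):
    def __init__(self, in_planes=17, c=256, n_blocks=39, board=19,
                 n_moves=19 * 19 + 1):            # Go: 361 points + pass
        super().__init__()
        self.conv = nn.Sequential(                # initial convolution block
            nn.Conv2d(in_planes, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU())
        self.tower = nn.Sequential(*[ResBlock(c) for _ in range(n_blocks)])
        # policy head: 1x1 conv down to 2 planes, then linear to move outputs
        self.p_conv = nn.Sequential(
            nn.Conv2d(c, 2, 1), nn.BatchNorm2d(2), nn.ReLU())
        self.p_fc = nn.Linear(2 * board * board, n_moves)
        # value head: 1x1 conv to 1 plane, hidden linear layer, tanh to [-1, 1]
        self.v_conv = nn.Sequential(
            nn.Conv2d(c, 1, 1), nn.BatchNorm2d(1), nn.ReLU())
        self.v_fc = nn.Sequential(
            nn.Linear(board * board, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        h = self.tower(self.conv(x))
        p = F.softmax(self.p_fc(self.p_conv(h).flatten(1)), dim=1)
        v = torch.tanh(self.v_fc(self.v_conv(h).flatten(1)))
        return p, v                               # move probabilities, value

net = AlphaZeroNet()
p, v = net(torch.randn(2, 17, 19, 19))
print(p.shape, v.shape)   # torch.Size([2, 362]) torch.Size([2, 1])
```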
AlphaZero Compared to Recent World Champions