
Page 1:

Reinforcement Learning with DNNs: AlphaGo to AlphaZero

CS 760: Machine Learning, Spring 2018

Mark Craven and David Page

www.biostat.wisc.edu/~craven/cs760


Page 2:

Goals for the Lecture

• You should understand the following concepts:

  • Monte Carlo tree search (MCTS)
  • Self-play
  • Residual neural networks
  • AlphaZero algorithm


Page 3:

A Brief History of Game-Playing as a CS/AI Test of Progress

• 1944: Alan Turing and Donald Michie simulate by hand their chess algorithms during lunches at Bletchley Park

• 1959: Arthur Samuel's checkers algorithm (machine learning)

• 1961: Michie's Matchbox Educable Noughts And Crosses Engine (MENACE)

• 1991: Computer solves a chess endgame thought to be a draw: KRB beats KNN (223 moves)

• 1992: TD-Gammon trains for Backgammon by self-play reinforcement learning

• 1997: Computers best in the world at Chess (Deep Blue beats Kasparov)

• 2007: Checkers "solved" by computer (guaranteed optimal play)

• 2016: Computers best at Go (AlphaGo beats Lee Sedol)

• 2017 (4 months ago): AlphaZero extends AlphaGo to be best at chess and shogi

Page 4:

Only Some of These Involved Learning

• 1944: Alan Turing and Donald Michie simulate by hand their chess algorithms during lunches at Bletchley Park

• 1959: Arthur Samuel's checkers algorithm (machine learning)

• 1961: Michie's Matchbox Educable Noughts And Crosses Engine (MENACE)

• 1991: Computer solves a chess endgame thought to be a draw: KRB beats KNN (223 moves)

• 1992: TD-Gammon trains for Backgammon by self-play reinforcement learning

• 1997: Computers best in the world at Chess (Deep Blue beats Kasparov)

• 2007: Checkers "solved" by computer (guaranteed optimal play)

• 2016: Computers best at Go (AlphaGo beats Lee Sedol)

• 2017 (4 months ago): AlphaZero extends AlphaGo to be best at chess and shogi


Page 6:

Background: Game Playing

• Until last year, the state of the art for many games including chess was minimax search with alpha-beta pruning (recall Intro to AI; a sketch follows at the end of this slide)

• Most top-performing game-playing programs didn't do learning

• The game of Go was one of the few games where humans still outperformed computers
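Since minimax with alpha-beta pruning is the baseline the later slides improve on, here is a minimal sketch of it; the GameState methods (is_terminal, evaluate, legal_moves, apply) are hypothetical, just to make the recursion concrete.

```python
def alphabeta(state, depth, alpha, beta, maximizing):
    """Minimax search with alpha-beta pruning.

    `state` is assumed to expose is_terminal(), evaluate(), legal_moves(),
    and apply(move) -- a hypothetical game-state interface.  Returns the
    minimax value of `state` from the maximizing player's point of view.
    """
    if depth == 0 or state.is_terminal():
        return state.evaluate()               # heuristic or exact game value

    if maximizing:
        value = float("-inf")
        for move in state.legal_moves():
            value = max(value, alphabeta(state.apply(move), depth - 1,
                                         alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:                 # opponent will never allow this branch
                break                         # beta cutoff
        return value
    else:
        value = float("inf")
        for move in state.legal_moves():
            value = min(value, alphabeta(state.apply(move), depth - 1,
                                         alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:                 # maximizer will never allow this branch
                break                         # alpha cutoff
        return value
```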

Page 7:

Minimax in a Picture (thanks Wikipedia)

Page 8:

Monte Carlo Tree Search (MCTS) in a Picture (thanks Wikipedia)

Rollout (Random Search)

Page 9:

Reinforcement Learning by AlphaGo, AlphaGo Zero, and AlphaZero: Key Insights

• MCTS with Self-Play
  • Don't have to guess what the opponent might do, so…
  • If no exploration, a big-branching game tree becomes one path
  • You get an automatically improving, evenly-matched opponent who is accurately learning your strategy
  • "We have met the enemy, and he is us" (a famous variant of a line from Pogo, 1954)
  • No need for human-expert scoring rules for boards from unfinished games

• Treat the board as an image: use a residual convolutional neural network

• AlphaGo Zero: One deep neural network learns both the value function and the policy in parallel

• AlphaZero: Removed rollout altogether from MCTS and just used the current neural net's estimates instead

Page 10:

AlphaZero (Dec 2017): Minimized Required Game Knowledge, Extended from Go to Chess and Shogi

Page 11:

AlphaZero’s versionofQ-Learning

• No discount on future rewards

• Rewards of 0 until the end of the game; then a reward of -1 or +1

• Therefore the Q-value for an action a or policy π from a state s is exactly the value function: Q(s, π) = V(s, π) (see the sketch at the end of this slide)

• AlphaZero uses one DNN (details in a bit) to model both π and V

• Updates to the DNN are made (training examples provided) after each game

• During the game, need to balance exploitation and exploration
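A small illustration of why the undiscounted, sparse-reward setup makes every value target just the final game outcome: each position recorded during a finished game is labeled with the same z (±1, seen from the perspective of the player to move). The helper below is an illustrative sketch, not code from the AlphaZero paper.

```python
def label_game_positions(states, winner):
    """Attach value targets to every recorded position of one finished game.

    states : list of (player_to_move, position) pairs in play order,
             where player_to_move is +1 for the first player, -1 for the second
    winner : +1 if the first player won, -1 if the second player won, 0 for a draw

    With no discounting and all intermediate rewards equal to 0, the return
    from any position is simply the final outcome, viewed from the
    perspective of the player to move at that position.
    """
    examples = []
    for player_to_move, position in states:
        z = winner * player_to_move      # flip the sign for the second player
        examples.append((position, z))
    return examples
```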

Page 12:

AlphaZero Algorithm

Initialize DNN f_θ
Repeat Forever:
    Play Game
    Update θ

Play Game:
    Repeat Until Win or Lose:
        From current state s, perform MCTS
        Estimate move probabilities π by MCTS
        Record (s, π) as an example
        Randomly draw next move from π

Update θ:
    Let z be the previous game outcome (+1 or -1)
    Sample from the last game's examples (s, π, z)
    Train DNN f_θ on the sample to get new θ

Page 13:

AlphaZero Play-Game

Page 14:

AlphaZero Train DNN

Page 15:

AlphaZero Monte Carlo Tree Search (MCTS)

Page 16:

Why Need MCTS At All?

• Could always make the move the DNN says has highest Q: no exploration
• Could just draw a move from the DNN's policy output
• Papers say the MCTS output probability vector π selects stronger moves than just directly using the neural network's policy output itself (is there a possible lesson here for self-driving cars too??)
• Still need to decide how many times to repeat the MCTS search (game-specific) and how to trade off exploration and exploitation in MCTS… The AlphaZero paper just says choose the move with "low count, high move probability, and high value"; the AlphaGo paper is more specific: maximize an upper confidence bound
• Where τ is a temperature [1, 2], and N(s, b) is the count of times action b has been taken from state s, raised to the power 1/τ, choose move a with probability (a code sketch of this rule follows):

  π(a | s) = N(s, a)^(1/τ) / Σ_b N(s, b)^(1/τ)
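A minimal sketch of the visit-count/temperature rule above, assuming MCTS has already produced a visit count for each legal move (the example counts are made up):

```python
import random

def select_move_from_counts(visit_counts, tau=1.0):
    """Pick a move from MCTS visit counts using temperature tau.

    visit_counts : dict mapping each legal move to N(s, move)
    tau          : temperature; tau -> 0 approaches greedy play (argmax count),
                   larger tau gives more exploration.
    """
    moves = list(visit_counts)
    weights = [visit_counts[m] ** (1.0 / tau) for m in moves]
    total = sum(weights)
    probs = [w / total for w in weights]   # pi(a|s) = N(s,a)^(1/tau) / sum_b N(s,b)^(1/tau)
    return random.choices(moves, weights=probs)[0]

# Example with hypothetical visit counts for three candidate moves
counts = {"e2e4": 620, "d2d4": 300, "g1f3": 80}
print(select_move_from_counts(counts, tau=1.0))   # usually "e2e4", but not always
```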

Page 17:

AlphaZero DNN Architecture: Input Nodes Represent Current Game State, Including Any Needed History

Page 18:

AlphaZero DNN Architecture: Output Nodes Represent Policy and Value Function

• A policy is a probability distribution over all possible moves from a state, so we need units to represent all possible moves

• Chess is the most complicated for describing moves (though Go and Shogi have higher numbers of moves to consider), so here is the encoding for Chess moves (see the sketch at the end of this slide):
  • 8 x 8 = 64 possible starting positions for a move
  • 56 possible destinations for queen-style moves: 8 compass directions {N, NE, E, SE, S, SW, W, NW} times 7 possible move lengths
  • Another 17 possible destinations for irregular moves such as knight moves
  • Some moves are impossible, depending on the particular piece at a position (e.g., a pawn can't make all queen moves) and the locations of other pieces (a queen can't move through 2 other pieces to attack a third)
  • Weights for impossible moves are set to 0 and not allowed to change
  • Another layer normalizes the results into a probability distribution

• One deep neural network learns both the value function and the policy in parallel: one additional output node for the value function, which estimates the expected outcome in the range [-1, 1] for following the current policy from the present (input) state
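A small sketch of the move-encoding arithmetic and of suppressing impossible moves before normalizing. Note that this sketch masks the outputs rather than fixing weights at zero as the slide describes; the legality mask is random here purely for illustration.

```python
import numpy as np

# Size of the chess policy output described above:
# 64 starting squares x (56 queen-style + 17 irregular destinations)
num_from_squares = 8 * 8            # 64
num_move_types = 56 + 17            # 73
policy_size = num_from_squares * num_move_types
print(policy_size)                  # 4672 policy output units

def masked_policy(logits, legal_mask):
    """Zero out impossible moves, then renormalize into a probability distribution.

    logits     : raw scores from the policy output layer, shape (policy_size,)
    legal_mask : 1.0 for legal moves, 0.0 for impossible ones
    """
    exp = np.exp(logits - logits.max()) * legal_mask   # softmax restricted to legal moves
    return exp / exp.sum()

# Illustration with random logits and a random legality mask
rng = np.random.default_rng(0)
logits = rng.normal(size=policy_size)
legal_mask = (rng.random(policy_size) < 0.01).astype(float)
legal_mask[0] = 1.0                                    # ensure at least one legal move
probs = masked_policy(logits, legal_mask)
print(round(probs.sum(), 6))                           # 1.0
```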

Page 19:

Deep Neural Networks Trick #9: ResNets (Residual Networks)

• What if your neural network is too deep?

• In theory, that's no problem, given sufficient nodes and connectivity: early (or late) layers can just learn the identity function (autoencoder)

• In practice, deep neural networks fail to learn the identity when needed

• A solution: make the identity easy, or even the default; the network has to work hard to actually learn a non-zero residual (and hence a non-identity); see the sketch below
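A minimal NumPy sketch of the idea, using a fully-connected block rather than the convolutional blocks AlphaZero actually uses. The block computes F(x) + x, so producing the identity only requires driving F toward zero; the part that is learned is the residual F(x), which is where the name comes from.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, b1, W2, b2):
    """y = relu(x + F(x)): the skip connection makes the identity the easy default.

    The input, the output, and both layers share the same dimensionality,
    as the later slide notes they must.
    """
    f = relu(x @ W1 + b1)        # first layer of the residual branch
    f = f @ W2 + b2              # second layer (no activation before the add)
    return relu(x + f)           # add the skip connection, then activate

# If the weights of F are (near) zero, the block is (near) the identity:
dim = 8
x = np.random.randn(dim)
zero_layer = (np.zeros((dim, dim)), np.zeros(dim))
print(np.allclose(residual_block(x, *zero_layer, *zero_layer), relu(x)))   # True
```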

Page 20:

Residual Network in a Picture (He, Zhang, Ren, Sun, 2015): Identity Skip Connection

Note: output and input dimensionality need to be the same.

Why called "residual"?

Page 21:

Deep Residual Networks (ResNets): Start of a 35-layer ResNet (He, Zhang, Ren, Sun, 2015)

Dotted line denotes an increase in dimension (2 more such increases)

Page 22:

A Brief Aside: Leaky ReLUs

• The rectifiers used could be ReLU or "Leaky ReLU"

• Leaky ReLU addresses the "dying ReLU" problem: when the input sum is below some value, the output is 0, so there is no gradient for training

• ReLU: f(x) = max(0, x)

• Leaky ReLU: f(x) = x if x > 0, else αx for a small constant α (e.g., 0.01); see the sketch below

[Plots of the ReLU and Leaky ReLU activation functions]
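A tiny NumPy sketch of the two activations; the leak coefficient alpha = 0.01 is just a common illustrative choice.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x): zero output (and zero gradient) for negative inputs."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """f(x) = x for x > 0, alpha * x otherwise: keeps a small gradient alive."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))         # [0.  0.  0.  0.5 2. ]
print(leaky_relu(x))   # [-0.02  -0.005  0.  0.5  2. ]
```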

Page 23:

AlphaZero DNN Architecture: Hidden Units Arranged in a Residual Network (a CNN with Residual Layers)

Conv Block (3x3, 256 filters, /1)
Res Block (3x3, 256, /1)
Res Block (3x3, 256, /1)
... repeat for 39 Res Blocks
The tower then splits into a Policy Head and a Value Head (a code sketch follows)
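A compact PyTorch-style sketch of such a tower. The specific hyperparameters here (number of input planes, head widths) are assumptions for illustration, not the exact DeepMind configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """3x3, 256-filter residual block: conv-BN-ReLU-conv-BN plus a skip connection."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)                      # identity skip connection

class PolicyValueNet(nn.Module):
    """Conv block, a tower of residual blocks, then policy and value heads."""
    def __init__(self, in_planes=119, channels=256, n_blocks=39,
                 board=8, n_moves=4672):
        super().__init__()
        self.stem = nn.Sequential(                  # the initial convolution block
            nn.Conv2d(in_planes, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.tower = nn.Sequential(*[ResBlock(channels) for _ in range(n_blocks)])
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.BatchNorm2d(2), nn.ReLU(),
            nn.Flatten(), nn.Linear(2 * board * board, n_moves))
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.BatchNorm2d(1), nn.ReLU(),
            nn.Flatten(), nn.Linear(board * board, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh())           # value estimate in [-1, 1]

    def forward(self, x):
        h = self.tower(self.stem(x))
        return self.policy_head(h), self.value_head(h)
```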

Page 24:

AlphaZero DNN Architecture: Convolution Block

Page 25:

AlphaZero DNN Architecture: Residual Blocks

Page 26:

AlphaZero DNN Architecture: Policy Head (for Go)

Page 27:

AlphaZero DNN Architecture: Value Head

Page 28:

AlphaZero Compared to Recent World Champions