Asynchronous Methods for Deep Reinforcement Learning
Paper by Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu
Presented by: Pihel Saatmann


Page 1:

Asynchronous Methods for Deep Reinforcement Learning

Paper by Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu

Presented by: Pihel Saatmann

Page 2:

Reinforcement learning

• State – "snapshot" of the environment

• Action – leads to a new state, sometimes a reward

• Reward – time-delayed, sparse

• Policy – rules for choosing an action
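These four terms fit together in the standard interaction loop. A minimal runnable sketch on a toy environment of our own (nothing here is from the slides):

import random

def env_step(state, action):
    # Toy dynamics: three states in a ring; reward only on reaching state 2 (sparse).
    new_state = (state + action) % 3         # the action leads to a new state...
    reward = 1.0 if new_state == 2 else 0.0  # ...and sometimes a reward
    return new_state, reward, new_state == 2

def policy(state):
    return random.choice([0, 1])             # policy: rule for choosing an action

state, done = 0, False                       # state: a "snapshot" of the environment
while not done:
    action = policy(state)
    state, reward, done = env_step(state, action)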

Page 3:

So far

• It was thought that online RL algorithms with deep NNs are unstable.

• The problem: correlated and non-stationary input data.

• To counter these problems, data can be stored in an experience replay memory.

• This uses more memory and computational power.

• Deep RL methods require specialized hardware (GPUs) or massive distributed architectures.

Page 4:

Q-learning

• At each time step t, the agent receives a state s_t and selects an action a_t according to its policy π. The agent then receives the next state s_{t+1} and a scalar reward r_t.

• The goal is to maximize the expected return from each state s_t.

• The Q function estimates the action's value.

• Each time the agent takes an action, the Q value is updated.

• Off-policy method – updating the Q function does not depend on the policy being followed.
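Not on the slide, but for concreteness, the classic tabular form of this update is

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right)

The max over a' (rather than the action the policy actually takes next) is what makes Q-learning off-policy.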

Page 5:

Asynchronous RL framework

• Instead of experience replay, multiple agents are executed asynchronously in parallel, each on its own instance of the environment.

• Parallel actor-learners have a stabilizing effect on training.

• Runs on a single machine with a standard multi-core CPU.

Page 6:

Asynchronous RL framework II

• Async variants of four standard RL algorithms:
  • 1-step Q-learning
  • n-step Q-learning
  • 1-step Sarsa
  • Advantage actor-critic (A3C)

Page 7:

1-step Q-learning

• A NN is used to approximate the function Q(s, a; Θ).

• The parameters (weights) Θ are learned by iteratively minimizing a sequence of loss functions, where the i-th loss function is defined as:
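The loss itself was shown as an image that did not survive extraction; reconstructed from the paper, it is

L_i(\Theta_i) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \Theta^-) - Q(s, a; \Theta_i) \right)^2 \right]

where Θ^- are the parameters of a separate, slowly updated target network.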

Page 8:

Async 1-step Q-learning

• Each thread has its own copy of the environment.

• At each step it computes a gradient of the Q-learning loss.

• Gradients are accumulated over multiple time steps before being applied.

• A shared and slowly changing target network is used.
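To make the scheme concrete, here is a self-contained toy sketch: tabular values instead of a neural network, on a 5-state chain MDP we made up, but with the same moving parts (per-thread environment, accumulated updates, shared and slowly changing target). A sketch under those assumptions, not the paper's implementation:

import threading
import numpy as np

N_STATES, N_ACTIONS = 5, 2
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1
T_MAX = 20000

Q = np.zeros((N_STATES, N_ACTIONS))   # shared value table (stands in for the NN)
Q_target = Q.copy()                   # shared, slowly changing target
T = [0]                               # shared global step counter

def step(s, a):
    # Chain MDP: action 0 moves left, action 1 moves right; reward at the right end.
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = s2 == N_STATES - 1
    return s2, float(done), done

def actor_learner(seed):
    global Q_target
    rng = np.random.default_rng(seed)
    s = 0
    delta = np.zeros_like(Q)          # thread-local accumulated updates
    while T[0] < T_MAX:
        # epsilon-greedy behavior policy
        a = rng.integers(N_ACTIONS) if rng.random() < EPS else int(Q[s].argmax())
        s2, r, done = step(s, a)
        y = r if done else r + GAMMA * Q_target[s2].max()  # 1-step Q-learning target
        delta[s, a] += ALPHA * (y - Q[s, a])               # accumulate, don't apply yet
        s = 0 if done else s2
        T[0] += 1                     # benign race, tolerated in the async framework
        if T[0] % 1000 == 0:
            Q_target = Q.copy()       # slowly changing shared target
        if done or T[0] % 5 == 0:
            Q += delta                # apply accumulated updates, lock-free
            delta[:] = 0

threads = [threading.Thread(target=actor_learner, args=(i,)) for i in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(np.round(Q, 2))                 # action 1 (move right) should dominate

Lock-free shared updates occasionally overwrite one another; the asynchronous framework tolerates these benign races in exchange for speed.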

Page 9:

Asynchronous 1-step Sarsa

• Same as 1-step Q-learning, but uses a different target value:
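The target value was shown as an image; reconstructed from the paper, it is

r + \gamma \, Q(s', a'; \Theta^-)

where a' is the action actually taken in the next state s'. Using the taken action instead of the maximizing one is what makes Sarsa on-policy.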

Page 10:

Asynchronous n-step Q-learning

• Potentially faster way to propagate rewards.

• Uses a 'forward view' – selects actions using its policy for up to n steps into the future.

• Receives up to t_max rewards since the last update.

• Total accumulated return (reconstructed below):

• The value function is updated after every t_max actions or after reaching a terminal state.

• For each update, uses the longest possible n-step return.
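The return formula did not survive extraction; reconstructed from the paper, the n-step target is

R_t = \sum_{i=0}^{n-1} \gamma^i r_{t+i} + \gamma^n \max_{a} Q(s_{t+n}, a; \Theta^-)

so within one update the most recent state receives a 1-step return, the state before it a 2-step return, and so on up to a t_max-step return.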

Page 11:

Page 12:

Asynchronous advantage actor-critic

• On-policy method – maintains a policy and an estimated value function.

• Uses a 'forward view'.

• Receives up to t_max rewards since the last update.

• Policy and value functions are updated after every t_max actions or after reaching a terminal state.

• For each update, uses the longest possible n-step return.
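The update rules were shown as images; reconstructed from the paper, the advantage of action a_t in state s_t is estimated with the n-step return

A(s_t, a_t; \theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v), \qquad k \le t_{\max}

the policy parameters \theta' are updated along the gradient

\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta') \, A(s_t, a_t; \theta, \theta_v)

and the value parameters \theta_v are trained to minimize the squared error between the n-step return and V(s_t; \theta_v). The paper also adds an entropy regularization term on the policy to discourage premature convergence.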

Page 13:

Page 14:

Performance evaluation

• Four different platforms:
  • Atari 2600 – various games
  • TORCS – a 3D car racing simulator
  • MuJoCo – a physics simulator for continuous motor control (A3C only)
  • Labyrinth – finding rewards in randomly generated 3D mazes (A3C only)

Page 15:

Atari 2600 games

• All four methods can successfully train NN controllers.

• The async methods are mostly faster than DQN (Deep Q-Network).

• Advantage actor-critic was the best.

Page 16:

A3C on 57 Atari games

Page 17:

TORCS Car Racing Simulator

• Evaluated only the A3C algorithm.

• The agent had to drive a race car using only raw pixels as input.

• During training, the agent was rewarded for maintaining high velocity along the center of the race track.

https://youtu.be/0xo1Ldx3L5Q

Page 18:

Page 19:

MuJoCo Physics Simulator

• Evaluated only the A3C algorithm.

• Rigid-body physics with contact dynamics.

• Continuous actions.

• In all problems A3C found good solutions in less than 24 hours of training (typically a few hours).

https://youtu.be/0xo1Ldx3L5Q

Page 20:

Labyrinth

• The agent was placed in a random maze and had 60 s to collect points.
  • Apples – 1 point
  • Portals – 10 points; collecting one respawned the apples and teleported the agent to a random location

• Visual input only.

• The agent learned a reasonably good general strategy for exploring random mazes.

https://youtu.be/nMR5mjCFZCw

Page 21:

Scalability

• The framework scales well with the number of parallel workers.

• It even shows superlinear speedups for some methods, likely because multiple actor-learners reduce the bias in the one-step methods' updates.

Page 22:

Page 23:

Page 24:

Robustness and stability

• Trained models on five games using 50 different learning rates and random initializations.

• Each game and algorithm combination had a range of learning rates for which all random initializations achieved good scores.

• Stability is indicated by there being virtually no scores of 0 in the regions with good learning rates.

Page 25:

To summarize

• Asynchronous variants of four standard reinforcement learning algorithms (1-step Q, n-step Q, 1-step SARSA, A3C).

• Able to train neural network controllers on a variety of domains in a stable manner.

• Using parallel actor-learners to update a shared model stabilized the learning process (an alternative to experience replay).

• On Atari games, the advantage actor-critic (A3C) surpassed the previous state of the art in half the training time.

• Superlinear speedup when increasing the thread count for 1-step methods.