Reinforcement Learning for CPS Safety Engineering
koclab.cs.ucsb.edu/cpsed/files/green1.pdf

Post on 22-Sep-2020


Reinforcement Learning for CPS Safety Engineering

Sam Green, Çetin Kaya Koç, Jieliang Luo
University of California, Santa Barbara

Motivations

Which safety-critical duties do we want CPS to perform?

• Autonomous vehicle control: UAVs, passenger vehicles, delivery trucks
• Automatically responding to, or preventing, damage
• Industrial robot control for use around humans
• Large process automation
  • E.g., optimization of a factory

Reinforcement Learning

Georgia Tech, https://www.youtube.com/watch?v=f2at-cqaJMM

DeepMind, https://arxiv.org/abs/1707.02286

Machine Learning

Supervised Unsupervised Reinforcement

Introduction to RL

• A computational approach to learning from interaction
• Established in the 1980s
• Objective is to take actions that maximize a reward (or minimize a cost)
• Seen as a path toward Artificial General Intelligence

• RL is at the intersection of:
  • Psychology
  • Control Theory
  • Computer Science / AI

• Resurgence with the advent of deep learning methods

[Mnih et al. Asynchronous Methods for Deep Reinforcement Learning, 2016]

Advances in RL since 2015

(Figure: timeline of milestone RL results, 2015–2016)

Terminology

• Agent – The thing we are learning to control
• Environment – All the factors affecting the agent
• Action – Performed by the agent in an attempt to effect change on the environment
• Reward – Returned by the environment to the agent after the agent takes an action. Used to help the agent learn.
  • AKA the negative cost

[R. Sutton and A. Barto. Reinforcement Learning: An Introduction. 2016]
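The terminology above can be sketched as a minimal agent-environment interaction loop. The dynamics, goal position, and always-forward policy below are hypothetical toy choices made for illustration only:

```python
def environment_step(state, action):
    """Toy environment: the agent tries to reach position 10.
    Hypothetical dynamics, for illustration only."""
    next_state = state + action          # the action changes the environment
    reward = -1                          # -1 per step (a cost), as in the slides
    done = next_state >= 10
    return next_state, reward, done

def agent_policy(state):
    """Placeholder policy: always step forward."""
    return 1

# The agent-environment loop: act, observe reward, repeat until done.
state, total_reward, done = 0, 0, False
while not done:
    action = agent_policy(state)
    state, reward, done = environment_step(state, action)
    total_reward += reward
```

The accumulated `total_reward` is exactly the (negative) cost the agent would learn to minimize.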

Markov Decision Process

• What RL solves
• Environments where the agent's decisions depend only on the present
  • An object in flight
  • A self-driving car
  • A manufacturing process
  • Robot control

• It's not that the past doesn't matter, but the laws of physics guarantee certain things, e.g., momentum
• Methods also exist to solve approximate MDPs

Example: Student Markov Chain

[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf]

Start here at the beginning of each episode
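A Markov chain like this can be simulated directly. The states and transition probabilities below follow the Student Markov Chain from Silver's lecture slides (cited above); "Sleep" is the terminal state:

```python
import random

# Transition probabilities for the Student Markov Chain
# (from David Silver's RL lecture slides; "Sleep" is terminal).
P = {
    "Class1":   [("Class2", 0.5), ("Facebook", 0.5)],
    "Facebook": [("Facebook", 0.9), ("Class1", 0.1)],
    "Class2":   [("Class3", 0.8), ("Sleep", 0.2)],
    "Class3":   [("Pass", 0.6), ("Pub", 0.4)],
    "Pass":     [("Sleep", 1.0)],
    "Pub":      [("Class1", 0.2), ("Class2", 0.4), ("Class3", 0.4)],
}

def sample_episode(start="Class1", seed=None):
    """Run one episode from `start` until the terminal Sleep state."""
    rng = random.Random(seed)
    state, episode = start, [start]
    while state != "Sleep":
        state, = rng.choices([s for s, _ in P[state]],
                             weights=[p for _, p in P[state]])
        episode.append(state)
    return episode
```

Because the next state depends only on the current one, this is exactly the Markov property the slide refers to.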

RL for CPS Safety Engineering

• Its interdisciplinary nature makes RL interesting for CPS engineering
  • AI, ML (Math, Statistics)
  • Mechanics design and simulation (ME, Physics, CS)
  • Programming and implementation (CS, EE)

Mountain Car Example

• Agent is an underpowered car with 3 actions:
  • Backward, Neutral, Forward

• Reward := -1 per time step
• Implicit goal := Reach the flag as fast as possible

• State := x-position and velocity

Canonical example: Mountain Car

[R. Sutton and A. Barto. Reinforcement Learning: An Introduction. 2016]
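A minimal sketch of the mountain-car dynamics. The coefficients and bounds below are the standard ones from Sutton and Barto's formulation, stated here as an assumption rather than taken from these slides:

```python
import math

# Minimal mountain-car dynamics following Sutton & Barto's formulation;
# constants and bounds are the standard ones from the book.
def step(position, velocity, action):
    """One time step. action is -1 (backward), 0 (neutral), or 1 (forward)."""
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    position = max(-1.2, min(0.6, position))
    if position == -1.2:          # hitting the left wall zeroes velocity
        velocity = 0.0
    reward = -1                   # -1 per time step, as stated above
    done = position >= 0.5        # the flag
    return position, velocity, reward, done
```

The car is "underpowered" because the engine term (0.001) is weaker than gravity on the slope (0.0025), so the agent must rock backward to build momentum before driving forward.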

Model-Free Control via Policy-Based RL
• A simple physics model determines the behavior of the car
  • Captures the position of the car on the hill
  • Captures the effect of limited engine power

• Using a physics model simplifies the approach
  • Use an efficient traditional controller

• But in many scenarios the model is not available or is too complex
  • Amazon package delivery drone

• Solve mountain car using a sophisticated method as a toy example
  • Directly train a neural network-based policy

RL Terminology and Notation

• 𝑆𝑡 – State of the environment at time 𝑡
  • x-axis position and velocity

• 𝐴𝑡 – Action taken by the agent at time 𝑡
  • Backward, Neutral, Forward

• 𝜋 – The policy function; returns the next action to take. Stochastic in this example
• 𝜃 – A parameter vector for the policy; i.e., the weights learned in a neural network

Putting everything together: 𝐴𝑡+1 ~ 𝜋𝜃(𝐴𝑡, 𝑆𝑡) = 𝑃(𝐴𝑡 | 𝑆𝑡, 𝜃)

The policy 𝜋𝜃
• 𝜋𝜃 is often approximated
• Deep neural networks are powerful for approximation
• We will use gradient ascent to optimize the DNN

The policy function 𝜋𝜃, approximated by a NN

• State information at time 𝑡:
  • Position and Velocity

• Action options at time 𝑡:
  • Forward acceleration
  • Neutral
  • Backward acceleration

(Diagram: inputs Position and Velocity feed 𝜋𝜃, which outputs Prob(F), Prob(N), Prob(B))
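The 2-input, 3-output policy network can be sketched as a tiny feed-forward net. The hidden-layer size and random weights below are illustrative assumptions, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer policy network pi_theta:
# (position, velocity) -> (Prob(F), Prob(N), Prob(B)).
# theta is the set of weights {W1, b1, W2, b2}; sizes are illustrative.
W1, b1 = rng.normal(scale=0.1, size=(2, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 3)), np.zeros(3)

def policy(state):
    """Return action probabilities for state = [position, velocity]."""
    h = np.tanh(state @ W1 + b1)          # hidden layer
    logits = h @ W2 + b2                  # one score per action
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

probs = policy(np.array([-0.5, 0.0]))     # e.g., the car at rest in the valley
```

The softmax output is a probability distribution over the three actions, which is what makes the policy stochastic.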

Reward function
• At every time step take an action
  • Forward, neutral, or backward
• Each action has a reward of -1
• Train the agent to reach the flag in the minimum number of time steps

Example: Markov Reward Process

[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf]

Start here at the beginning of each episode
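In a Markov reward process each step also emits a reward, and the quantity of interest is the discounted return G_t = R_{t+1} + γ·G_{t+1}. A minimal sketch of computing it backwards over an episode:

```python
def returns(rewards, gamma=0.9):
    """Discounted return G_t for every step of an episode:
    G_t = R_{t+1} + gamma * G_{t+1}, computed backwards from the end."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))
```

With the mountain-car reward of -1 per step and γ = 1, the return from the start of an episode is simply minus the episode length, which is why maximizing return means reaching the flag fast.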

How to train the NN?

• Small networks can be effectively trained with genetic algorithms
• Genetic algorithms work poorly with large networks (the parameter space is too large)
• Gradient-ascent optimization works with large parameter spaces

(Diagram: inputs Position and Velocity feed 𝜋𝜃, which outputs Prob(F), Prob(N), Prob(B))

Monte-Carlo Policy Gradient (REINFORCE)

• Find a DNN parameter vector 𝜃 such that 𝜋𝜃 maximizes the reward
• For every episode, until the flag is reached:
  • Get state information (position & velocity) from the environment
  • Feed the NN with the state information
  • The NN will output a probability for (F)orward, (N)eutral, and (B)ackward
  • Randomly select action F, N, or B (using the above probabilities)
  • Store the state information and the action taken

• Once the flag is reached:
  • Assign the most reward to the last action … the least reward to the first action
  • Update 𝜃 s.t. actions made at the end are more probable

[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html]
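The loop above can be sketched end-to-end on a toy stand-in problem. To keep the sketch small and stable, it substitutes a 5-state corridor for mountain car, a tabular softmax policy for the DNN, and a terminal reward of +1 with discounting (which also prefers fast solutions) for the per-step -1; all of these are this sketch's assumptions, not the authors' setup:

```python
import math
import random

random.seed(0)

# Toy stand-in: a corridor of states 0..4 where the "flag" is state 4.
N_STATES, GOAL, GAMMA, ALPHA = 5, 4, 0.9, 0.1
theta = [[0.0, 0.0] for _ in range(N_STATES)]  # preferences: [left, right]

def pi(s):
    """Softmax policy: action probabilities from the preferences theta[s]."""
    m = max(theta[s])
    e = [math.exp(v - m) for v in theta[s]]
    z = sum(e)
    return [x / z for x in e]

def run_episode():
    """Sample actions from pi, store (state, action), stop at the goal."""
    s, traj = 0, []
    for _ in range(50):                       # cap episode length
        a = 0 if random.random() < pi(s)[0] else 1
        traj.append((s, a))
        s = max(0, min(GOAL, s - 1 if a == 0 else s + 1))
        if s == GOAL:
            return traj, True
    return traj, False

# REINFORCE: after each episode, push up the log-probability of the
# actions taken, weighted by the discounted return that followed them.
for _ in range(300):
    traj, reached = run_episode()
    G = 1.0 if reached else 0.0               # terminal reward only
    for s, a in reversed(traj):
        p = pi(s)
        for b in (0, 1):                      # d/dtheta log pi = 1[b==a] - p[b]
            theta[s][b] += ALPHA * G * ((1.0 if b == a else 0.0) - p[b])
        G *= GAMMA                            # earlier actions get less credit
```

After training, actions near the goal (and, with discounting, everywhere along the shortest path) become more probable, which is exactly the "strengthen the last moves" idea on this slide.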

Monte-Carlo Policy Gradient

• The method leverages machinery created for supervised learning
  • Inputs := the state information (position, velocity)
  • Predictions := the forward, neutral, or backward action taken
  • Labels ("ground truth") := After the episode is over, assign the most value to the last actions and the least value to the first actions

• Run many episodes; after each episode finishes (the flag is reached), strengthen the network such that the last moves become more probable

[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html]

Gradient Ascent

• Gradient algorithms find a local extremum
• At the end of each episode, adjust each parameter in 𝜃 s.t. actions made near the end are strengthened
• How much and in which direction to move each parameter is determined by the backpropagation method

(Plot: episode-reward surface over parameters 𝜃1 and 𝜃2)
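The update rule itself is plain gradient ascent on the parameters; in RL the gradient comes from backpropagation, but the rule is the same. A generic sketch on a toy two-parameter objective (the objective and learning rate are illustrative):

```python
def grad_ascent(grad, theta, lr=0.1, steps=100):
    """Generic gradient ascent: repeatedly move theta along the gradient."""
    for _ in range(steps):
        g = grad(theta)
        theta = [t + lr * gi for t, gi in zip(theta, g)]
    return theta

# Toy objective with a single maximum at (3, -1):
# f(theta) = -(theta1 - 3)^2 - (theta2 + 1)^2
grad_f = lambda th: [-2 * (th[0] - 3), -2 * (th[1] + 1)]
theta = grad_ascent(grad_f, [0.0, 0.0])
```

The same sign convention applies in REINFORCE: ascending the expected-reward surface, rather than descending a loss, is why the slides say gradient *ascent*.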

Caveats

• Deep RL is usually slow to learn

• Transferring knowledge from one problem to another is difficult

• The reward function can be complex

Safety and Security Considerations


• DNNs are black-box models
  • It is possible to give an input which causes the DNN to produce wild output

• Efforts to mitigate this limitation
  • E.g., Constrained Policy Optimization

Constrained Policy Optimization

• Textbook RL specifies only the reward function
• Problem: when an agent is learning, it may try anything
  • Potentially unsafe when training is in a physical environment

• Constraints can be added to the objective function

[Achiam et al. "Constrained Policy Optimization", 2017]
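The constrained objective in Achiam et al. takes the form of return maximization subject to bounds on expected auxiliary costs:

```latex
\max_{\theta} \; J(\theta)
\quad \text{s.t.} \quad
J_{C_i}(\theta) \le d_i, \quad i = 1, \dots, m
```

Here $J(\theta)$ is the expected return under $\pi_\theta$, each $J_{C_i}(\theta)$ is an expected discounted cost for one safety constraint, and $d_i$ is its allowed limit, so unsafe behaviors are penalized throughout training rather than only through the reward.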

Current Efforts

Developing RL for Quadcopter Control
• A good case study for complex autonomous CPS
  • Collision avoidance
  • Target tracking
  • Package delivery

• Using open-source firmware and hardware

Using Microsoft AirSim for 1st-order learning

[S. Shah et al. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. 2017]

Conclusions

• RL is a generalizable method to tackle many CPS decision-making problems
• High-capacity models can make sophisticated decisions

• A good approach for CPS education, because of its interdisciplinary nature

• Open problems remain when using black-box functions for safety applications

Questions?
