AN INTRODUCTION TO REINFORCEMENT LEARNING IN ROBOTICS
Bojan Nemec, Jožef Stefan Institute
Humanoid robots
A humanoid robot is a robot with an appearance similar to the human body, allowing interaction with made-for-human tools or environments. Desired capabilities include:
• self-maintenance
• autonomous learning
• avoiding harmful situations to people, property, and itself
• safe interaction with human beings and the environment
A humanoid robot is an autonomous robot that can adapt to changes in its environment or itself
Supervised Learning
• learning approaches to regression & classification, neural networks
• learning from examples, learning from a teacher
Unsupervised Learning
Reinforcement Learning
• learning approaches to sequential decision making
• learning from a critic, learning from delayed reward
Robot learning
Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states.
• environment states S
• actions A
• rewards R.
State s:
• value V(s)
• action-value Q(s,a)
Reinforcement learning
State sequence: $S_0, S_1, S_2, \ldots$

$x_{t+1} = f(x_t)$ → is MDP
$x_{t+1} = f(x_t, x_{t-1}, \ldots, x_{t-n})$ → is NOT MDP
Example – driving:
• Agent – driver
• Environment – car on a road
• States – car position, velocity, acceleration ($x, \dot{x}, \ddot{x}$)
• Action – steering wheel angle (the action is the result of a policy)
• Reward – success or fail
Markov decision process (MDP) - A probabilistic model of a sequential decision problem, where states can be perceived exactly, and the current state and action selected determine a probability distribution on future states. The outcome of applying an action to a state depends only on the current action and state (and not on preceding actions or states).
The environment is typically formulated as a finite-state Markov decision process.
The basic reinforcement learning model applied to MDPs consists of:
• a set of environment states S;
• a set of actions A; and
• a set of scalar "rewards" in R.
At each time t, the agent perceives its state $s_t \in S$ and the set of possible actions $A(s_t)$. It chooses an action and receives from the environment the new state $s_{t+1}$ and a reward $r_t$. Based on these interactions, the reinforcement learning agent must develop a policy $\pi : S \to A$ which maximizes the quantity $R = r_0 + r_1 + \cdots + r_n$ for MDPs which have a terminal state, or the discounted quantity

$R = \sum_{t} \gamma^t r_t$

for MDPs without terminal states.
A policy π determines which action should be performed in each state; a policy is a mapping from states to actions.
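As a minimal sketch of this interaction loop (in Python, with a hypothetical `env` object exposing `reset`/`step` and a `policy` function; none of these names come from the slides):

    def run_episode(env, policy):
        """Run one episode and return R = r0 + r1 + ... + rn."""
        s = env.reset()                # initial state s0
        R, done = 0.0, False
        while not done:
            a = policy(s)              # pi : S -> A
            s, r, done = env.step(a)   # environment returns s_{t+1} and r_t
            R += r
        return R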
Value Function

The value of a state is defined as the sum of the reinforcements received when starting in that state and following some fixed policy to a terminal state:

$V(s_t) = E\{R_t \mid s_t = s\} = E\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$

Action-Value Function

$Q(s_t, a_t) = E\{R_t \mid s_t = s, a_t = a\} = E\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}$
The optimal value function V* is the value function obtained by following the optimal policy $\pi^* : S \to A$, the policy that maximizes V (or Q).
Value function update methods
Monte Carlo Method

Repeat forever:
  Choose a random policy
  Generate an entire episode
  For each state s appearing in the episode, compute the return R = R + r(s)
  The value function V(s) is the average return

$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$

where α is the learning rate.

Drawback: V can only be updated after R is calculated from a complete simulation run, so the method can only be applied to episodic tasks.
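A minimal every-visit Monte Carlo sketch of this procedure (hypothetical `env`/`policy` as above):

    from collections import defaultdict

    def mc_value_estimate(env, policy, n_episodes=1000, alpha=0.1):
        """Monte Carlo: update V(s) toward the full return R observed
        after each visit, only once the episode is complete."""
        V = defaultdict(float)
        for _ in range(n_episodes):
            episode, s, done = [], env.reset(), False
            while not done:                      # generate an entire episode
                a = policy(s)
                s_next, r, done = env.step(a)
                episode.append((s, r))
                s = s_next
            R = 0.0
            for s, r in reversed(episode):       # return from each state onward
                R += r
                V[s] += alpha * (R - V[s])       # V(s) <- V(s) + alpha (R - V(s))
        return V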
Temporal difference learning

TD learning is a prediction method that combines Monte Carlo ideas with dynamic programming (DP) ideas. Neuroscience research has found evidence of this mechanism in human and animal brains.

The TD method updates the value or action-value function immediately after each visit to a new state:

$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$

Benefits: suitable for non-episodic tasks, suitable for on-line implementation, better convergence.
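In code, the whole method reduces to one update per transition (a sketch; `V` is assumed to be a dict with default value 0):

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
        """TD(0): move V(s_t) toward r_{t+1} + gamma V(s_{t+1})
        immediately after the transition, no full episode needed."""
        V[s] += alpha * (r + gamma * V[s_next] - V[s])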
Eligibility traces – TD(λ)

An eligibility trace is a temporary record of the occurrence of an event, such as the visiting of a state or the taking of an action. Eligibility traces are a bridge from TD to Monte Carlo methods.

Update algorithm (repeat for all states s in each step):

$V(s) \leftarrow V(s) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] e(s)$

$e(s) = \begin{cases} \gamma \lambda \, e(s) & \text{if } s \neq s_t \\ \gamma \lambda \, e(s) + 1 & \text{if } s = s_t \end{cases}$
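A sketch of one TD(λ) episode with accumulating traces (hypothetical `env`/`policy` as before; `V` is a `defaultdict(float)` so terminal states evaluate to 0):

    from collections import defaultdict

    def td_lambda_episode(env, policy, V, alpha=0.1, gamma=1.0, lam=0.9):
        """TD(lambda): after each transition, update ALL states
        in proportion to their eligibility traces e(s)."""
        e = defaultdict(float)                   # traces, reset per episode
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] - V[s] # TD error
            e[s] += 1.0                          # e(s_t) gets an extra +1
            for x in list(e):                    # repeat for all states in each step
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam              # e(s) <- gamma*lambda*e(s)
            s = s_next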
Random walk example
[Figure: value function V over states 1–5 after 10 steps and after 100 steps, comparing the true values ("True") with TD, eligibility-trace ("ET"), and Monte Carlo ("MC1", "MC2") estimates.]
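The figure can be reproduced with the TD(λ) sketch above, assuming the standard 5-state random walk benchmark (start in the middle, equiprobable left/right moves, reward 1 only on exiting to the right; the true values are then 1/6 … 5/6):

    import random
    from collections import defaultdict

    class RandomWalk:
        """States 1..5; exits at 0 (reward 0) and 6 (reward 1)."""
        def reset(self):
            self.s = 3                           # start in the middle
            return self.s
        def step(self, a):                       # action ignored: pure random walk
            self.s += random.choice([-1, 1])
            if self.s == 0: return self.s, 0.0, True
            if self.s == 6: return self.s, 1.0, True
            return self.s, 0.0, False

    env, V = RandomWalk(), defaultdict(float)
    for _ in range(100):                         # 100 episodes
        td_lambda_episode(env, lambda s: None, V)
    print([round(V[s], 2) for s in range(1, 6)]) # approaches 1/6 .. 5/6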
Direct Q-Learning methods
SARSA
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
Because the update bootstraps from the value of the action actually chosen next, we say that SARSA learns the Q values associated with the policy it follows itself (an on-policy method).
Example: ε-greedy policy

$a = \begin{cases} \arg\max_a Q(s, a) & \text{with probability } 1 - \epsilon & \text{(exploitation)} \\ \text{random } a & \text{with probability } \epsilon & \text{(exploration)} \end{cases}$
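A direct transcription of this policy (a sketch; `Q` is a dict with default value 0, `actions` the list of actions available in s):

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """Explore a random action with probability epsilon,
        otherwise exploit the greedy action argmax_a Q(s, a)."""
        if random.random() < epsilon:
            return random.choice(actions)              # exploration
        return max(actions, key=lambda a: Q[(s, a)])   # exploitation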
Q-Learning
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
Because the update bootstraps from the maximum future value, we say that Q-learning learns the Q values of the pure exploitation (greedy) policy while actually following an exploration/exploitation policy (an off-policy method).
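Side by side, the two updates differ only in the bootstrap term (a sketch; `Q` as above):

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.4, gamma=0.99):
        """On-policy: bootstrap from the action a_next actually taken next."""
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

    def q_update(Q, s, a, r, s_next, actions, alpha=0.4, gamma=0.99):
        """Off-policy: bootstrap from the greedy action in s_next."""
        best = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])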
Cliff-walking problem

[Figure: gridworld with a cliff along the bottom edge; every step gives R = −1 and stepping into the cliff gives R = −100. Q-learning and SARSA learn different paths along the cliff.]
Case study – Ball in a cup game
Goal: swing the ball on a rope to the desired height with zero velocity.
State variables
hand position and velocity, rope angle and angular velocity
State boxing

x1 = [0.3 0];       % hand position
x2 = [-1 0 1];      % hand velocity
x3 = [0 : 360];     % rope angle, 18 values
x4 = [-1.5 0 1.5];  % angular velocity
Number of states = 3*3*18*3 = 486

Reward function

if (x > 0.65) || (x < -0.05)
    r = -500 - 5*abs(x_dot);
elseif ((abs(theta-D1) < 0.2) && (abs(theta_dot) < 0.3))
    r = 1000;
else
    r = theta^2;
end
Learning algorithm

Q-Learning with α = 0.4, γ = 0.99, λ = 0.9
States – a discrete (boxed) representation of the continuous variables $x, \dot{x}, \theta, \dot{\theta}$
Actions – acceleration from the set [-1, -0.5, 0, +0.5, +1]
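One way to implement the state boxing is sketched below (in Python; the box boundaries are illustrative guesses based on the values listed above, not the exact discretization used in the experiment):

    import numpy as np

    def box(value, boundaries):
        """Index of the box a continuous value falls into."""
        return int(np.searchsorted(boundaries, value))

    def state_index(x, x_dot, theta, theta_dot):
        """Combine the four boxed variables into one discrete state index."""
        i1 = box(x, [0.0, 0.3])                  # hand position: 3 boxes
        i2 = box(x_dot, [-1.0, 1.0])             # hand velocity: 3 boxes
        i3 = int(theta % 360) // 20              # rope angle: 18 boxes of 20 deg
        i4 = box(theta_dot, [-1.5, 1.5])         # angular velocity: 3 boxes
        return ((i1 * 3 + i2) * 18 + i3) * 3 + i4    # 3*3*18*3 = 486 states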
Simulation results of Q Learning
Experimental result of Q Learning
A central issue in trajectory generation and modification is the choice of the representation (or encoding) of the trajectory.
Dynamic Motion Primitives – DMPs

Trajectory representation: a second-order system plus a trajectory modulation term, parameterized by the goal (g) and the kernel function weights (w). Both discrete-time and continuous-time formulations exist; the time evolution of the trajectory is governed by the second-order system.

Learning of Dynamic Motion Primitives can be accomplished by regression.
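The equations themselves did not survive the transcription; for reference, a standard discrete DMP formulation (after Ijspeert et al.), which matches the labels above, is:

    % Standard discrete DMP: tau is a time constant, g the goal,
    % w_i the kernel weights -- matching the slide's labels.
    \begin{align*}
      \tau \dot{z} &= \alpha_z\bigl(\beta_z (g - y) - z\bigr) + f(x)
          && \text{(second-order system + modulation)} \\
      \tau \dot{y} &= z \\
      \tau \dot{x} &= -\alpha_x x
          && \text{(phase: time evolution)} \\
      f(x) &= \frac{\sum_i w_i \Psi_i(x)}{\sum_i \Psi_i(x)}\, x\,(g - y_0),
      \qquad \Psi_i(x) = \exp\bigl(-h_i (x - c_i)^2\bigr)
    \end{align*}

The weights $w_i$ of the Gaussian kernels $\Psi_i$ are what the regression learns.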
Policy gradient methods

Policy gradient methods are a type of reinforcement learning technique that relies on optimizing parameterized policies with respect to the expected return:

$J(\theta) = E\left\{ \sum_{k=0}^{H} \gamma^k r_k \right\}, \qquad \max_\theta J(\theta)$

γ – discount factor in [0, 1]; r – reward

The parameters are updated using the gradient update rule

$\theta_{h+1} = \theta_h + \alpha \nabla_\theta J$

α – learning rate

Note: the gradient update converges only to a local maximum!
Gradient estimation techniques

Finite-difference methods

Random variation of the parameters: $\theta_{h,i} = \theta_h + \Delta\theta_i$

Experimental evaluation of J: $\Delta \hat{J}_i = J(\theta_{h,i}) - J_{ref}$

Gradient estimation using regression: $\nabla_\theta J \approx \left(\Delta\Theta^T \Delta\Theta\right)^{-1} \Delta\Theta^T \Delta\hat{J}$

Parameter update: $\theta_{h+1} = \theta_h + \alpha \nabla_\theta J$

Each of the perturbed parameter sets $\theta_{h,1}, \theta_{h,2}, \ldots, \theta_{h,k}$ is evaluated over an entire episode (a sequence of transitions).
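A sketch of the finite-difference gradient estimator (the rollout function `evaluate_J` is a hypothetical stand-in for running one episode and measuring its return):

    import numpy as np

    def fd_policy_gradient(theta, evaluate_J, n_rollouts=20, sigma=0.1):
        """Estimate grad J by regressing return differences on random
        parameter perturbations: g = (dTheta^T dTheta)^-1 dTheta^T dJ."""
        J_ref = evaluate_J(theta)                    # reference rollout
        dTheta = sigma * np.random.randn(n_rollouts, theta.size)
        dJ = np.array([evaluate_J(theta + d) - J_ref for d in dTheta])
        g, *_ = np.linalg.lstsq(dTheta, dJ, rcond=None)  # regression estimate
        return g

    # Gradient-ascent parameter update: theta_{h+1} = theta_h + alpha * grad J
    # theta = theta + alpha * fd_policy_gradient(theta, evaluate_J)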
Liquid pouring learning - simulation
one transition
J estimation using balance
Experimental results
Goal learning