AN INTRODUCTION TO REINFORCEMENT LEARNING IN ROBOTICS
Bojan Nemec, Jožef Stefan Institute
Humanoid robots
A humanoid robot is a robot with an appearance similar to the human body, allowing interaction with made-for-human tools or environments. Desired capabilities include:
• self-maintenance
• autonomous learning
• avoiding harmful situations to people, property, and itself
• safe interaction with human beings and the environment
A humanoid robot is an autonomous robot that can adapt to changes in its environment or itself
Supervised Learning
• learning approaches to regression & classification, neural networks
• learning from examples, learning from a teacher
Unsupervised Learning
Reinforcement Learning
• learning approaches to sequential decision making
• learning from a critic, learning from delayed reward
Robot learning
Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states.
• environment states S
• actions A
• rewards R.
State s:
• value V(s)
• action-value Q(s,a)
Reinforcement learning
State sequence: $S_0, S_1, S_2, \ldots$

$x_{t+1} = f(x_t)$ → is MDP
$x_{t+1} = f(x_t, x_{t-1}, \ldots, x_{t-n})$ → is NOT MDP
Example – driving:
• Agent – driver
• Environment – car on a road
• States – car position, velocity, acceleration ($x, \dot{x}, \ddot{x}$)
• Action – steering wheel angle (the action is the result of a policy)
• Reward – success or fail
Markov decision process (MDP) - A probabilistic model of a sequential decision problem, where states can be perceived exactly, and the current state and action selected determine a probability distribution on future states. The outcome of applying an action to a state depends only on the current action and state (and not on preceding actions or states).
The environment is typically formulated as a finite-state Markov decision process.
The basic reinforcement learning model applied to MDPs consists of:
• a set of environment states S;
• a set of actions A; and
• a set of scalar "rewards" in R.
At each time t, the agent perceives its state $s_t \in S$ and the set of possible actions $A(s_t)$. It chooses an action and receives from the environment the new state $s_{t+1}$ and a reward $r_t$. Based on these interactions, the reinforcement learning agent must develop a policy $\pi : S \to A$ which maximizes the quantity $R = r_0 + r_1 + \cdots + r_n$ for MDPs which have a terminal state, or the discounted quantity

$R = \sum_{t} \gamma^t r_t$

for MDPs without terminal states.
A policy π determines which action should be performed in each state; a policy is a mapping from states to actions.
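As a minimal sketch of this interaction loop (in Python, with a hypothetical `env` object exposing `reset`/`step` and a `policy` function; none of these names come from the slides):

    def run_episode(env, policy):
        """Run one episode and return R = r0 + r1 + ... + rn."""
        s = env.reset()                # initial state s0
        R, done = 0.0, False
        while not done:
            a = policy(s)              # pi : S -> A
            s, r, done = env.step(a)   # environment returns s_{t+1} and r_t
            R += r
        return R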
Value Function

The value of a state is defined as the sum of the reinforcements received when starting in that state and following some fixed policy to a terminal state:

$V(s_t) = E\{R_t \mid s_t = s\} = E\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right\}$

Action-Value Function

$Q(s_t, a_t) = E\{R_t \mid s_t = s, a_t = a\} = E\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right\}$
The optimal value function V* is the value function obtained by following the optimal policy $\pi^* : S \to A$, the policy that maximizes V (or Q).
Value function update methods
Monte Carlo Method

Repeat forever:
  Choose a random policy
  Generate an entire episode
  For each state s appearing in the episode, compute the return R = R + r(s)
  The value function V(s) is the average return

$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$

where α is the learning rate.

Drawback: V can only be updated after R is calculated from a complete simulation run, so the method can only be applied to episodic tasks.
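A minimal every-visit Monte Carlo sketch of this procedure (hypothetical `env`/`policy` as above):

    from collections import defaultdict

    def mc_value_estimate(env, policy, n_episodes=1000, alpha=0.1):
        """Monte Carlo: update V(s) toward the full return R observed
        after each visit, only once the episode is complete."""
        V = defaultdict(float)
        for _ in range(n_episodes):
            episode, s, done = [], env.reset(), False
            while not done:                      # generate an entire episode
                a = policy(s)
                s_next, r, done = env.step(a)
                episode.append((s, r))
                s = s_next
            R = 0.0
            for s, r in reversed(episode):       # return from each state onward
                R += r
                V[s] += alpha * (R - V[s])       # V(s) <- V(s) + alpha (R - V(s))
        return V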
Temporal difference learning

TD learning is a prediction method that combines Monte Carlo ideas with dynamic programming (DP) ideas. Neuroscience research has found evidence of this mechanism in human and animal brains.

The TD method updates the value or action-value function immediately after each visit to a new state:

$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$

Benefits: suitable for non-episodic tasks, suitable for on-line implementation, better convergence.
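In code, the whole method reduces to one update per transition (a sketch; `V` is assumed to be a dict with default value 0):

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
        """TD(0): move V(s_t) toward r_{t+1} + gamma V(s_{t+1})
        immediately after the transition, no full episode needed."""
        V[s] += alpha * (r + gamma * V[s_next] - V[s])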
Eligibility traces – TD(λ)

An eligibility trace is a temporary record of the occurrence of an event, such as the visiting of a state or the taking of an action. Eligibility traces are a bridge from TD to Monte Carlo methods.

Update algorithm (repeat for all states s in each step):

$V(s) \leftarrow V(s) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] e(s)$

$e(s) = \begin{cases} \gamma \lambda \, e(s) & \text{if } s \neq s_t \\ \gamma \lambda \, e(s) + 1 & \text{if } s = s_t \end{cases}$
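A sketch of one TD(λ) episode with accumulating traces (hypothetical `env`/`policy` as before; `V` is a `defaultdict(float)` so terminal states evaluate to 0):

    from collections import defaultdict

    def td_lambda_episode(env, policy, V, alpha=0.1, gamma=1.0, lam=0.9):
        """TD(lambda): after each transition, update ALL states
        in proportion to their eligibility traces e(s)."""
        e = defaultdict(float)                   # traces, reset per episode
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] - V[s] # TD error
            e[s] += 1.0                          # e(s_t) gets an extra +1
            for x in list(e):                    # repeat for all states in each step
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam              # e(s) <- gamma*lambda*e(s)
            s = s_next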
Random walk example
[Figure: value function V over states 1–5 after 10 steps and after 100 steps, comparing the true values ("True") with TD, eligibility-trace ("ET"), and Monte Carlo ("MC1", "MC2") estimates.]
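The figure can be reproduced with the TD(λ) sketch above, assuming the standard 5-state random walk benchmark (start in the middle, equiprobable left/right moves, reward 1 only on exiting to the right; the true values are then 1/6 … 5/6):

    import random
    from collections import defaultdict

    class RandomWalk:
        """States 1..5; exits at 0 (reward 0) and 6 (reward 1)."""
        def reset(self):
            self.s = 3                           # start in the middle
            return self.s
        def step(self, a):                       # action ignored: pure random walk
            self.s += random.choice([-1, 1])
            if self.s == 0: return self.s, 0.0, True
            if self.s == 6: return self.s, 1.0, True
            return self.s, 0.0, False

    env, V = RandomWalk(), defaultdict(float)
    for _ in range(100):                         # 100 episodes
        td_lambda_episode(env, lambda s: None, V)
    print([round(V[s], 2) for s in range(1, 6)]) # approaches 1/6 .. 5/6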
Direct Q-Learning methods
SARSA
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
Because the update bootstraps from the value of the action actually chosen next, we say that SARSA learns the Q values associated with the policy it follows itself (an on-policy method).
Example: ε-greedy policy

$a = \begin{cases} \arg\max_a Q(s, a) & \text{with probability } 1 - \epsilon & \text{(exploitation)} \\ \text{random } a & \text{with probability } \epsilon & \text{(exploration)} \end{cases}$
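A direct transcription of this policy (a sketch; `Q` is a dict with default value 0, `actions` the list of actions available in s):

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        """Explore a random action with probability epsilon,
        otherwise exploit the greedy action argmax_a Q(s, a)."""
        if random.random() < epsilon:
            return random.choice(actions)              # exploration
        return max(actions, key=lambda a: Q[(s, a)])   # exploitation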
Q-Learning
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
Because the update bootstraps from the maximum future value, we say that Q-learning learns the Q values of the pure exploitation (greedy) policy while actually following an exploration/exploitation policy (an off-policy method).
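Side by side, the two updates differ only in the bootstrap term (a sketch; `Q` as above):

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.4, gamma=0.99):
        """On-policy: bootstrap from the action a_next actually taken next."""
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

    def q_update(Q, s, a, r, s_next, actions, alpha=0.4, gamma=0.99):
        """Off-policy: bootstrap from the greedy action in s_next."""
        best = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])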
Cliff-walking problem

[Figure: gridworld with a cliff along the bottom edge; every step gives R = −1 and stepping into the cliff gives R = −100. Q-learning and SARSA learn different paths along the cliff.]
Case study – Ball in a cup game
Goal: swing the ball on a rope to the desired height with zero velocity.
State variables
hand position and velocity, rope angle and angular velocity
State boxing

x1 = [0.3 0];       % hand position
x2 = [-1 0 1];      % hand velocity
x3 = [0 : 360];     % rope angle, 18 values
x4 = [-1.5 0 1.5];  % angular velocity
Number of states = 3*3*18*3 = 486

Reward function

if (x > 0.65) || (x < -0.05)
    r = -500 - 5*abs(x_dot);
elseif ((abs(theta-D1) < 0.2) && (abs(theta_dot) < 0.3))
    r = 1000;
else
    r = theta^2;
end
Learning algorithm

Q-Learning with α = 0.4, γ = 0.99, λ = 0.9
States – a discrete (boxed) representation of the continuous variables $x, \dot{x}, \theta, \dot{\theta}$
Actions – acceleration from the set [-1, -0.5, 0, +0.5, +1]
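One way to implement the state boxing is sketched below (in Python; the box boundaries are illustrative guesses based on the values listed above, not the exact discretization used in the experiment):

    import numpy as np

    def box(value, boundaries):
        """Index of the box a continuous value falls into."""
        return int(np.searchsorted(boundaries, value))

    def state_index(x, x_dot, theta, theta_dot):
        """Combine the four boxed variables into one discrete state index."""
        i1 = box(x, [0.0, 0.3])                  # hand position: 3 boxes
        i2 = box(x_dot, [-1.0, 1.0])             # hand velocity: 3 boxes
        i3 = int(theta % 360) // 20              # rope angle: 18 boxes of 20 deg
        i4 = box(theta_dot, [-1.5, 1.5])         # angular velocity: 3 boxes
        return ((i1 * 3 + i2) * 18 + i3) * 3 + i4    # 3*3*18*3 = 486 states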
Simulation results of Q Learning
Experimental result of Q Learning
A central issue in trajectory generation and modification is the choice of the representation (or encoding) of the trajectory.
Dynamic Motion Primitives – DMPs

Trajectory representation: a second-order system plus a trajectory modulation term, parameterized by the goal (g) and the kernel function weights (w). Both discrete-time and continuous-time formulations exist; the time evolution of the trajectory is governed by the second-order system.

Learning of Dynamic Motion Primitives can be accomplished by regression.
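The equations themselves did not survive the transcription; for reference, a standard discrete DMP formulation (after Ijspeert et al.), which matches the labels above, is:

    % Standard discrete DMP: tau is a time constant, g the goal,
    % w_i the kernel weights -- matching the slide's labels.
    \begin{align*}
      \tau \dot{z} &= \alpha_z\bigl(\beta_z (g - y) - z\bigr) + f(x)
          && \text{(second-order system + modulation)} \\
      \tau \dot{y} &= z \\
      \tau \dot{x} &= -\alpha_x x
          && \text{(phase: time evolution)} \\
      f(x) &= \frac{\sum_i w_i \Psi_i(x)}{\sum_i \Psi_i(x)}\, x\,(g - y_0),
      \qquad \Psi_i(x) = \exp\bigl(-h_i (x - c_i)^2\bigr)
    \end{align*}

The weights $w_i$ of the Gaussian kernels $\Psi_i$ are what the regression learns.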
Policy gradient methods

Policy gradient methods are a type of reinforcement learning technique that relies on optimizing parameterized policies with respect to the expected return:

$J(\theta) = E\left\{ \sum_{k=0}^{H} \gamma^k r_k \right\}, \qquad \max_\theta J(\theta)$

γ – discount factor in [0, 1]; r – reward

The parameters are updated using the gradient update rule

$\theta_{h+1} = \theta_h + \alpha \nabla_\theta J$

α – learning rate

Note: the gradient update converges only to a local maximum!
Gradient estimation techniques

Finite-difference methods

Random variation of the parameters: $\theta_{h,i} = \theta_h + \Delta\theta_i$

Experimental evaluation of J: $\Delta \hat{J}_i = J(\theta_{h,i}) - J_{ref}$

Gradient estimation using regression: $\nabla_\theta J \approx \left(\Delta\Theta^T \Delta\Theta\right)^{-1} \Delta\Theta^T \Delta\hat{J}$

Parameter update: $\theta_{h+1} = \theta_h + \alpha \nabla_\theta J$

Each of the perturbed parameter sets $\theta_{h,1}, \theta_{h,2}, \ldots, \theta_{h,k}$ is evaluated over an entire episode (a sequence of transitions).
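A sketch of the finite-difference gradient estimator (the rollout function `evaluate_J` is a hypothetical stand-in for running one episode and measuring its return):

    import numpy as np

    def fd_policy_gradient(theta, evaluate_J, n_rollouts=20, sigma=0.1):
        """Estimate grad J by regressing return differences on random
        parameter perturbations: g = (dTheta^T dTheta)^-1 dTheta^T dJ."""
        J_ref = evaluate_J(theta)                    # reference rollout
        dTheta = sigma * np.random.randn(n_rollouts, theta.size)
        dJ = np.array([evaluate_J(theta + d) - J_ref for d in dTheta])
        g, *_ = np.linalg.lstsq(dTheta, dJ, rcond=None)  # regression estimate
        return g

    # Gradient-ascent parameter update: theta_{h+1} = theta_h + alpha * grad J
    # theta = theta + alpha * fd_policy_gradient(theta, evaluate_J)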
Liquid pouring learning - simulation
one transition
J estimation using balance
Experimental results
Goal learning