Deep Q-Learning: A Reinforcement Learning Approach

TRANSCRIPT

Page 1: Deep Q-Learning

Deep Q-Learning: A Reinforcement Learning approach

Page 2: Deep Q-Learning

What is Reinforcement Learning?

- Much like biological agents behave
- No supervisor, only a reward
- Data is time dependent (non-i.i.d.)
- Feedback is delayed
- Agent actions affect the data it receives

Page 3: Deep Q-Learning

Examples

- Play checkers (1959)
- Defeat the world champion at Backgammon (1992)
- Control a helicopter (2008)
- Make a robot walk
- RoboCup Soccer
- Play ATARI games better than humans (2014)
- Defeat the world champion at Go (2016)

Videos

Page 4: Deep Q-Learning

Reward Hypothesis

All goals can be described by the maximisation of expected cumulative reward.

- Defeat the world champion at Go: +R / -R for winning / losing a game
- Make a robot walk: +R for moving forward, -R for falling over
- Play ATARI games: +R / -R for increasing / decreasing score
- Control a helicopter: +R / -R for following trajectory / crashing

Page 5: Deep Q-Learning

Agent and Environment

Page 6: Deep Q-Learning

Fully Observable Environments (agent state = environment state):

- Agent directly observes environment
- Example: chess board

Partially Observable Environments (agent state ≠ environment state):

- Agent indirectly observes environment
- Example: a robot with motion sensors or a camera
- Agent must construct its own state representation

Page 7: Deep Q-Learning

RL components: Policy and Value Function

Policy is the agent's behaviour function:

- Maps from state to action
- Deterministic policy: a = π(s)
- Stochastic policy: π(a|s) = P[A_t = a | S_t = s] (both are sketched below)
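To make the two policy types concrete, here is a minimal Python sketch; it is not from the deck, and the states, actions, and probabilities are invented placeholders.

    import random

    # Deterministic policy: a fixed mapping from state to action (toy example).
    deterministic_policy = {"s0": "left", "s1": "right"}

    def act_deterministic(state):
        return deterministic_policy[state]

    # Stochastic policy: a probability distribution over actions per state (toy numbers).
    stochastic_policy = {
        "s0": {"left": 0.9, "right": 0.1},
        "s1": {"left": 0.2, "right": 0.8},
    }

    def act_stochastic(state):
        dist = stochastic_policy[state]
        actions, probs = zip(*dist.items())
        return random.choices(actions, weights=probs, k=1)[0]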

Value function is a prediction of future reward:

- Used to evaluate states and select between actions
- v_π(s) = E_π[ R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ... | S_t = s ]

Page 8: Deep Q-Learning

Model

A model predicts what the environment will do next:

- P predicts the next state
- R predicts the next (immediate) reward

Page 9: Deep Q-Learning

Maze example: rewards (r = -1 per time-step) and policy

[David Silver. Advanced Topics: RL]

Page 10: Deep Q-Learning

Maze example: Value function and Model

[David Silver. Advanced Topics: RL]

Page 11: Deep Q-Learning

Exploration - Exploitation dilemma
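A standard way to trade off the two (and the behaviour policy used later in Q-learning) is ε-greedy action selection. The sketch below is a generic illustration, not taken from the slides; the Q-values and ε are placeholders.

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        # With probability epsilon explore (random action),
        # otherwise exploit (action with the highest Q-value).
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda a: q_values[a])

    # Example: Q-values for four actions in some state.
    print(epsilon_greedy([0.1, 0.5, 0.2, 0.0]))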

Page 12: Deep Q-Learning

Math: Markov Decision Process (MDP)

Almost all RL problems can be formalised as MDPs.

It’s a tuple ⟨S, A, P, R, γ⟩:

- S is a finite set of states
- A is a finite set of actions
- P is the state transition probability matrix: P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]
- R is a reward function: R^a_s = E[R_{t+1} | S_t = s, A_t = a]
- γ is a discount factor, γ ∈ [0, 1] (a toy MDP built from these pieces is sketched below)
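To make the tuple concrete, here is a two-state toy MDP written out as plain Python data; all states, actions, probabilities, and rewards are invented for illustration.

    # A toy MDP <S, A, P, R, gamma> with made-up numbers.
    S = ["s0", "s1"]
    A = ["stay", "move"]

    # P[s][a] maps next states to probabilities: P(s' | s, a).
    P = {
        "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.9, "s0": 0.1}},
        "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
    }

    # R[s][a] is the expected immediate reward E[R_{t+1} | s, a].
    R = {
        "s0": {"stay": 0.0, "move": 1.0},
        "s1": {"stay": 0.5, "move": 0.0},
    }

    gamma = 0.9  # discount factor in [0, 1]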

Page 13: Deep Q-Learning

State-Value and Action-Value functions, Bellman equations

State-value function: the expected return starting from state s and then following policy π:

v_π(s) = E_π[ R_{t+1} + γ·v_π(S_{t+1}) | S_t = s ]

Action-value function: the expected return starting from state s, taking action a, and then following policy π:

q_π(s, a) = E_π[ R_{t+1} + γ·q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]

Page 14: Deep Q-Learning

Finding an Optimal Policy

- There is always an optimal policy for any MDP
- All optimal policies achieve the optimal value function
- All optimal policies achieve the optimal action-value function

All you need is to find q*(s, a); acting greedily on it, π*(s) = argmax_a q*(s, a), gives an optimal policy.

Page 15: Deep Q-Learning

Bellman Opt Equation for state-value function

[David Silver. Advanced Topics: RL]

Page 16: Deep Q-Learning

Bellman Opt Equation for action-value function

[David Silver. Advanced Topics: RL]
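The equations on these two slides appear as images credited to David Silver's course; in standard notation they are the Bellman optimality equations, reproduced here for reference rather than copied from the slides:

    v*(s)    = max_a [ R^a_s + γ Σ_{s'} P^a_{ss'} v*(s') ]
    q*(s, a) = R^a_s + γ Σ_{s'} P^a_{ss'} max_{a'} q*(s', a')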

Page 17: Deep Q-Learning

Bellman Opt Equation for state-value function

[David Silver. Advanced Topics: RL]

Page 18: Deep Q-Learning

Bellman Opt Equation for action-value function

[David Silver. Advanced Topics: RL]

Page 20: Deep Q-Learning

Q-Learning: a model-free, off-policy control algorithm

Model-free (vs model-based):

- Model-free: the MDP model is unknown, but experience can be sampled
- Model-based: the MDP model is known, but is too big to use, except by samples

Off-policy (vs On-policy):

- Can learn about the target policy from experience sampled using some other (behaviour) policy

Control (vs Prediction):

- Find the best policy (rather than just evaluating a given one)

Page 21: Deep Q-Learning

Q-Learning

[David Silver. Advanced Topics: RL]
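The update rule on this slide (shown as an image from Silver's course) is the standard tabular Q-learning update, Q(S, A) ← Q(S, A) + α[ R + γ max_{a'} Q(S', a') - Q(S, A) ]. Below is a minimal sketch of one episode of that update; the environment interface (reset/step/actions) and the hyperparameters are assumptions for illustration.

    import random
    from collections import defaultdict

    def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
        # env is assumed to expose reset() -> state, step(action) -> (next_state, reward, done),
        # and a list of actions in env.actions.
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (behaviour policy)
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # The target uses the greedy (max) action in the next state -> off-policy learning.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            td_target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])

            state = next_state
        return Q

    Q = defaultdict(float)  # Q-table, defaulting to 0 for unseen (state, action) pairs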

Page 22: Deep Q-Learning

DQN - Q-Learning with function approximation

[Human-level control through deep reinforcement learning]
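DQN replaces the Q-table with a neural network Q(s, a; θ) and minimises the squared TD error (y - Q(s, a; θ))², where y = r + γ max_{a'} Q(s', a'; θ⁻) is computed with a separate target network. The PyTorch-style sketch below is a generic illustration of that loss, not the authors' code; the network size and hyperparameters are placeholders.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        # Small MLP mapping a state vector to one Q-value per action (sizes are placeholders).
        def __init__(self, state_dim=4, n_actions=2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(),
                nn.Linear(64, n_actions),
            )

        def forward(self, state):
            return self.net(state)

    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        # Squared TD error against a frozen target network.
        states, actions, rewards, next_states, dones = batch
        q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            max_next = target_net(next_states).max(dim=1).values
            target = rewards + gamma * max_next * (1 - dones)
        return nn.functional.mse_loss(q_sa, target)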

Page 23: Deep Q-Learning

[Human-level control through deep reinforcement learning]

Page 24: Deep Q-Learning

Issues with Q-learning with a neural network

- Data is sequential (non-i.i.d.)
- Policy changes rapidly with slight changes to Q-values
  - Policy may oscillate
  - Experience flows from one extreme to another
- Scale of rewards and Q-values is unknown
  - Unstable backpropagation due to large gradients

Page 25: Deep Q-Learning

DQN solutions

- Use experience replay
  - Breaks correlations in data
  - Learn from all past policies
  - Possible because Q-learning is off-policy
- Freeze the target Q-network
  - Avoids policy oscillations
  - Breaks correlations between the Q-network and its target
- Clip rewards and gradients

A sketch of the replay buffer and the frozen target network follows below.
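This is a minimal sketch of the first two ideas (experience replay and a frozen target network), reusing the QNetwork and dqn_loss sketched earlier; the buffer capacity, batch size, and sync policy are placeholder choices, not values from the paper.

    import random
    from collections import deque

    import torch

    class ReplayBuffer:
        # Fixed-size buffer of (s, a, r, s', done) transitions;
        # sampling uniformly at random breaks temporal correlations.
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            batch = random.sample(self.buffer, batch_size)
            states, actions, rewards, next_states, dones = zip(*batch)
            return (torch.stack(states),
                    torch.tensor(actions),
                    torch.tensor(rewards, dtype=torch.float32),
                    torch.stack(next_states),
                    torch.tensor(dones, dtype=torch.float32))

    # Frozen target network: copy the online network's weights only every N steps,
    # so the regression target stays fixed in between.
    def sync_target(q_net, target_net):
        target_net.load_state_dict(q_net.state_dict())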

Page 26: Deep Q-Learning

Neon Demo