
Summary: Proximal Policy Optimization (PPO) is the default reinforcement learning algorithm at OpenAI. Outline: Policy Gradient; On-policy → Off-policy; Add constraint.


TRANSCRIPT

Page 1: Proximal Policy Optimization (PPO)
speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture...

Proximal Policy Optimization (PPO)

The default reinforcement learning algorithm at OpenAI.

‱ Policy Gradient

‱ On-policy → Off-policy

‱ Add constraint

Page 2:

DeepMind https://youtu.be/gn4nRCC9TwQ

Page 3:

OpenAI https://blog.openai.com/openai-baselines-ppo/

Page 4:

Policy Gradient (Review)

Page 5:

Basic Components

The three basic components: the Actor, the Environment, and the Reward Function.

‱ Video game example: the reward function might give 20 points for killing a monster.

‱ Go example: the environment and the reward are defined by the rules of Go.

The environment and the reward function are given in advance; you cannot control them. Only the actor can be learned.

Page 6:

‱ Policy 𝜋 is a network with parameter 𝜃

‱ Input: the observation of the machine, represented as a vector or a matrix

‱ Output: each action corresponds to a neuron in the output layer





Example: the input is the game screen (pixels); the output layer gives a score to each action, e.g. left 0.7, right 0.2, fire 0.1. The scores form a probability distribution, and the actor takes the action by sampling based on the probability.

Policy of Actor
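As a rough illustration of such an actor (not part of the lecture), here is a minimal PyTorch sketch; the observation size, hidden width, and the three actions (left, right, fire) are assumptions made only for the example.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps an observation vector to action probabilities (left, right, fire)."""
    def __init__(self, obs_dim=128, n_actions=3):  # sizes are assumptions for the example
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs):
        # Scores of the actions, normalized into a probability distribution
        return torch.softmax(self.net(obs), dim=-1)

policy = PolicyNet()
obs = torch.randn(1, 128)                                  # stand-in for a real observation
probs = policy(obs)                                        # e.g. [[0.7, 0.2, 0.1]]
action = torch.distributions.Categorical(probs).sample()  # take the action based on the probability
```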

Page 7:

Example: Playing Video Game

Start with observation s_1. The actor takes action a_1 = "right" and obtains reward r_1 = 0. The environment returns observation s_2. The actor takes action a_2 = "fire" (killing an alien) and obtains reward r_2 = 5. The environment returns observation s_3.

Page 8:

Example: Playing Video Game (continued)

After many turns, the actor takes action a_T, obtains reward r_T, and the game is over (the spaceship is destroyed). The whole sequence is an episode.

We want the total reward to be maximized:

R = \sum_{t=1}^{T} r_t

Page 9:

Actor, Environment, Reward

Trajectory: \tau = \{s_1, a_1, s_2, a_2, \cdots, s_T, a_T\}

The environment produces s_1, the actor picks a_1 given s_1, the environment produces s_2 given (s_1, a_1), the actor picks a_2 given s_2, and so on.

p_\theta(\tau) = p(s_1)\, p_\theta(a_1|s_1)\, p(s_2|s_1,a_1)\, p_\theta(a_2|s_2)\, p(s_3|s_2,a_2) \cdots
             = p(s_1) \prod_{t=1}^{T} p_\theta(a_t|s_t)\, p(s_{t+1}|s_t,a_t)
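To make the policy-dependent part of this factorization concrete, here is a small sketch (an assumed setup continuing the PolicyNet example above) that accumulates \sum_t \log p_\theta(a_t|s_t) for one trajectory; the environment terms p(s_1) and p(s_{t+1}|s_t,a_t) come from the game and do not depend on \theta, so they are left out.

```python
import torch

def trajectory_log_prob(policy, observations, actions):
    """Sum of log p_theta(a_t | s_t) over one trajectory.

    observations: (T, obs_dim) tensor; actions: (T,) tensor of action indices.
    The environment factors p(s_1) and p(s_{t+1} | s_t, a_t) are omitted: they
    do not depend on theta, so they vanish from grad log p_theta(tau).
    """
    probs = policy(observations)                   # (T, n_actions)
    dist = torch.distributions.Categorical(probs)
    return dist.log_prob(actions).sum()
```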

Page 10:

Actor, Environment, Reward

[Diagram: the actor and the environment interact as above; at every step the reward function returns r_1, r_2, \cdots.]

R(\tau) = \sum_{t=1}^{T} r_t

Expected Reward:

\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau) = E_{\tau \sim p_\theta(\tau)}[R(\tau)]

Page 11:

Policy Gradient

\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau), \qquad \nabla \bar{R}_\theta = ?

Using \nabla f(x) = f(x)\, \nabla \log f(x):

\nabla \bar{R}_\theta = \sum_\tau R(\tau)\, \nabla p_\theta(\tau)
 = \sum_\tau R(\tau)\, p_\theta(\tau)\, \frac{\nabla p_\theta(\tau)}{p_\theta(\tau)}
 = \sum_\tau R(\tau)\, p_\theta(\tau)\, \nabla \log p_\theta(\tau)
 = E_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big]
 \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\, \nabla \log p_\theta(\tau^n)
 = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n | s_t^n)

R(\tau) does not have to be differentiable; it can even be a black box.

Page 12:

Policy Gradient

Given policy \pi_\theta:

Data collection: sample trajectories
\tau^1: (s_1^1, a_1^1), (s_2^1, a_2^1), \cdots, each pair weighted by R(\tau^1)
\tau^2: (s_1^2, a_1^2), (s_2^2, a_2^2), \cdots, each pair weighted by R(\tau^2)
\cdots

Update model:
\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta
\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n | s_t^n) = E_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big]

Data collection and model update alternate, and the sampled data is only used once: after \theta is updated, the old samples no longer come from the current policy and new data must be collected.
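A compact sketch of this collect-then-update loop, under assumed interfaces (a dummy environment stands in for a real game, and PolicyNet / trajectory_log_prob come from the earlier sketches); it illustrates the on-policy procedure rather than reproducing the lecture's code.

```python
import torch

class DummyEnv:
    """Stand-in environment so the sketch runs; a real game would replace this."""
    def reset(self):
        return torch.randn(128)
    def step(self, action):
        obs, reward, done = torch.randn(128), float(torch.randn(())), bool(torch.rand(()) < 0.1)
        return obs, reward, done

def collect_episode(env, policy):
    """Play one episode with the current pi_theta; return its (s_t, a_t) pairs and R(tau)."""
    obs, done = env.reset(), False
    observations, actions, total_reward = [], [], 0.0
    while not done:
        probs = policy(obs.unsqueeze(0)).squeeze(0)
        a = torch.distributions.Categorical(probs).sample()
        observations.append(obs)
        actions.append(a)
        obs, r, done = env.step(a.item())
        total_reward += r
    return torch.stack(observations), torch.stack(actions), total_reward

env, policy = DummyEnv(), PolicyNet()
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)   # theta <- theta + eta * grad R_bar
for iteration in range(3):
    # Data collection: N episodes sampled from the *current* pi_theta
    batch = [collect_episode(env, policy) for _ in range(4)]
    # Model update: one gradient step; the batch is then discarded (used only once)
    loss = -sum(R * trajectory_log_prob(policy, S, A) for S, A, R in batch) / len(batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```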

Page 13:

Implementation

\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta, \qquad \nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n | s_t^n)

Consider it as a classification problem: each stored s_t^n is an input, and the action a_t^n that was actually taken is the target class, e.g. if the action taken was "left", the target is the one-hot vector (left, right, fire) = (1, 0, 0).

Plain classification would maximize \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p_\theta(a_t^n | s_t^n), whose gradient is \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \nabla \log p_\theta(a_t^n | s_t^n).

For policy gradient, weight each term by R(\tau^n): maximize \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \log p_\theta(a_t^n | s_t^n), whose gradient is exactly \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n | s_t^n).

This is straightforward to implement in TF or PyTorch as a weighted cross-entropy.
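In PyTorch, for instance, the weighted objective can be written as a per-step-weighted cross-entropy; a minimal sketch with assumed tensor shapes (not the lecture's code):

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(logits, actions, weights):
    """Weighted cross-entropy of the actions that were actually taken.

    logits:  (B, n_actions) network outputs for the stored observations s_t^n
    actions: (B,) the taken actions a_t^n, treated as class labels
    weights: (B,) R(tau^n) of the episode each pair came from
             (or the advantage A(s_t, a_t) once the later tips are applied)
    Minimizing this loss ascends (1/N) sum_n sum_t weight * log p_theta(a_t^n | s_t^n).
    """
    log_probs = F.log_softmax(logits, dim=-1)                      # log p_theta(. | s_t)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log p_theta(a_t | s_t)
    return -(weights * chosen).mean()
```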

Page 14:

Tip 1: Add a Baseline

\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta, \qquad \nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla \log p_\theta(a_t^n | s_t^n)

It is possible that R(\tau^n) is always positive (e.g. game scores never go below zero), so every sampled action has its probability pushed up. In the ideal case this is still fine: since the outputs are a probability distribution, actions with larger rewards are pushed up more and the rest are normalized down. With sampling, however, an action that happens not to be sampled sees its probability decrease even if it is a good action.

Fix: subtract a baseline b so that the weight can be negative:

\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log p_\theta(a_t^n | s_t^n), \qquad b \approx E[R(\tau)]
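One simple choice of baseline, assumed here only for illustration (the slide just says b ≈ E[R(τ)]), is the mean return of the sampled batch:

```python
import torch

def centered_returns(returns):
    """Weights R(tau^n) - b with b = mean of the sampled returns.

    returns: (N,) tensor of R(tau^n). After centering, roughly half of the
    weights are negative, so actions that were merely not sampled no longer
    lose probability by default.
    """
    b = returns.mean()          # b ~ E[R(tau)], estimated from the batch
    return returns - b
```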

Page 15:

Tip 2: Assign Suitable Credit

\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \big(R(\tau^n) - b\big)\, \nabla \log p_\theta(a_t^n | s_t^n)

Weighting every action by the whole-episode reward R(\tau^n) gives all actions in an episode the same credit.

Example: in the episode (s_a, a_1) → +5, (s_b, a_2) → +0, (s_c, a_3) → −2, the total reward is R = +3, so all three actions are weighted by ×3. In the episode (s_a, a_2) → −5, (s_b, a_2) → +0, (s_c, a_3) → −2, the total reward is R = −7, so all three actions are weighted by ×(−7), even though the pairs (s_b, a_2) and (s_c, a_3) only produced −2 afterwards and deserve a weight of ×(−2).

Better: weight each action only by the reward collected from its time step onward, i.e. replace R(\tau^n) with \sum_{t'=t}^{T_n} r_{t'}^n.

Page 16:

Tip 2: Assign Suitable Credit (continued)

\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \Big(\sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n - b\Big)\, \nabla \log p_\theta(a_t^n | s_t^n)

Add a discount factor \gamma < 1: rewards obtained further in the future are credited less to the current action.

The baseline b can be state-dependent. The whole weight \big(\sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n - b\big) can be written as the Advantage Function A^\theta(s_t, a_t): how good it is to take a_t rather than the other actions at s_t. It can be estimated by a "critic" (discussed later).
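A small sketch of the discounted reward-to-go weight \sum_{t'=t}^{T_n} \gamma^{t'-t} r_{t'}^n for one episode (the value of \gamma is an assumption; the critic-based advantage comes later):

```python
def discounted_rewards_to_go(rewards, gamma=0.99):
    """For each t, return sum over t' >= t of gamma^(t'-t) * r_{t'}."""
    out, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

# Example: rewards [+5, 0, -2] give weights [~3.04, -1.98, -2.0] with gamma = 0.99.
```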

Page 17:

From on-policy to off-policy

Using the experience more than once

Page 18:

On-policy vs. Off-policy

‱ On-policy: the agent that learns and the agent that interacts with the environment are the same.

‱ Off-policy: the agent that learns and the agent that interacts with the environment are different.

Analogy: on-policy is é˜żć…‰ (Hikaru) learning Go by playing himself; off-policy is 䜐ç‚ș (Sai) playing Go while é˜żć…‰ watches from the side.

Page 19:

On-policy → Off-policy

đ›» àŽ€đ‘…đœƒ = 𝐾𝜏~𝑝𝜃 𝜏 𝑅 𝜏 đ›»đ‘™đ‘œđ‘”đ‘đœƒ 𝜏

đžđ‘„~𝑝[𝑓 đ‘„ ]

‱ Use 𝜋𝜃 to collect data. When 𝜃 is updated, we have to sample training data again.

‱ Goal: Using the sample from 𝜋𝜃â€Č to train 𝜃. 𝜃â€Č is fixed, so we can re-use the sample data.

Importance Sampling

≈1

𝑁

𝑖=1

𝑁

𝑓 đ‘„đ‘–

đ‘„đ‘– is sampled from 𝑝 đ‘„

We only have đ‘„đ‘– sampled from 𝑞 đ‘„

= à¶±đ‘“ đ‘„ 𝑝 đ‘„ đ‘‘đ‘„ = à¶±đ‘“ đ‘„đ‘ đ‘„

𝑞 đ‘„đ‘ž đ‘„ đ‘‘đ‘„ = đžđ‘„~𝑞[𝑓 đ‘„

𝑝 đ‘„

𝑞 đ‘„]

Importance weight
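A tiny numeric check of this identity (the Gaussian p and q and the choice f(x) = x^2 are assumptions made only for this illustration):

```python
import torch

torch.manual_seed(0)
p = torch.distributions.Normal(0.0, 1.0)     # target distribution p(x)
q = torch.distributions.Normal(0.5, 1.5)     # the distribution we can actually sample from
f = lambda x: x ** 2                         # E_{x~p}[x^2] = 1 for a standard normal

x = q.sample((100_000,))                     # x_i sampled from q(x)
w = (p.log_prob(x) - q.log_prob(x)).exp()    # importance weights p(x)/q(x)
print(float((f(x) * w).mean()))              # weighted average, close to 1.0
```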

Page 20:

Issue of Importance Sampling

E_{x \sim p}[f(x)] = E_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big], \qquad but \qquad Var_{x \sim p}[f(x)] \neq Var_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big]

Using Var[X] = E[X^2] - \big(E[X]\big)^2:

Var_{x \sim p}[f(x)] = E_{x \sim p}\big[f(x)^2\big] - \big(E_{x \sim p}[f(x)]\big)^2

Var_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big] = E_{x \sim q}\Big[\Big(f(x)\, \frac{p(x)}{q(x)}\Big)^2\Big] - \Big(E_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big]\Big)^2 = E_{x \sim p}\Big[f(x)^2\, \frac{p(x)}{q(x)}\Big] - \big(E_{x \sim p}[f(x)]\big)^2

The first term picks up an extra factor p(x)/q(x); if p and q differ a lot, the variance can become very large.

Page 21:

Issue of Importance Sampling (continued)

E_{x \sim p}[f(x)] = E_{x \sim q}\Big[f(x)\, \frac{p(x)}{q(x)}\Big]

[Figure: p(x) has most of its mass where f(x) is negative, while q(x) has most of its mass where f(x) is positive, so p(x)/q(x) is a very large weight in the region where p is large.]

The true E_{x \sim p}[f(x)] is negative. With only a few samples from q(x), they will probably fall where q is large and f is positive, so the estimate looks positive. Only a rare sample from the region with the very large weight p(x)/q(x) pulls the estimate back to negative. With too few samples, importance sampling can therefore give the wrong answer when p and q are very different.

Page 22:

On-policy → Off-policy

\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big]

‱ Use \pi_\theta to collect data. When \theta is updated, we have to sample training data again.

‱ Goal: use the samples from \pi_{\theta'} to train \theta. Since \theta' is fixed, we can re-use the sampled data.

Applying importance sampling, E_{x \sim p}[f(x)] = E_{x \sim q}\big[f(x)\, p(x)/q(x)\big]:

\nabla \bar{R}_\theta = E_{\tau \sim p_{\theta'}(\tau)}\Big[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\, \nabla \log p_\theta(\tau)\Big]

‱ Sample the data from \theta'.

‱ Use the data to train \theta many times.

Page 23:

On-policy → Off-policy

Gradient for the update:

\nabla \bar{R}_\theta = E_{(s_t,a_t) \sim \pi_\theta}\big[A^\theta(s_t, a_t)\, \nabla \log p_\theta(a_t^n | s_t^n)\big]

 = E_{(s_t,a_t) \sim \pi_{\theta'}}\Big[\frac{P_\theta(s_t, a_t)}{P_{\theta'}(s_t, a_t)}\, A^\theta(s_t, a_t)\, \nabla \log p_\theta(a_t^n | s_t^n)\Big]

 = E_{(s_t,a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\, \frac{p_\theta(s_t)}{p_{\theta'}(s_t)}\, A^\theta(s_t, a_t)\, \nabla \log p_\theta(a_t^n | s_t^n)\Big]

Since the advantage is estimated from the data sampled with \theta', A^\theta(s_t, a_t) is replaced by A^{\theta'}(s_t, a_t). The state ratio p_\theta(s_t)/p_{\theta'}(s_t) is assumed to be close to 1 and is dropped.

Using \nabla f(x) = f(x)\, \nabla \log f(x), this gradient corresponds to optimizing the objective

J^{\theta'}(\theta) = E_{(s_t,a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\, A^{\theta'}(s_t, a_t)\Big]

When do we stop updating \theta with the data from \theta'?

Page 24:

Add Constraint (advance steadily, one solid step at a time)

Page 25:

PPO / TRPO

Proximal Policy Optimization (PPO):

J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, KL(\theta, \theta')

J^{\theta'}(\theta) = E_{(s_t,a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\, A^{\theta'}(s_t, a_t)\Big]

TRPO (Trust Region Policy Optimization) uses the same objective, but places the KL term in a hard constraint instead of the objective:

J_{TRPO}^{\theta'}(\theta) = E_{(s_t,a_t) \sim \pi_{\theta'}}\Big[\frac{p_\theta(a_t|s_t)}{p_{\theta'}(a_t|s_t)}\, A^{\theta'}(s_t, a_t)\Big], \qquad KL(\theta, \theta') < \delta

\theta cannot be very different from \theta'. KL(\theta, \theta') here is a constraint on behavior, not on the parameters: it measures the distance between the action distributions of the two policies.

Page 26:

PPO algorithm

‱ Initialize policy parameters \theta^0.

‱ In each iteration:

  ‱ Use \theta^k to interact with the environment, collect {(s_t, a_t)}, and compute the advantage A^{\theta^k}(s_t, a_t).

  ‱ Find \theta optimizing J_{PPO}(\theta), updating the parameters several times:

    J_{PPO}^{\theta^k}(\theta) = J^{\theta^k}(\theta) - \beta\, KL(\theta, \theta^k), \qquad J^{\theta^k}(\theta) \approx \sum_{(s_t,a_t)} \frac{p_\theta(a_t|s_t)}{p_{\theta^k}(a_t|s_t)}\, A^{\theta^k}(s_t, a_t)

  ‱ Adaptive KL penalty: if KL(\theta, \theta^k) > KL_{max}, increase \beta; if KL(\theta, \theta^k) < KL_{min}, decrease \beta.
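A rough sketch of one iteration of this update with the adaptive KL penalty (the hyper-parameters, the categorical-policy setup from the earlier sketches, and the already-computed advantages are assumptions; it illustrates the procedure rather than reproducing the lecture's code):

```python
import torch

def ppo_kl_update(policy, old_policy, observations, actions, advantages,
                  beta, kl_min=0.003, kl_max=0.03, epochs=5, lr=1e-3):
    """Maximize J(theta) - beta * KL(theta, theta_k) on one batch collected with old_policy."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    with torch.no_grad():
        old_dist = torch.distributions.Categorical(old_policy(observations))
        old_log_p = old_dist.log_prob(actions)
    for _ in range(epochs):                                  # update parameters several times
        dist = torch.distributions.Categorical(policy(observations))
        ratio = (dist.log_prob(actions) - old_log_p).exp()   # p_theta(a|s) / p_theta_k(a|s)
        kl = torch.distributions.kl_divergence(old_dist, dist).mean()
        loss = -(ratio * advantages).mean() + beta * kl      # minimize the negative of J_PPO
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Adaptive KL penalty: adjust beta for the next iteration
    if kl.item() > kl_max:
        beta *= 2.0
    elif kl.item() < kl_min:
        beta *= 0.5
    return beta
```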

Page 27:

đœđ‘ƒđ‘ƒđ‘‚đœƒđ‘˜ 𝜃 = đœđœƒ

𝑘𝜃 − đ›œđŸđż 𝜃, 𝜃𝑘

đœđœƒđ‘˜đœƒ ≈

𝑠𝑡,𝑎𝑡

𝑝𝜃 𝑎𝑡|𝑠𝑡𝑝𝜃𝑘 𝑎𝑡|𝑠𝑡

𝐮𝜃𝑘𝑠𝑡 , 𝑎𝑡

PPO algorithm

PPO2 algorithm

đœđ‘ƒđ‘ƒđ‘‚2𝜃𝑘 𝜃 ≈

𝑠𝑡,𝑎𝑡

𝑚𝑖𝑛𝑝𝜃 𝑎𝑡|𝑠𝑡𝑝𝜃𝑘 𝑎𝑡|𝑠𝑡

𝐮𝜃𝑘𝑠𝑡 , 𝑎𝑡 , 𝑐𝑙𝑖𝑝

𝑝𝜃 𝑎𝑡|𝑠𝑡𝑝𝜃𝑘 𝑎𝑡|𝑠𝑡

, 1 − 휀

≈

𝑠𝑡,𝑎𝑡

𝑚𝑖𝑛 𝑐𝑙𝑖𝑝𝑝𝜃 𝑎𝑡|𝑠𝑡𝑝𝜃𝑘 𝑎𝑡|𝑠𝑡

, 1 − 휀, 1 + 휀 𝐮𝜃𝑘𝑠𝑡 , 𝑎𝑡

𝑝𝜃 𝑎𝑡|𝑠𝑡𝑝𝜃𝑘 𝑎𝑡|𝑠𝑡1 1 + 휀1 − 휀

1

1 + 휀

1 − 휀
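A minimal sketch of the PPO2 clipped objective written as a loss to minimize (the shapes, the categorical policy, and \varepsilon = 0.2 are assumptions continuing the earlier sketches):

```python
import torch

def ppo2_loss(policy, old_log_p, observations, actions, advantages, eps=0.2):
    """Negative clipped surrogate objective J_PPO2 for one batch.

    old_log_p: log p_theta_k(a_t | s_t), stored when the data was collected with theta_k.
    eps: the clip range epsilon (0.2 is a common choice, assumed here).
    """
    dist = torch.distributions.Categorical(policy(observations))
    ratio = (dist.log_prob(actions) - old_log_p).exp()            # p_theta / p_theta_k
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # maximize the min of the two terms
```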

Page 28:

đœđ‘ƒđ‘ƒđ‘‚đœƒđ‘˜ 𝜃 = đœđœƒ

𝑘𝜃 − đ›œđŸđż 𝜃, 𝜃𝑘

đœđœƒđ‘˜đœƒ ≈

𝑠𝑡,𝑎𝑡

𝑝𝜃 𝑎𝑡|𝑠𝑡𝑝𝜃𝑘 𝑎𝑡|𝑠𝑡

𝐮𝜃𝑘𝑠𝑡 , 𝑎𝑡

PPO algorithm

PPO2 algorithm

đœđ‘ƒđ‘ƒđ‘‚2𝜃𝑘 𝜃 ≈

𝑠𝑡,𝑎𝑡

𝑚𝑖𝑛𝑝𝜃 𝑎𝑡|𝑠𝑡𝑝𝜃𝑘 𝑎𝑡|𝑠𝑡

𝐮𝜃𝑘𝑠𝑡 , 𝑎𝑡 , 𝑐𝑙𝑖𝑝

𝑝𝜃 𝑎𝑡|𝑠𝑡𝑝𝜃𝑘 𝑎𝑡|𝑠𝑡

, 1 − 휀

≈

𝑠𝑡,𝑎𝑡

𝑚𝑖𝑛 𝑐𝑙𝑖𝑝𝑝𝜃 𝑎𝑡|𝑠𝑡𝑝𝜃𝑘 𝑎𝑡|𝑠𝑡

, 1 − 휀, 1 + 휀 𝐮𝜃𝑘𝑠𝑡 , 𝑎𝑡

1 1 + 휀1 − 휀

1

1 + 휀

1 − 휀

1 1 + 휀1 − 휀

1

1 + 휀

1 − 휀

𝐮 > 0 𝐮 < 0

Page 29:

Experimental Results

https://arxiv.org/abs/1707.06347