Lecture 4: Q-learning (table) - hunkim.github.io/ml/RL/rl04.pdf · 2017-10-02
TRANSCRIPT
[Page 1]
Lecture 4: Q-learning (table)
Exploit & exploration and discounted future reward
Reinforcement Learning with TensorFlow & OpenAI Gym
Sung Kim <[email protected]>
[Page 2]
Dummy Q-learning algorithm
[Page 3]
Learning Q(s, a)?
[Page 4]
Learning Q(s, a) Table: one success!
[Figure: grid-world Q-table after one successful episode]
[Page 5]
Exploit VS Exploration
http://home.deib.polimi.it/restelli/MyWebSite/pdf/rl5.pdf
[Page 6]
Exploit VS Exploration
Q-values: 0.0, 0.0, 0.0, 0.0, 0.0
[Page 7]
Exploit (weekday) VS Exploration (weekend)
Q-values: 0.5, 0.6, 0.3, 0.2, 0.5
[Page 8]
Exploit VS Exploration: ε-greedy
e = 0.1
if random() < e: a = random action
else: a = argmax(Q(s, a))
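The slide's pseudocode can be sketched as a small Python helper; the `epsilon_greedy` name and the `q_row` list of per-action values are illustrative assumptions, not from the slides.

```python
import random

def epsilon_greedy(q_row, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_row))            # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit
```

With the slide's five values [0.5, 0.6, 0.3, 0.2, 0.5] and epsilon = 0, the helper always exploits and returns index 1.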
[Page 9]
Exploit VS Exploration: decaying ε-greedy
for i in range(1000):
    e = 0.1 / (i + 1)
    if random() < e: a = random action
    else: a = argmax(Q(s, a))
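The decaying schedule above can be sketched as a function of the episode index (the name `decaying_epsilon` is an illustrative assumption):

```python
def decaying_epsilon(i, e0=0.1):
    # Slide's schedule: e = 0.1 / (i + 1), so exploration
    # is heavy in early episodes and fades as training progresses.
    return e0 / (i + 1)
```

For example, episode 0 uses e = 0.1, episode 9 uses e = 0.01, and by episode 999 exploration is almost off.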
[Page 10]
Exploit VS Exploration: add random noise
Q-values: 0.5, 0.6, 0.3, 0.2, 0.5
[Page 11]
Exploit VS Exploration: add random noise
Q-values: 0.5, 0.6, 0.3, 0.2, 0.5
a = argmax(Q(s, a) + random_values)
a = argmax([0.5, 0.6, 0.3, 0.2, 0.5] + [0.1, 0.2, 0.7, 0.3, 0.1])  # elementwise add
[Page 12]
Exploit VS Exploration: add random noise
Q-values: 0.5, 0.6, 0.3, 0.2, 0.5
for i in range(1000):
    a = argmax(Q(s, a) + random_values / (i + 1))
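The noise strategy can be sketched with NumPy; the `noisy_greedy` name and explicit `rng` argument are illustrative assumptions:

```python
import numpy as np

def noisy_greedy(q_row, i, rng):
    """Greedy choice over Q-values perturbed by random noise that
    shrinks with the episode index i, as in the slide's loop."""
    noise = rng.random(len(q_row)) / (i + 1)
    return int(np.argmax(np.asarray(q_row) + noise))
```

Early on (small i) the noise can flip the argmax toward any action; for very large i the noise is negligible and the choice is effectively greedy. Unlike ε-greedy, the noise still favors actions with higher Q-values even while exploring.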
[Page 13]
Exploit VS Exploration
[Page 14]
Dummy Q-learning algorithm
[Page 15]
Discounted future reward
[Page 16]
Learning Q(s, a) with discounted reward
Q(s, a) ← r + γ max Q(s′, a′)
[Page 17]
Discounted future reward
[Page 18]
Learning Q(s, a) with discounted reward
[Page 19]
Discounted reward (γ = 0.9)
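With γ = 0.9, a reward received k steps in the future is worth 0.9^k now, so shorter paths to the goal score higher. A small helper makes the computation concrete (the function name and reward lists are illustrative assumptions):

```python
def discounted_return(rewards, gamma=0.9):
    # R = r0 + gamma*r1 + gamma^2*r2 + ..., accumulated right-to-left
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

A reward of 1 reached after three steps is worth 0.9³ = 0.729, while the same reward reached after two steps is worth 0.81, which is why discounting makes the agent prefer the shorter path.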
[Page 20]
Q-learning algorithm
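A minimal sketch of the tabular update Q(s, a) ← r + γ max Q(s′, a′) on a toy deterministic chain. The chain environment is an assumption for illustration only; the lab uses OpenAI Gym's FrozenLake instead.

```python
import numpy as np

# Toy chain: states 0..4, action 0 = left, 1 = right (assumed setup).
# Reaching the last state gives reward 1 and ends the episode.
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9
Q = np.zeros((N_STATES, N_ACTIONS))

def step(s, a):
    """Deterministic transition; reward 1 only at the goal state."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

rng = np.random.default_rng(0)
for episode in range(200):
    s, done = 0, False
    while not done:
        a = int(rng.integers(N_ACTIONS))      # explore at random
        s2, r, done = step(s, a)
        # the lecture's update: Q(s,a) <- r + gamma * max_a' Q(s',a')
        Q[s, a] = r + GAMMA * np.max(Q[s2])
        s = s2
```

After training, the Q-values along the rightward path decay by a factor of γ per step back from the goal: Q[3,1] = 1, Q[2,1] = 0.9, Q[1,1] = 0.81, Q[0,1] = 0.729.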
[Page 21]
Q-Table Policy
Q(s, a)
(1) state, s
(2) action, a
(3) quality (reward) for the given action
    (e.g., LEFT: 0.5, RIGHT: 0.1, UP: 0.0, DOWN: 0.8)
[Figure: grid world annotated with state, action, and Q-values]
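As a sketch, the Q-table can be a 2-D array indexed by state and action, and the greedy policy just reads off the argmax per row. The action ordering below is an assumption for illustration, using the slide's example values:

```python
import numpy as np

# One row of the table: per-action quality scores for a single state.
ACTIONS = ["LEFT", "RIGHT", "UP", "DOWN"]
q_row = np.array([0.5, 0.1, 0.0, 0.8])

# Greedy policy: pick the action with the highest Q-value.
best_action = ACTIONS[int(np.argmax(q_row))]
```

Here the greedy policy chooses DOWN, since 0.8 is the largest entry.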
[Page 22]
Convergence
Q converges to the optimal values:
• In deterministic worlds
• With finitely many states
Machine Learning, Tom Mitchell, 1997
[Page 23]
Next
Lab: Q-learning Table