Exploration of Reinforcement Learning to SNAKE
Bowei Ma, Meng Tang, Jun Zhang
CS229 Machine Learning, Stanford University

Abstract
Reinforcement learning is an essential way to address problems where there is no single correct solution. In this project, we combined a convolutional neural network with reinforcement learning to train an agent to play the game Snake. The challenge is that the state space is extremely large, since the position of the snake directly affects the training results while changing all the time. By training the agent in a reduced state space, we show comparisons among different reinforcement learning algorithms and an approximated optimal solution.

Evaluation
1. Learning Curves

[Figure 1: Q-learning learning curve]    [Figure 2: SARSA learning curve]

• The learning curves from Q-learning (Figure 1) and SARSA (Figure 2) are shown above, respectively.

• Finding: Q-learning improves its performance within fewer trials, but in the long run the performance is not guaranteed to keep improving.

2. Performance Comparison

[Figure 3 and Figure 4: average scores achieved under a 1-minute time limit]

• Figures 3 & 4 display the average scores achieved within a 1-minute time limit by the agent using the approximated optimal algorithm, the agent trained with SARSA, and the agent trained with Q-learning, respectively.

• Finding: The agent trained with SARSA performed significantly better than the one trained with Q-learning. After 1e+06 iterations, the former achieves roughly 80 percent of the performance of the approximated optimal agent.

# Iterations    Opt       SARSA     Q-L
1e+04           77.504    36.858    18.023
1e+05           77.504    51.994    25.789
1e+06           77.504    61.830    36.567

Observations & Analysis

• Using a quadrant mapping of the state space, the size of the state space is dramatically reduced and learning is accelerated (one possible reduced encoding is sketched after this list).

• SARSA seems to outperform Q-learning in the long run, but it also shows a more sluggish learning curve.

• Q-learning tends to reinforce its own (possibly overestimated) Q-values even with a small learning rate, which leads to considerably volatile performance.
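The poster does not specify the exact quadrant mapping, so the following is only a minimal sketch of one plausible reduced state encoding: the food's position relative to the head is collapsed into one of four quadrants, plus immediate-danger flags for the neighboring cells. The function name encode_state and the grid representation are illustrative assumptions, not the authors' implementation.

    # Hypothetical reduced state encoding (an assumption, not the authors' exact mapping):
    # quadrant of the food relative to the head + danger flags for the four adjacent cells.

    def encode_state(head, food, blocked):
        """head, food: (x, y) grid cells; blocked: set of cells occupied by wall or snake body."""
        dx, dy = food[0] - head[0], food[1] - head[1]
        # Which quadrant the food lies in relative to the head (4 possibilities).
        quadrant = (dx >= 0, dy >= 0)
        # Immediate danger in each of the four grid directions.
        danger = tuple(
            (head[0] + mx, head[1] + my) in blocked
            for mx, my in ((1, 0), (-1, 0), (0, 1), (0, -1))
        )
        return quadrant + danger  # small, discrete state -> tractable Q-table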

Discussion
• Even with a decreasing exploration probability, Q-learning is not stable (a sketch of such a decaying epsilon schedule follows this list).
• Other approximations of the state space should be explored for better performance.
• Various tuning parameters should be tested to improve the probability of convergence for Q-learning.
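As a concrete illustration of a decreasing exploration probability, here is a minimal epsilon-greedy schedule; the exponential decay form and all constants are assumptions for illustration rather than the settings used in this project.

    import math
    import random

    # Hypothetical epsilon-greedy selection with a decaying exploration probability.
    # Q is assumed to be a mapping such as collections.defaultdict(float).

    def select_action(Q, state, actions, step, eps_start=1.0, eps_end=0.05, decay=1e-5):
        epsilon = eps_end + (eps_start - eps_end) * math.exp(-decay * step)
        if random.random() < epsilon:
            return random.choice(actions)                     # explore
        return max(actions, key=lambda a: Q[(state, a)])      # exploit current estimate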

Methods
1. Mathematical Analysis
• Playing the Snake game amounts to finding a self-avoiding walk (SAW) in R^2.
• Picking the shortest SAW at each step is an NP-hard problem.

2. Approximated Optimal Solution
• Find the shortest path from the head to the food.
• Only take it if a path from the head to the tail is guaranteed to exist after absorbing the food.
• Otherwise, follow the longest path from the head to the tail (a path-finding sketch follows this list).
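Below is a minimal sketch of the approximated optimal policy described above, using BFS for the shortest path. The helper names, the grid representation, and the fallback (any head-to-tail path rather than the longest one used on the poster) are simplifying assumptions, not the authors' code.

    from collections import deque

    # Sketch of the approximated optimal policy: chase the food only when it is safe,
    # otherwise stall along a head-to-tail path. Helper names are hypothetical.

    def bfs_shortest_path(start, goal, blocked, width, height):
        """Breadth-first search on the grid; returns a list of cells or None."""
        queue, parent = deque([start]), {start: None}
        while queue:
            cell = queue.popleft()
            if cell == goal:
                path = []
                while cell is not None:
                    path.append(cell)
                    cell = parent[cell]
                return path[::-1]
            x, y = cell
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                        and nxt not in blocked and nxt not in parent):
                    parent[nxt] = cell
                    queue.append(nxt)
        return None

    def choose_move(snake, food, width, height):
        """snake: list of cells, head first; returns the next cell to move to."""
        head, tail = snake[0], snake[-1]
        blocked = set(snake[:-1])              # the tail cell frees up as the snake moves
        to_food = bfs_shortest_path(head, food, blocked, width, height)
        if to_food and len(to_food) > 1:
            # Loosely simulate eating the food, then check that a head-to-tail path survives.
            new_head = to_food[1]
            if bfs_shortest_path(new_head, tail, blocked | {new_head}, width, height):
                return new_head
        # Fallback: stall by heading toward the tail (the poster uses the longest such path).
        to_tail = bfs_shortest_path(head, tail, blocked, width, height)
        return to_tail[1] if to_tail and len(to_tail) > 1 else head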

3. Q-learning
• Train the agent to learn an optimal policy from its history of interaction with the environment; the history is a sequence of state-action-reward tuples.
• Off-policy learning: use the Bellman equation as an iterative update (a tabular sketch follows below).
• Use a neural network to approximate the value of the Q-function.
• Reward scheme:

State     eat food   hit wall   hit snake   else
Reward    +500       -100       -100        -10
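A minimal tabular sketch of the off-policy Q-learning update described above; the learning rate, discount factor, exploration rate, and the env interface (reset/step/actions) are illustrative assumptions. The poster's neural-network approximation of the Q-function is not shown here.

    import random
    from collections import defaultdict

    # Tabular Q-learning sketch. ALPHA, GAMMA, EPSILON and the environment
    # interface are illustrative assumptions.
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
    Q = defaultdict(float)                    # Q[(state, action)] -> estimated return

    def q_learning_episode(env):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy.
            if random.random() < EPSILON:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)  # reward: +500 food, -100 wall/snake, -10 else
            # Off-policy Bellman backup: bootstrap from the greedy (max) successor value.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
            state = next_state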

4. SARSA (State-Action-Reward-State-Action)
• On-policy learning: during exploration, the agent iteratively approximates the value of its current policy and takes actions following that policy (sketched below).
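For contrast with the Q-learning sketch above, here is a minimal on-policy SARSA sketch; it bootstraps from the action the policy actually takes next rather than from the greedy maximum. It reuses the Q table, the random import, and the assumed env interface from the previous sketch.

    def sarsa_episode(env, alpha=0.1, gamma=0.9, epsilon=0.1):
        # On-policy SARSA sketch; hyperparameters are illustrative assumptions.
        def policy(state):
            if random.random() < epsilon:
                return random.choice(env.actions)
            return max(env.actions, key=lambda a: Q[(state, a)])

        state = env.reset()
        action = policy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = policy(next_state)
            # On-policy backup: bootstrap from the action the policy will actually take next.
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action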

Related Work


1. Risto Miikkulainen, Bobby Bryant, Ryan Cornelius, Igor Karpov, Kenneth Stanley, and Chern Han Yong. 2006. Computational Intelligence in Games.

2. Giovanni Viglietta. 2013. Gaming is a hard job, but someone has to do it!

