
Vrije Universiteit Brussel

Reinforcement Learning 101 with a Virtual Reality Game

Coppens, Youri; Bargiacchi, Eugenio; Nowe, Ann

Publication date: 2019

License: Unspecified

Document Version: Accepted author manuscript

Link to publication

Citation for published version (APA): Coppens, Y., Bargiacchi, E., & Nowe, A. (2019). Reinforcement Learning 101 with a Virtual Reality Game. Paper presented at 1st International Workshop on Education in Artificial Intelligence K-12, Cotai, Macao.

General rights
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners, and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain.
• You may freely distribute the URL identifying the publication in the public portal.

Take down policy
If you believe that this document breaches copyright, please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Download date: 11. Jul. 2020


Reinforcement Learning 101 with a Virtual Reality Game

Youri Coppens, Eugenio Bargiacchi and Ann Nowe
Vrije Universiteit Brussel

{yocoppen, ebargiac, anowe}@ai.vub.ac.be

Abstract

Our proof-of-concept demonstrates how Virtual Reality can be used to explain the basic concepts of Reinforcement Learning. This application visualizes the learning process of Watkins’ Q(λ), a fundamental algorithm in the field, in the form of an interactive treasure hunt game. A player takes the role of an autonomous agent, and must learn the shortest path to a hidden treasure through experience. The application also allows an audience to follow the game from an external display.

1 Introduction

Recent advances in technology have led people to come more frequently into contact with Artificial Intelligence (AI) in everyday systems, such as smartphones, vehicles, robots and online chatbots. Whilst such systems can be very powerful tools to improve quality of life, they also carry social, ethical and economic concerns [Yuste et al., 2017; Bryson and Winfield, 2017]. It is therefore important that users are educated and understand the systems they are interacting with, so that they can participate in public discussions on how to regulate AI in society.

One of the main techniques in AI to train autonomous agents is Reinforcement Learning (RL), where agents learn interactively how to behave within an environment in order to maximize a rewarding feedback signal through experience. The concept of learning from interaction can be intuitively understood due to its psychological roots [Nowe and Brys, 2016; Sutton and Barto, 2018]. It is an attractive subject to explain to laymen, given the recent breakthroughs in AI research realized using RL. These breakthroughs span several application domains such as robotics [Levine et al., 2016], chemistry [Zhou et al., 2017] and advertisement auctioning [Jin et al., 2018]. Another notable breakthrough is DeepMind’s AlphaGo system, capable of surpassing human professional Go-players [Silver et al., 2017].

We demonstrate a novel system to teach the basic concepts behind RL to general audiences, without the necessity for mathematical formulas or hands-on programming sessions. It is a serious game, i.e. a game that uses fun activities for an educational purpose [Bergeron, 2005].

We use Virtual Reality (VR) to put the playing user in the shoes of an autonomous agent with limited observations, demonstrating through direct experience how new knowledge is acquired and exploited by said agent. Immersive VR technology allows us to align the perspective of the user with the learning agent as much as possible, creating a sense of presence in the RL environment through the head-mounted display, with limited access to outside information [Freina and Ott, 2015]. Additionally, VR technology enhances learning in K-12 and higher education settings when students play individually [Merchant et al., 2014].

The player’s task is to find a treasure hidden in a grid-world maze which can be freely explored. All information collected via this exploration is fed to a Reinforcement Learning algorithm, Q(λ) [Watkins, 1989], which then displays the results of the learning back to the user via colors and numeric values. As more information is collected with each trial, the maze starts to show a color gradient towards the hidden treasure, which in turn helps the user to select the optimal direction to move.

As the player wanders through the maze, a moderator can explain the mechanism through which the Q-values are computed, and how the different parameters of the algorithm (i.e., learning rate, discount factor and trace decay rate) affect the way that rewards travel between different states. At the same time, bystanders can oversee the game via an external monitor, which also shows a perfect-information top view of the maze and the current player. This allows groups to interact and experience the learning process together and thus does not restrict the experience to a single VR player.

At an earlier stage, this demonstration was successfully used in an episode of the Flemish scientific TV program Universiteit van Vlaanderen to introduce Reinforcement Learning to a general audience.1 The intuitive nature of the game makes the introduced concepts easy to understand while at the same time leaving space for more in-depth explanations when desired.

2 Related work

Several tools are in the making to teach machine learning to K-12 students.

1 The episode (in Dutch) can be retrieved on YouTube: https://youtu.be/17I7gzlXEmo


Figure 1: Regular set-up of our demonstration. The player wears the VR headset and plays the game next to a projection screen for the other spectators. Bystanders can see the player’s point of view and, in addition, the audience has a top view of the world. A video illustrating our demonstration can be found at the following URL: https://youtu.be/sLJRiUBhQqM

IBM’s Machine Learning for Kids tool provides broad machine learning training using the visual programming language Scratch and contains two tutorial projects using RL. Besides introducing a young public to machine learning concepts, this tool also intends to introduce the users to programming and implementing machine learning algorithms. Similarly, Dalton Learning Lab is developing an educational machine learning toolkit using Scratch, called AI4children. Our system solely focuses on grasping the intuition behind the concept of RL through an immersive experience, rather than teaching children to program an RL algorithm.

3 Reinforcement Learning

Reinforcement learning tackles the problem of sequential decision-making within an environment, where an agent must act in order to maximize collected reward over time. This problem can be modeled through a Markov Decision Process (MDP). An MDP is defined by the tuple 〈S, A, T, R, γ〉, where S represents the state space, A the action space, T : S × A × S → [0, 1] the transition probability function, R : S × A → ℝ the reward function and γ ∈ [0, 1] the discount factor.

At each time step t, the agent observes the current state of the environment s_t and chooses an action a_t based on the probabilities in its current policy π_t : S × A → [0, 1]. After performing the action, the agent receives a reward r_t from the distribution R, and the state of the environment transitions to the next state s_{t+1} following the distribution T. The agent’s goal is to learn how to maximize the expected return E[R_t], where R_t = ∑_{i=0}^{∞} γ^i r_{t+i} is the sum of future discounted rewards. The discount factor γ regulates the agent’s preference for immediate rewards over long-term rewards. Lower values set the focus on immediate reward, whilst higher values result in a more balanced weighing of current and future reward in R_t.
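To make the role of γ concrete, the following short sketch (our own illustrative C#, not taken from the paper; the class and method names are hypothetical) computes the discounted return of a reward sequence that mimics the game’s single reward of 10 obtained after a few steps, for two values of γ.

using System;

class DiscountedReturnDemo
{
    // R_t = sum_i gamma^i * r_{t+i}, truncated to a finite reward sequence.
    static double Return(double[] rewards, double gamma)
    {
        double ret = 0.0, weight = 1.0;
        foreach (double r in rewards)
        {
            ret += weight * r;
            weight *= gamma;
        }
        return ret;
    }

    static void Main()
    {
        // A single reward of 10, received only after four intermediate steps.
        double[] rewards = { 0, 0, 0, 0, 10 };
        Console.WriteLine(Return(rewards, 0.5)); // 0.625 -> strong focus on immediate reward
        Console.WriteLine(Return(rewards, 0.9)); // 6.561 -> future reward still weighs in
    }
}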

When T and R are unknown to the agent, it must learn the optimal policy through interaction with the environment to gain experience.

Figure 2: Pop-up menu to modify Q(λ)’s parameters.

One approach to this is value-based RL, where the agent estimates E[R_t] by learning a value function, e.g. a Q-function Q(s, a) = E[R_t | s_t = s, a_t = a], representing the expected future reward from state s when executing action a. A behavior policy then chooses actions based on the estimated Q-values in each state. The most common strategy is ε-greedy, where a probability ε ∈ [0, 1] determines in each step whether a random action is selected or the action with the currently highest Q-value. Another common strategy is to sample the action from a Boltzmann distribution over the Q-values. For the sake of interactivity in the maze game, these automated action selection strategies have been omitted. Instead, we let the player select actions freely.
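For reference, an ε-greedy selection rule over a tabular Q-function takes only a few lines. The sketch below is our own illustration in C# (class and parameter names are hypothetical) and not part of the paper’s implementation, which, as noted, lets the player choose the actions.

using System;

static class EpsilonGreedy
{
    static readonly Random Rng = new Random();

    // With probability epsilon pick a random action (explore),
    // otherwise pick the action with the highest Q-value (exploit).
    public static int Select(double[,] q, int state, int numActions, double epsilon)
    {
        if (Rng.NextDouble() < epsilon)
            return Rng.Next(numActions);

        int best = 0;
        for (int a = 1; a < numActions; a++)
            if (q[state, a] > q[state, best])
                best = a;
        return best;
    }
}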

We focus on Watkins’ Q(λ) algorithm [Watkins, 1989] due to its relative simplicity and the fact that it forms the fundamental basis for contemporary RL algorithms. Q(λ) learns the value function in a tabular fashion by maintaining for each state-action pair a Q-value Q(s, a) and an eligibility trace e(s, a), both initially set to 0. For each step the agent takes, e(s_t, a_t) is set to 1 and the agent gains an experience sample (s_t, a_t, r_t, s_{t+1}), from which a temporal difference (TD) error is calculated: δ_t = r_t + γ max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t). The agent then updates every Q-value by weighing the TD error by the respective eligibility trace and a learning rate α ∈ (0, 1], i.e. it adds α · δ_t · e(s, a) to Q(s, a). The learning rate allows for incremental updates of the Q-function, which in turn allows for better approximations in stochastic environments: the higher the randomness, the lower this parameter should be set for optimal learning. The eligibility trace is used to assign credit for rewards to past interactions with the environment, which speeds up the learning process. The higher the trace for a specific state-action pair, the larger the magnitude of its update. Once the Q-values are updated, the eligibility traces are decayed exponentially with a parameter λ ∈ [0, 1], as interactions farther away in time are less likely to be directly responsible for new rewards.
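The update described above can be condensed into a short tabular sketch. The C# below is our own illustration of a single Q(λ) learning step over integer state and action indices (class and member names are hypothetical; this is not the paper’s Unity code). Note that the full Watkins’ variant also cuts the traces after a non-greedy action, a detail that matters little here since the player chooses the actions.

using System;

class QLambdaAgent
{
    readonly double[,] q;   // Q(s, a)
    readonly double[,] e;   // eligibility trace e(s, a)
    readonly int numStates, numActions;
    readonly double alpha, gamma, lambda;

    public QLambdaAgent(int numStates, int numActions,
                        double alpha, double gamma, double lambda)
    {
        this.numStates = numStates;
        this.numActions = numActions;
        this.alpha = alpha;     // learning rate, in (0, 1]
        this.gamma = gamma;     // discount factor
        this.lambda = lambda;   // trace decay rate
        q = new double[numStates, numActions];
        e = new double[numStates, numActions];
    }

    // One learning step from the experience sample (s, a, r, sNext).
    public void Update(int s, int a, double r, int sNext)
    {
        // TD error: delta = r + gamma * max_a' Q(s', a') - Q(s, a)
        double maxNext = q[sNext, 0];
        for (int a2 = 1; a2 < numActions; a2++)
            maxNext = Math.Max(maxNext, q[sNext, a2]);
        double delta = r + gamma * maxNext - q[s, a];

        e[s, a] = 1.0;  // the visited pair becomes fully eligible

        // Apply the TD error to every pair, weighted by its trace,
        // then decay the traces (by gamma * lambda per step in Watkins' Q(lambda)).
        for (int si = 0; si < numStates; si++)
            for (int ai = 0; ai < numActions; ai++)
            {
                q[si, ai] += alpha * delta * e[si, ai];
                e[si, ai] *= gamma * lambda;
            }
    }
}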

4 Virtual Reality Maze Game

We present a VR treasure hunt game, designed to teach RL concepts to general audiences in an engaging way. The game puts the player in a foggy maze, with the task to find a hidden treasure. The fog is designed to restrict the player’s vision to that of an autonomous agent, namely its current position (state) and available actions. As the maze is featureless and with limited visibility, the player must rely on the information provided by the learning agent,


Figure 3: Available actions and respective Q-values in a cell of the maze, from the player’s point of view.

rather than trying to autonomously find the treasure. The treasure allows the player to intuitively grasp the concept of reward in a standard RL process. The user has complete control over the decision process, and can decide where to explore depending on the available information. An overview of the game’s physical set-up can be seen in Figure 1.

The player is paired with a Q(λ) learning agent, which computes Q-values as described in Section 3. Parameters of the algorithm can be adjusted on the fly within a game menu (Figure 2), allowing participants to explore their effects on the learning process. As the player explores, Q(λ) updates the Q-values for each state-action pair, and displays them on the ground (Figure 3). The highest Q-value for each state is visualized by shading the floor of each cell in green, with the shading proportional to the value: this allows the user to intuitively understand the idea of expected reward, and how values are discounted over time (Figure 4). Additionally, the eligibility traces of Q(λ) are shown to the user as a trail of floating arrows, with their size proportional to the value of the trace.
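Purely as an illustration of "shading proportional to the value", the intensity could be obtained by normalising a cell’s best Q-value against the current maximum over the maze. This is a hypothetical sketch (names and the normalisation rule are ours), not the game’s actual rendering code.

using System;

static class FloorShading
{
    // Green intensity in [0, 1] for a cell, relative to the largest Q-value
    // currently present anywhere in the maze (a hypothetical normalisation
    // chosen only for illustration).
    public static double GreenShade(double bestQInCell, double maxQInMaze) =>
        maxQInMaze <= 0.0 ? 0.0 : Math.Clamp(bestQInCell / maxQInMaze, 0.0, 1.0);
}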

The player obtains a reward by finding a treasure hidden within a chest in the maze. The maze contains multiple chests, which are only visible to the player from up close. Stepping into a cell that contains a chest causes it to open, reveal its contents and provide its reward to Q(λ). Two types of chests exist: treasure chests and empty chests. Treasure chests result in a reward of 10. Only a single treasure is present, in a fixed location in the environment, which also marks the goal state of the maze. Empty chests, on the other hand, do not give any reward, making them indistinguishable from empty cells for Q(λ). These were introduced after preliminary testing, as users would not willingly explore cells which looked empty. The empty chests simulate the fact that an autonomous agent usually has no way to know in advance whether it is advantageous to explore a certain state, and thus force the user to explore.
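In code, this sparse reward structure amounts to a one-line function. The enum and names below are a hypothetical sketch mirroring the description above, not the game’s actual implementation.

enum CellContent { Empty, EmptyChest, TreasureChest }

static class MazeReward
{
    // Only the single treasure chest yields reward (10); empty chests give 0,
    // making them indistinguishable from empty cells as far as Q(lambda) is concerned.
    public static double Reward(CellContent cell) =>
        cell == CellContent.TreasureChest ? 10.0 : 0.0;
}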

Bystanders can see the player’s point of view on an external monitor. Additionally, the audience has a top view of the world (see Figures 1 and 4) which the player has no access to. Having a complete overview of the true state of the world helps to see the differences between the two perspectives and to realize why agents can have trouble completing tasks that seem trivial to a human.

Figure 4: Top view of the virtual maze. After finding the treasure, the agent receives a reward and updates the cell values, shown here in green shades: a brighter cell represents a higher value. Note the arrows for the traces on the path the player has followed.

Our demonstration has the potential to educate the broad spectrum of K-12 pupils on the dynamics of Reinforcement Learning. To ensure the game progresses sufficiently and to keep the spectating audience involved, a moderator can direct the demonstration. The moderator can enhance the user experience by explaining the game’s purpose and the mechanisms of Q(λ) at a level adapted to the present audience. For younger kids, the attention can be put on the game’s visual aspects by relating them to familiar concepts. For instance, an analogy can be drawn between the shrinking eligibility trace arrows and the trail of bread crumbs from the tale of ‘Hansel and Gretel’, which also vanished over time as the crumbs were eaten by birds. With a more mature audience, on the other hand, the moderator can shift the focus to the way Q-values are updated and how the specific parameters influence the agent’s learning progress.

A typical session develops as follows: initially, the player wanders in the maze, opening each chest found in the hope of obtaining the treasure. This phase tends to last the longest, as the player has no idea of where to go, and helps convey the idea that learning a new task can be very hard at the beginning, as an RL agent does not know how to reach its goal. If the player truly has trouble finding the treasure, the moderator can provide hints to advance the game.

Once the treasure has been found, Q(λ) will propagate the newly received reward back with the help of the eligibility traces, ensuring that part of the executed path in the maze now contains information useful to the player to find the treasure again. The player is then transported to a new random place in the maze and will eventually return to a previously visited place containing this new information. This is a proper moment for the moderator to better explain what the Q-values mean, and how they increase in magnitude when the player once again makes their way to the treasure.

One can mention the strength of Reinforcement Learning under the guise of ‘practice makes perfect’. As the task is repeated, it will take less time for the player to enter a part of the maze which has been visited before and thus contains updated Q-values. Players tend to start relying on finding Q-values rather than looking for treasure chests. Smart users may purposefully try to expand the area of the maze that is covered by useful Q-values by repeatedly traveling between low and high valued states. The player should now be rather independent, so the moderator can focus on the audience, answering questions and explaining topics in more detail.

Technical Details The demonstration was developed in C# using the Unity3D engine, the SteamVR plugin and the VRTK software framework (https://github.com/ExtendRealityLtd/VRTK). The user plays the game through an HTC Vive, a consumer-grade VR system.

5 Conclusions and Future Work

We have demonstrated a Virtual Reality game illustrating the basic concepts and dynamics of a tabular Reinforcement Learning process. The game puts the user in the shoes of a learning agent with limited access to information, and teaches how RL can find solutions to arbitrary problems. Apart from the positive informal feedback gathered during several public demonstrations at fairs and tech showcases, we have not yet performed an experimental evaluation of the effectiveness of our tool, although this is intended future work. The maze environment was inspired by the standard grid-world problem; however, other goal-based problems could be considered to illustrate the diverse applicability of RL. In addition, other fundamental RL algorithms, such as SARSA [Rummery and Niranjan, 1994] or REINFORCE [Williams, 1992], could be inserted to highlight different approaches. To a further extent, multi-player VR games could be developed to illustrate multi-agent problems in a similar fashion as the single-agent case from this work.

References

[Bergeron, 2005] Bryan P. Bergeron. Developing Serious Games. Charles River Media game development series. Charles River Media, Hingham, Massachusetts, 2005.

[Bryson and Winfield, 2017] Joanna Bryson and Alan Winfield. Standardizing ethical design for artificial intelligence and autonomous systems. Computer, 50(5):116–119, 2017.

[Freina and Ott, 2015] Laura Freina and Michela Ott. A literature review on immersive virtual reality in education: state of the art and perspectives. In Ion Roceanu, Florica Moldoveanu, Stefan Trausan-Matu, Dragos Barbieru, Daniel Beligan, and Angela Ionita, editors, Proceedings of the 11th International Scientific Conference ”eLearning and Software for Education”, volume 1, pages 133–141. Carol I NDU Publishing House, 2015.

[Jin et al., 2018] Junqi Jin, Chengru Song, Han Li, Kun Gai, Jun Wang, and Weinan Zhang. Real-Time Bidding with Multi-Agent Reinforcement Learning in Display Advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 2193–2201, New York, New York, USA, 2018. ACM Press.

[Levine et al., 2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.

[Merchant et al., 2014] Zahira Merchant, Ernest T. Goetz, Lauren Cifuentes, Wendy Keeney-Kennicutt, and Trina J. Davis. Effectiveness of virtual reality-based instruction on students’ learning outcomes in K-12 and higher education: A meta-analysis. Computers & Education, 70:29–40, January 2014.

[Nowe and Brys, 2016] Ann Nowe and Tim Brys. A Gentle Introduction to Reinforcement Learning. In Steven Schockaert and Pierre Senellart, editors, Scalable Uncertainty Management, volume 9858 of Lecture Notes in Computer Science, pages 18–32, Nice, France, 2016. Springer, Cham.

[Rummery and Niranjan, 1994] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F/INFENG/TR 166, Cambridge University, Engineering Department, Cambridge, United Kingdom, 1994.

[Silver et al., 2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, October 2017.

[Sutton and Barto, 2018] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, Cambridge, Massachusetts, USA, 2nd edition, 2018.

[Watkins, 1989] Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, United Kingdom, May 1989.

[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[Yuste et al., 2017] Rafael Yuste, Sara Goering, Blaise Aguera y Arcas, Guoqiang Bi, Jose M. Carmena, Adrian Carter, Joseph Fins, Phoebe Friesen, Jack Gallant, Jane Huggins, Judy Illes, Philipp Kellmeyer, Eran Klein, Adam Marblestone, Christine Mitchell, Erik Parens, Michelle Pham, Khara Ramos, Karen Rommelfanger, and Jonathan Wolpaw. Four ethical priorities for neurotechnologies and AI. Nature, 551(7679):159–163, November 2017.

[Zhou et al., 2017] Zhenpeng Zhou, Xiaocheng Li, and Richard N. Zare. Optimizing Chemical Reactions with Deep Reinforcement Learning. ACS Central Science, 3(12):1337–1344, December 2017.