

IR-VIC: Unsupervised Discovery of Sub-goals for Transfer in RL

Nirbhay Modhe¹, Prithvijit Chattopadhyay¹, Mohit Sharma¹, Abhishek Das¹, Devi Parikh¹,², Dhruv Batra¹,² and Ramakrishna Vedantam²

¹Georgia Institute of Technology   ²Facebook AI Research

{nirbhaym,prithvijit3,mohit.sharma,abhshkdz,parikh,dbatra}@gatech.edu, [email protected]

Abstract

We propose a novel framework to identify sub-goals useful for exploration in sequential decision making tasks under partial observability. We utilize the variational intrinsic control (VIC) framework (Gregor et al., 2016), which maximizes empowerment – the ability to reliably reach a diverse set of states – and show how to identify sub-goals as states with high necessary option information through an information-theoretic regularizer. Despite being discovered without explicit goal supervision, our sub-goals provide better exploration and sample complexity on challenging grid-world navigation tasks compared to supervised counterparts in prior work.

1 Introduction

A common approach in reinforcement learning (RL) is to decompose an original decision making problem into a set of simpler decision making problems – each terminating in an identified sub-goal. Beyond such a decomposition or abstraction being evident in humans (e.g. adding salt is a sub-goal in the process of cooking a dish) [Hayes-Roth and Hayes-Roth, 1979], sub-goal identification is also useful from a practical perspective of constructing policies that transfer to novel tasks (e.g. adding salt is a useful sub-goal across a large number of dishes one might want to cook, corresponding to different 'end' goals).

However, identifying sub-goals that can accelerate learning while also being re-usable across tasks or environments is a challenge in itself. Constructing such sub-goals often requires knowledge of the task structure (supervision) and may fail in cases where 1) dense rewards are absent [Pathak et al., 2017], 2) rewards require extensive hand engineering and domain knowledge (hard to scale), or 3) the notion of reward may not be obvious [Lillicrap et al., 2015]. In this work, we demonstrate a method for identifying sub-goals in an "unsupervised" manner – without any external rewards or goals. We show that our sub-goals generalise to novel partially observed environments and goal-driven tasks, leading to comparable (or better) performance (via better exploration) on downstream tasks compared to prior work on goal-driven sub-goals [Goyal et al., 2019].

Figure 1: Left: The VIC framework [Gregor et al., 2016] in a navigation context: an agent learns high-level macro-actions (or options) to reach different states in an environment reliably without any extrinsic reward. Right: IR-VIC identifies sub-goals as states where necessary option information is high (darker shades of red) for an empowered agent. Identification of unsupervised sub-goals leads to improved transfer to novel environments.

We study sub-goals in the framework of quantifying the minimum information necessary for an agent to take actions. [van Dijk and Polani, 2011] have shown that in the presence of an external goal, the minimum goal information required by an agent for taking an action is a useful measure of sub-goal states. [Goyal et al., 2019] demonstrate that for action A, state S and a goal G, such sub-goals can be efficiently learnt by imposing a bottleneck on the information I(A, G|S). We show that replacing the goal with an intrinsic objective admits a strategy for discovering sub-goals in a completely unsupervised manner.

Our choice of intrinsic objective is the Variational Intrinsic Control (VIC) formulation [Gregor et al., 2016], which learns options Ω that maximize the mutual information I(Sf, Ω), referred to as empowerment, where Sf is the final state in a trajectory [Salge et al., 2013]. To see why this maximizes empowerment, notice that I(Sf, Ω) = H(Sf) − H(Sf|Ω), where H(·) denotes entropy. Thus, empowerment maximizes the diversity in final states Sf while learning options highly predictive of Sf. We demonstrate that by limiting the information the agent uses about the selected option Ω while maximizing empowerment, a sparse set of states emerges where the necessary option information I(Ω, A|S) is high – we interpret these states as our unsupervised sub-goals. We call our approach Information Regularized VIC (IR-VIC). Although IR-VIC is similar in spirit to [Goyal et al., 2019; Polani et al., 2006], it is important to note that we use latent options Ω instead of external goals – removing any dependence on the task structure.


To summarize our contributions:

• We propose Information Regularized VIC (IR-VIC), a novel framework to identify sub-goals in a task-agnostic manner, by regularizing relevant option information.

• Theoretically, we show that the proposed objective is a sandwich bound on the empowerment I(Ω, Sf) – this is the only useful upper bound we are aware of.

• We show that our sub-goals are transferable and lead to improved sample-efficiency on goal-driven tasks in novel, partially-observable environments. On a challenging grid-world navigation task, our method outperforms (a re-implementation of) [Goyal et al., 2019].

2 Methods

2.1 Notation
We consider a Partially Observable Markov Decision Process (POMDP), defined by the tuple (S, X, A, P, r), where s ∈ S is the state, x ∈ X is the partial observation of the state, and a ∈ A is an action from a discrete action space. P : S × S × A denotes an unknown transition function, representing p(st+1|st, at) with st, st+1 ∈ S and at ∈ A. Both VIC and IR-VIC initially train an option (Ω) conditioned policy π(at|ω, xt), where ω ∈ {1, · · · , |Ω|}. During transfer, all approaches (including baselines) train a goal-conditioned policy π(at|xt, gt), where gt is the goal information at time t. Following standard practice [Cover and Thomas, 1991], we denote random variables in uppercase (Ω) and items from the sample space of random variables in lowercase (ω).

2.2 Variational Intrinsic Control (VIC)
VIC maximizes the mutual information between options Ω and the final (option termination) state Sf given s0, i.e. I(Sf, Ω | S0 = s0), which encourages the agent to learn options that reliably reach a diverse set of states. This objective is estimated by sampling an option from a prior at the start of a trajectory, following it until termination, and inferring the sampled option given the final and initial states. Informally, VIC maximizes the empowerment of an agent, i.e. its internal options Ω have a high degree of correspondence to the states of the world Sf that it can reach. VIC formulates a variational lower bound on this mutual information. Specifically, let p(ω | s0) = p(ω) be a prior on options (we keep the prior fixed as per [Eysenbach et al., 2018]), let pJ(sf | ω, s0) be the (unknown) terminal state distribution achieved when executing the policy π(at | ω, st), and let qν(ω | sf, s0) denote a (parameterized) variational approximation to the true posterior on options given Sf and S0. Then:

I(Ω, Sf | S0 = s0)  ≥  E_{Ω∼p(ω), Sf∼pJ(sf | Ω, S0=s0)} [ log ( qν(Ω | Sf, S0 = s0) / p(Ω) ) ]  =  JVIC(Ω, Sf; s0)    (1)
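To make this concrete, the sketch below shows one way the lower bound in Eq. (1) could be estimated from sampled rollouts with a categorical option posterior. It is a minimal sketch: the network sizes, tensor shapes and function names are our own illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class OptionPosterior(nn.Module):
    """Variational posterior q_nu(omega | s_f, s_0) for the Eq. (1) bound.
    Hypothetical architecture: an MLP over the concatenated start and final states."""
    def __init__(self, state_dim, num_options, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_options))

    def forward(self, s_final, s_start):
        return torch.distributions.Categorical(
            logits=self.net(torch.cat([s_final, s_start], dim=-1)))

def vic_lower_bound(posterior, prior_logits, omega, s_final, s_start):
    """Monte-Carlo estimate of J_VIC = E[log q_nu(omega | s_f, s_0) - log p(omega)]
    over a batch of sampled options and the final states their rollouts reach."""
    log_q = posterior(s_final, s_start).log_prob(omega)                           # [batch]
    log_p = torch.distributions.Categorical(logits=prior_logits).log_prob(omega)  # [batch]
    return (log_q - log_p).mean()
```

In VIC-style training this scalar is typically also fed back as an intrinsic reward for the option-conditioned policy, since the environment itself is not differentiable.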

Figure 2: Illustration of VIC for 2 timesteps. Left: Given a start state S0, VIC samples an option ω ∼ p(ω|s0), follows the policy π(at | Ω = ω, st), and infers Ω from the terminating state (S2), optimizing a lower bound on I(S2, Ω | S0). Right: IR-VIC considers a particular parameterization of π, namely π(at | ω, st) = ∫ p(zt | st, ω) p(at | st, zt) dzt, and imposes a bottleneck on I(At, Ω|St).

2.3 Information Regularized VIC (IR-VIC)
We identify sub-goals as states where the necessary option information required for deciding actions is high. Formally, this means that at every timestep t in the trajectory, we minimize the mutual information I(Ω, At|St, S0 = s), resulting in a sparse set of states where this mutual information remains high despite the minimization. Intuitively, this means that on average (across different options), these states carry higher relevant option information than other states (e.g. the regions with darker shades of red in Figure 1). Overall, our objective is to maximize:

JVIC(Ω, Sf; s0) − β Σt I(Ω, At | St, S0 = s0)    (2)

where β > 0 is a trade-off parameter. Intuitively, this says that we want options Ω which give the agent high empowerment while utilizing the least relevant option information at each step.

Interestingly, Equation 2 has a clear, principled interpretation in terms of the empowerment I(Ω, Sf|S0) from the VIC model. We state the following lemma (which follows from recursively applying the chain rule of mutual information and the data-processing inequality [Cover and Thomas, 1991]):

Lemma 2.1. Let At be the action random variable at timestep t and state St, following an option-conditioned policy π(at|st, ω). Then I(Ω, At|St, S0), i.e. the conditional mutual information between the option Ω and action At, when summed over all timesteps in the trajectory, upper bounds the conditional mutual information I(Ω, Sf|S0) between Ω and the final state Sf – namely the empowerment as defined by [Gregor et al., 2016]:

I(Ω, Sf | S0) ≤ Σ_{t=1}^{f} I(Ω, At | St, S0) = UDS(τ, Ω, S0)    (3)

Implications. With this lens, one can view the optimization problem in Equation 2 as a Lagrangian relaxation of the following constrained optimization problem:

max JVIC  s.t.  UDS ≤ R    (4)

where R > 0 is a constant. While upper bounding the empowerment does not directly imply one will find useful sub-goals (meaning it is the structure of the decomposition in eq. (3) that is more relevant than the fact that it is an upper bound), this bound might be of interest more generally for representation learning [Achiam et al., 2018; Gregor et al., 2016].


Targeting specific values for the upper bound R can potentially allow us to control how 'abstract' or invariant the latent option representation is relative to the states Sf, leading to solutions that, say, neglect unnecessary information in the state representation to allow better generalization. Note that most approaches currently limit the abstraction by constraining the number of discrete options, which (usually) imposes an upper bound on I(Ω, Sf) = H(Ω) − H(Ω|Sf), since H(Ω) ≥ H(Ω|Sf) and H ≥ 0 in the discrete case. However, this does not hold in the continuous case, where this result might be more useful. Investigating this is beyond the scope of the current paper, however, as our central aim is to identify useful sub-goals, not to scale the VIC framework to continuous options.
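As a quick worked example of the discrete-option cap mentioned above (our illustration, not a result from the paper), a uniform prior over K discrete options bounds the achievable empowerment at log K nats:

```latex
% Illustrative cap for K discrete options with a uniform prior p(\omega) = 1/K.
\begin{align*}
I(\Omega; S_f) &= H(\Omega) - H(\Omega \mid S_f) \;\le\; H(\Omega) \;=\; \log K, \\
K = 10 \;&\Rightarrow\; I(\Omega; S_f) \,\le\, \log 10 \,\approx\, 2.30 \text{ nats}.
\end{align*}
```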

2.4 Algorithmic Details

Upper Bounds for I(Ω, At | St, S0)

Inspired by InfoBot [Goyal et al., 2019], we bottleneck the information in a statistic Zt of the state St and option Ω used to parameterize the policy π(At | Ω, St) (fig. 2, right). This is justified by the data-processing inequality [Cover and Thomas, 1991] for the Markov chain (Ω, St) ↔ Zt ↔ At, which implies I(Ω, At | St, S0) ≤ I(Ω, Zt | St, S0). We can then obtain the following upper bound on I(Ω, Zt | St, S0):¹

I(Ω, Zt | St, S0 = s)  ≤  E_{Ω∼p(ω), St∼pJ(st | Ω, S0=s), Zt∼p(zt | St, Ω)} [ log ( p(Zt | Ω, St) / q(Zt) ) ]    (5)

where q(zt) is a fixed variational approximation (set to N(0, I) as in InfoBot), and pφ(zt | ω, st) is a parameterized encoder. As explained in section 1, the key difference between eq. (5) and InfoBot is that they construct upper bounds on I(G, At | St, S0) using information about the goal G, while we bottleneck the option information. One could use the DIAYN objective [Eysenbach et al., 2018] (see more below under related objectives), which also has an I(At, Ω|St) term, and directly bottleneck the action-option mutual information instead of eq. (5), but we found that directly imposing this bottleneck often hurt convergence in practice.

We can compute a Monte Carlo estimate of Equation 5 by first sampling an option ω at s0 and then keeping track of all states visited in the trajectory τ. In addition to the VIC term and our bottleneck regularizer, we also include the entropy of the policy over the actions (maximum-entropy RL [Ziebart et al., 2008]) as a bonus to encourage sufficient exploration. We fix the maximum-entropy coefficient α = 10⁻³, which consistently works well for our approach as well as the baselines.

¹ Similar to VIC, pJ here denotes the (unknown) state distribution at time t from which we can draw samples when we execute a policy. We then assume a variational approximation q(zt) (fixed to be a unit Gaussian) for p(zt|St). Using the fact that DKL(p(zt|st) || q(zt)) ≥ 0, we get the derived upper bound.

Overall, the IR-VIC objective is:

max_{θ,φ,ν} J(θ, φ, ν) = E_{Ω∼p(ω), τ∼π(·|ω,S0), Zt∼pφ(zt|St,Ω)} [ log ( qν(Ω | Sf, S0) / p(Ω) ) − Σ_{t=0}^{f−1} ( β log ( pφ(Zt | St, Ω) / q(Zt) ) + α log πθ(At | St, Zt) ) ]    (6)

where θ, φ and ν are the parameters of the policy, the latent variable decoder and the option inference network respectively. The first term in the objective promotes high empowerment while learning options; the second term ensures minimality in using the sampled options to take actions; and the third provides an incentive for exploration.
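For concreteness, the sketch below assembles a Monte-Carlo estimate of Eq. (6) from per-step log-probabilities collected during rollouts. It is a minimal sketch under our own assumptions (fixed-length rollouts, one option per trajectory, hypothetical tensor names); the actual training also has to route this quantity through a policy-gradient estimator such as A2C, which is omitted here.

```python
import torch

def irvic_objective(log_q_omega,   # log q_nu(omega | s_f, s_0) for the sampled option, shape [B]
                    log_p_omega,   # log p(omega) under the fixed option prior, shape [B]
                    log_p_z,       # log p_phi(z_t | s_t, omega) at the sampled z_t, shape [B, T]
                    log_q_z,       # log q(z_t) under the unit Gaussian, shape [B, T]
                    log_pi_a,      # log pi_theta(a_t | s_t, z_t) at the taken actions, shape [B, T]
                    beta=1e-2, alpha=1e-3):
    """Monte-Carlo estimate of the Eq. (6) objective, averaged over a batch of rollouts."""
    empowerment = log_q_omega - log_p_omega      # variational lower bound term on I(Omega, S_f | S_0)
    bottleneck = (log_p_z - log_q_z).sum(dim=1)  # per-step option-information penalty (Eq. 5 integrand)
    neg_entropy = log_pi_a.sum(dim=1)            # E[log pi] is a negative-entropy surrogate
    return (empowerment - beta * bottleneck - alpha * neg_entropy).mean()
```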

Related Objectives
DIAYN [Eysenbach et al., 2018] attempts to learn skills (similar to options) which can control the states visited by agents, while ensuring that all visited states, as opposed to only termination states, are used to distinguish skills. Thus, for an option Ω and every state St in a trajectory, they maximize Σt I(Ω, St) − I(At, Ω|St) + H(At|St), as opposed to I(Ω, Sf) − β Σt I(At, Ω|St) + H(At|St) in our objective. With the sum over all timesteps for I(Ω, St), the bound in Lemma 2.1 no longer holds, which also means that there is no principled reason (unlike in our model) to scale the second term with β. The most closely related work to ours is InfoBot [Goyal et al., 2019], which maximizes Σt R(t) − β I(Zt, G|St) for a goal (G) conditioned policy π(at|St, G). They define states where I(Zt, G|St) is high despite the bottleneck as "decision states". The key difference is that InfoBot requires extrinsic rewards in order to identify goal-conditioned decision states, while our work is strictly more general and scales in the absence of extrinsic rewards.

Further, in the context of both these works, our work provides a principled connection between action-option information regularization I(At, Ω|St) and the empowerment of an agent. The tools from Lemma 2.1 might be useful for analysing these previous objectives, both of which employ this technique.

2.5 Transfer to Goal-Driven Tasks
In order to transfer sub-goals to novel environments, [Goyal et al., 2019] pretrain their model to identify their goal-conditioned decision states, and then study whether identifying similar states in new environments can improve exploration when training a new policy πγ(a|s, g) from scratch. Given an environment with reward Re(t), goal G, κ > 0, and state visitation count c(St), their reward is:

Rt = Re(t) + (κ / √c(St)) · I(G, Zt | St)    (7)

where the I(G, Zt | St) term is pretrained and frozen.

The count-based reward decays with the square root of c(St) to encourage the model to explore novel states, and the mutual information between the goal G and the bottleneck variable Zt is a smooth measure of whether a state is a sub-goal; it is multiplied with the exploration bonus to encourage visitation of states where this measure is high.


Figure 3: Heatmaps of necessary option information I(Ω, Zt|St, S0) (normalized to [0, 1]) at visited states on two environments – 4-Room (top) and Maze (bottom). The first column depicts the environment layout, the second and third columns show results for IR-VIC with β = 1e−3 and β = 1 respectively, and the fourth column shows DIAYN. The e-values show computed lower bounds (with Eq. 1) on empowerment in nats.

We use an almost identical setup, replacing their decision-state term from supervised pretraining with the necessary option information from IR-VIC pretraining:

Rt = Re(t) + (κ / √c(St)) · I(Ω, Zt | St, S0)    (8)

where the I(Ω, Zt | St, S0) term is pretrained and frozen.

I(·) is computed with eq. (5), using a frozen parameterized encoder p(zt | ω, st) during transfer. Thus, we incentivize visitation of states where necessary option information is high.
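A minimal sketch of how such a shaped reward could be computed at a single state is given below. It assumes (our reading, not spelled out in the paper) that the per-state term is the option-prior-averaged KL from the frozen diagonal-Gaussian encoder pφ(z | ω, st) to the unit Gaussian q(z), i.e. the Eq. (5) integrand in closed form; the encoder interface and names are hypothetical.

```python
import math
import torch

@torch.no_grad()
def option_information(encoder, obs, option_prior):
    """Approximate I(Omega, Z_t | S_t = obs) as the prior-weighted KL between the
    frozen diagonal-Gaussian encoder p_phi(z | omega, obs) and the unit Gaussian q(z)."""
    info = 0.0
    for omega, prior_prob in enumerate(option_prior):
        dist = encoder(obs, omega)  # assumed to return a torch.distributions.Normal
        unit = torch.distributions.Normal(torch.zeros_like(dist.loc), torch.ones_like(dist.scale))
        info = info + prior_prob * torch.distributions.kl_divergence(dist, unit).sum(-1)
    return info

def shaped_reward(extrinsic_reward, visit_count, kappa, info):
    """Eq. (8): extrinsic reward plus a count-decayed bonus scaled by option information."""
    return extrinsic_reward + kappa / math.sqrt(visit_count) * info
```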

Algorithm. In summary, IR-VIC consists of two major phases – (1) unsupervised discovery and (2) transfer. In phase (1), we optimize Eqn. 6 based on trajectories unrolled conditioned on the initially sampled option, to update the parameterized encoder pφ(zt|ω, xt), the option-conditioned policy πθ(at|ω, xt) and the option inference network qν(ω|s0, sf). In phase (2), we freeze the pretrained encoder pφ(zt|ω, xt) and use it as an added exploration incentive to learn the goal-conditioned policy πγ(at|xt, g) (see Eqn. 8).

2.6 IR-VIC for Transfer
Options with partial observability. The methods we have described so far have assumed the true state s ∈ S to be known – the VIC framework with explicit options has only been shown to work in fully observable MDPs [Gregor et al., 2016; Eysenbach et al., 2018]. However, since we are primarily interested in improved exploration on downstream partially-observable tasks, we adapt the VIC framework to only use partially-observable information for the parts that we use during transfer. We design our policy (including the encoder p(Zt | Ω, St) used for computing the reward bonus I(Ω, Zt | St, S0) during transfer) to take as input partial observations x ∈ X, while allowing the option inference networks (of IR-VIC and DIAYN) to take as input the global (x, y) coordinates of the agent (assuming access to the true state s ∈ S). Note that this privileged information is made available for a single environment in order to discover sub-goals transferable to multiple novel environments, whereas supervised methods such as InfoBot [Goyal et al., 2019] require global (x, y) coordinates as goal information across all training environments.

Preventing option information leak. We parameterize p(at|zt, st) (fig. 2, right) using just the current state st, whereas the encoder p(zt|Ω, (s1, · · · , st)) uses all previous states. A sequence of state observations (s1, · · · , st) could potentially be very informative of the Ω being followed; if provided directly to p(at|·), it could leak option information into the actions, rendering the bottleneck on option information imposed via zt useless. Hence, in our implementation we remove recurrence over partial observations for p(at|zt, st) while keeping it in p(zt | Ω, st).
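The sketch below illustrates this split with a recurrent encoder and a purely feed-forward policy. The layer sizes, the GRU, the flat observation features, and the way the option embedding is folded into the recurrent state are our own illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class OptionEncoder(nn.Module):
    """Recurrent encoder p_phi(z_t | omega, x_{1:t}): summarizes the observation history
    with a GRU and outputs a diagonal Gaussian over z_t."""
    def __init__(self, obs_dim, num_options, hidden_dim=64, z_dim=16):
        super().__init__()
        self.option_embed = nn.Embedding(num_options, hidden_dim)
        self.gru = nn.GRUCell(obs_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, 2 * z_dim)

    def forward(self, obs, omega, h):
        h = self.gru(obs, h + self.option_embed(omega))  # fold the option into the recurrent state
        mu, log_std = self.head(h).chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.exp()), h

class BottleneckPolicy(nn.Module):
    """Feed-forward policy pi_theta(a_t | x_t, z_t): it sees only the current partial
    observation and the bottlenecked z_t, so option information can reach the action
    only through z_t."""
    def __init__(self, obs_dim, z_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + z_dim, hidden_dim), nn.Tanh(),
                                 nn.Linear(hidden_dim, num_actions))

    def forward(self, obs, z):
        return torch.distributions.Categorical(logits=self.net(torch.cat([obs, z], dim=-1)))
```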

3 Experiments

Environments. We pre-train and test on grid-worlds from the MiniGrid [Chevalier-Boisvert et al., 2018] environments. We first consider a set of simple environments – 4-Room and Maze (see Fig. 3) – followed by MultiRoomNXSY, also used by [Goyal et al., 2019]. The MultiRoomNXSY environments consist of X rooms of size Y, connected in random orientations. We refer to the ordering of rooms, doors and goal as a 'layout' in the MultiRoomNXSY environment – pre-training of options (for IR-VIC and DIAYN) is performed on a single fixed layout while transfer is performed on several different layouts (a layout is randomly selected from a set every time the environment is reset). In all pre-training environments, we fix the option trajectory length H (the number of steps an option takes before termination) to 30 steps.

We use Advantage Actor-Critic (A2C) for all experiments. Since code for InfoBot [Goyal et al., 2019] was not public, we report numbers based on a re-implementation of InfoBot, ensuring consistency with their architectural and hyperparameter choices. We refer the readers to our code² for further details.

Baselines. We evaluate the following on quality of exploration and transfer to downstream goal-driven tasks with sparse rewards: 1) InfoBot (our implementation), which identifies goal-driven decision states by regularizing goal information; 2) DIAYN, whose focus is unsupervised skill acquisition, but which has an I(At, Ω|St) term that can be used for the bonus in Equation 8; 3) count-based exploration, which uses visitation counts as the exploration incentive (this corresponds to replacing I(Ω, Zt|St, S0) with 1 in Equation 8); 4) a randomly initialized encoder p(zt | ω, xt), which is a noisy version of the count-based baseline where the scale of the reward is adjusted to match the count-based baseline; 5) how different values of β affect performance and how we choose a β value using a validation set; and 6) a heuristic baseline that uses domain knowledge to identify landmarks such as corners and doorways and provides a higher count-based exploration bonus at these states. This validates the extent to which necessary option information is useful in identifying a sparse set of states that are useful for transfer vs. heuristically determined landmarks.

² https://github.com/nirbhayjm/irvic


Method | MR-N3S4 | MR-N5S4 | MR-N6S25
pφ(Zt|St,Ω) pretrained on | MR-N2S6 | MR-N2S6 | MR-N2S10
InfoBot [Goyal et al., 2019] | 90% | 85% | N/A
InfoBot (Our Implementation) | 99.9% ± 0.1% | 79.1% ± 11.6% | 9.9% ± 1.2%
Count-based Baseline | 99.7% ± 0.1% | 99.7% ± 0.1% | 86.8% ± 2.2%
DIAYN | 99.7% ± 0.1% | 95.4% ± 4.1% | 0.1% ± 0.1%
Random Network | 99.9% ± 0.1% | 98.8% ± 0.7% | 79.5% ± 5.2%
Heuristic Baseline | N/A | N/A | 85.9% ± 3.0%
Ours (β = 10⁻²) | 99.3% ± 0.3% | 99.4% ± 0.2% | 92.9% ± 1.2%

Table 1: Success rate (mean ± standard error) of the goal-conditioned policy when trained with different exploration bonuses in addition to the extrinsic reward Re(t). We report results at 5×10⁵ timesteps for MultiRoom (MR-NXSY) N3S4 and N5S4, and at 10⁷ timesteps for N6S25. We also report the performance of InfoBot for completeness. Note that for rooms of size 4 (N3S4, N5S4), incentivizing visits to corners and doorways (Heuristic Sub-goals) is equivalent to count-based exploration.

3.1 Qualitative Results
Figure 3 shows heatmaps of necessary option information I(Ω, At|St, S0) on the 4-Room and Maze grid-world environments, where the initial state is sampled uniformly at random. Stronger regularization (β = 1) leads to poorer empowerment maximization and in some cases to not learning any options (with I(Ω, At|St, S0) collapsing to 0 at all states). At the lower value of β = 1e−3, we get more discernible states with distinctly high necessary option information. Finally, for the Maze we see that for a similar value of empowerment³, IR-VIC leads to a more peaked distribution of states with high necessary option information than DIAYN.

3.2 Quantitative Results
Transfer to Goal-driven Tasks
Next, we evaluate Equation 8, i.e. whether providing a visitation incentive proportional to the necessary option information at a state, in addition to the sparse extrinsic reward, can aid transfer to goal-driven tasks in different environments. We restrict ourselves to the point-navigation transfer task [Goyal et al., 2019] in the MultiRoomNXSY set of partially-observable environments. In this task, the agent learns a policy π(at|gt, st), where gt is the vector pointing to the goal from the agent's current location at every time step t. The agent always starts in the first room, has to go to a randomly sampled goal location in the last room, and is rewarded only when it reaches the goal. [Goyal et al., 2019] test the efficacy of different exploration objectives⁴ and show that this is a hard setting where efficient exploration is necessary. They show that InfoBot outperforms several state-of-the-art exploration methods in this environment.

Concretely, we 1) train IR-VIC to identify sub-goals (Equation 2) on MultiRoomN2S6 and transfer to a goal-driven task on MultiRoomN3S4 and MultiRoomN5S4 (similar to [Goyal et al., 2019]), and 2) train on MultiRoomN2S10 and transfer to MultiRoomN6S25, which is a more challenging transfer task, i.e. it has a larger upper limit on room size, making efficient exploration critical to find doors quickly.

³ Since DIAYN maximizes the mutual information with every state in a trajectory, we report the empowerment for the state with maximum mutual information with the option.

⁴ While our focus is on identifying and probing how good the sub-goals from intrinsic training are, broader comparisons to exploration baselines are in InfoBot [Goyal et al., 2019].

Figure 4: Transfer results on a test set of MultiRoomN6S25 environment layouts after unsupervised pre-training on MultiRoomN2S10. Shaded regions represent the standard error of the mean of average return over 10 random seeds.

For IR-VIC and DIAYN (the two methods that learn options), we pre-train on a single layout of the corresponding MultiRoom environment for 10⁶ episodes and pick the checkpoints with the highest empowerment values across training. For InfoBot (no option learning required), we pre-train as per [Goyal et al., 2019] on multiple layouts of the MultiRoom environment. Transfer performance of all methods is reported on a fixed test set of multiple MultiRoom environment layouts, and hyperparameters across all methods, e.g. β for IR-VIC and InfoBot, are selected using a validation set of MultiRoom environment layouts.

Overall Trends
Table 1 reports the success rate – the % of times the agent reaches the goal – and Figure 4 reports the average return when learning to navigate on test environments. The MultiRoomN6S25 environment provides a sparse, decaying reward upon reaching the goal – implying that, when comparing methods, a higher success rate (Table 1) indicates that the goal is reached more often, and higher return values (Figure 4) indicate that the goal is reached in fewer time steps.

First, our implementation of InfoBot is competitive with [Goyal et al., 2019]⁵. Next, for the MultiRoomN2S6 to N5S4 transfer (middle column), baselines as well as sub-goal identification methods perform well, with some models having overlapping confidence intervals despite low success means. In the MultiRoomN2S10 to N6S25 transfer, where the latter has a large state space, we find that IR-VIC (at β = 10⁻²) achieves the best sample complexity (in terms of average return) and final success, followed closely by InfoBot. Moreover, we find that the heuristic baseline which identifies a sparse set of landmarks (to mimic sub-goals) does not perform well – indicating that it is not easy to hand-specify sub-goals that are useful for the given transfer task. Finally, the randomly initialized encoder as well as DIAYN generalize much worse in this transfer task.

⁵ We found it important to run all models (including InfoBot) an order of magnitude more steps compared to [Goyal et al., 2019], but our models also appear to converge to higher success values.


Figure 5: Evaluation of average return on a held-out validation set of MultiRoomN6S25 environment layouts. For each value of β, pre-training is performed over 3 random seeds, with the best seed being picked to measure transfer performance over 3 subsequent random seeds. Shaded regions represent the standard error of the mean over the 3 random seeds used for transfer.

β sensitivity. We sweep over β in log-scale from 10⁻¹, . . . , 10⁻⁶, as shown in Figure 5 (except β = 10⁻¹, which does not converge to > 0 empowerment), and also report β = 0, which recovers a no-information-regularization baseline. We find that 10⁻² works best – with performance tailing off at smaller values. This is intuitive, since for a really large value of β one does not learn any options (as the empowerment is too low), while for a really small value of β one might not be able to target necessary option information, getting large "sufficient" (but not necessary) option information for the underlying option-conditioned policy.

We pick the best model for transfer based on performance on the validation environments, and study generalization to novel test environments. Choosing the value of β in this setting is thus akin to model selection. Such design choices are inherent in unsupervised representation learning in general (e.g. with K-means and β-VAE [Higgins et al., 2017]).

4 Related Work

Intrinsic control and intrinsic motivation. Learning how to explore without extrinsic rewards is a foundational problem in Reinforcement Learning [Pathak et al., 2017; Gregor et al., 2016; Schmidhuber, 1990]. Typical curiosity-driven approaches attempt to visit states that maximize the surprise of an agent [Pathak et al., 2017] or the improvement in predictions from a dynamics model [Lopes et al., 2012]. While curiosity-driven approaches seek out and explore novel states, they typically do not measure how reliably the agent can reach them. In contrast, approaches for intrinsic control [Eysenbach et al., 2018; Achiam et al., 2018; Gregor et al., 2016] explore novel states while ensuring those states are reliably reachable. [Gregor et al., 2016] maximize the number of final states that can be reliably reached by the policy, while [Eysenbach et al., 2018] distinguish an option (which they refer to as a 'skill') at every state along the trajectory, and [Achiam et al., 2018] learn options for entire trajectories by encoding a sub-sequence of states sampled at regular intervals. Since we wish to learn to identify useful sub-goals which one can reach reliably by acting in an environment, rather than just visiting novel states (without an estimate of reachability), we formulate our regularizer in the intrinsic control framework, specifically building on the work of [Gregor et al., 2016].

Default behavior and decision states. Recent work in policy compression has focused on learning a default policy when training on a family of tasks, to be able to re-use behavior across tasks. In [Teh et al., 2017], default behavior is learnt using a set of task-specific policies which then regularizes each policy, while [Goyal et al., 2019] learn a default policy using an information bottleneck on task information and a latent variable the policy conditions on, identifying sub-goals which they term "decision states". We devise a similar information regularization objective that learns default behavior shared by all intrinsic options without external rewards, so as to reduce the learning pressure on option-conditioned policies. Different from these previous approaches, our approach does not need any explicit reward specification when learning options (of course, since we care about transfer, we still need to do model selection based on validation environments).

Bottleneck states in MDPs. There is a rich literature on the identification of bottleneck states in MDPs. The core idea is to either identify all the states that are common to multiple goals in an environment [McGovern and Barto, 2001] or use a diffusion model built using an MDP's transition matrix [Machado et al., 2017]. The key distinction between bottleneck states and necessary-information based sub-goals is that the latter are more closely tied to the information available to the agent and what it can act upon, whereas bottleneck states are more tied to the connectivity structure of an MDP and intrinsic to the environment, representing states which, when visited, allow access to a novel set of states [Goyal et al., 2019]. Moreover, bottleneck states do not easily apply to partially observed environments or to settings where the transition dynamics of the MDP are not known.

Information bottleneck in machine learning. Since the foundational work of [Tishby et al., 1999; Chechik et al., 2005], there has been a lot of interest in making use of ideas from the information bottleneck (IB) for various tasks such as clustering [Strouse and Schwab, 2017; Still et al., 2004], sparse coding [Chalk et al., 2016], classification using deep learning [Alemi et al., 2016], cognitive science and language [Zaslavsky et al., 2018] and reinforcement learning [Goyal et al., 2019; Strouse et al., 2018]. We apply an information regularizer to an RL agent that results in a sparse set of states where necessary option information is high, which correspond to our sub-goals.

5 Conclusion

We devise a principled approach to identify sub-goals in an environment without any extrinsic reward supervision, using a sandwich bound on the empowerment of [Gregor et al., 2016]. Our approach yields sub-goals that aid efficient exploration on external-reward tasks and subsequently lead to better success rates and sample complexity in novel environments (competitive with supervised sub-goals). Our code and environments will be made publicly available.


References

[Achiam et al., 2018] Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299, 2018.

[Alemi et al., 2016] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2016.

[Chalk et al., 2016] Matthew Chalk, Olivier Marre, and Gasper Tkacik. Relevant sparse codes with variational information bottleneck. 2016.

[Chechik et al., 2005] Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6(Jan):165–188, 2005.

[Chevalier-Boisvert et al., 2018] Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.

[Cover and Thomas, 1991] Thomas M Cover and Joy A Thomas. Elements of Information Theory. Wiley Series in Telecommunications, 1991.

[Eysenbach et al., 2018] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. CoRR, abs/1802.06070, 2018.

[Goyal et al., 2019] Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Matthew Botvinick, Hugo Larochelle, Yoshua Bengio, and Sergey Levine. InfoBot: Transfer and exploration via the information bottleneck. arXiv [stat.ML], January 2019.

[Gregor et al., 2016] Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.

[Hayes-Roth and Hayes-Roth, 1979] Barbara Hayes-Roth and Frederick Hayes-Roth. A cognitive model of planning. Cognitive Science, 3(4):275–310, 1979.

[Higgins et al., 2017] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.

[Lillicrap et al., 2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[Lopes et al., 2012] Manuel Lopes, Tobias Lang, Marc Toussaint, and Pierre-Yves Oudeyer. Exploration in model-based reinforcement learning by empirically estimating learning progress. In NIPS, 2012.

[Machado et al., 2017] Marlos C. Machado, Marc Bellemare, and Michael Bowling. A Laplacian framework for option discovery in reinforcement learning. In ICML, 2017.

[McGovern and Barto, 2001] Amy McGovern and Andrew G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, 2001.

[Pathak et al., 2017] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 16–17, 2017.

[Polani et al., 2006] Daniel Polani, Chrystopher L Nehaniv, Thomas Martinetz, and Jan T Kim. Relevant information in optimized persistence vs. progeny strategies. In Artificial Life X: Proc. of the Tenth International Conference on the Simulation and Synthesis of Living Systems. MIT Press, 2006.

[Salge et al., 2013] Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment – an introduction. October 2013.

[Schmidhuber, 1990] Jurgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the First International Conference on Simulation of Adaptive Behavior on From Animals to Animats, pages 222–227, Cambridge, MA, USA, 1990. MIT Press.

[Still et al., 2004] Susanne Still, William Bialek, and Leon Bottou. Geometric clustering using the information bottleneck method. In NIPS, 2004.

[Strouse and Schwab, 2017] D J Strouse and David J Schwab. The information bottleneck and geometric clustering. December 2017.

[Strouse et al., 2018] D J Strouse, Max Kleiman-Weiner, Josh Tenenbaum, Matt Botvinick, and David Schwab. Learning to share and hide intentions using information regularization. August 2018.

[Teh et al., 2017] Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In NIPS, 2017.

[Tishby et al., 1999] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. In The 37th Annual Allerton Conference on Communication, Control, and Computing, pages 368–377, 1999.

[van Dijk and Polani, 2011] Sander G van Dijk and Daniel Polani. Grounding subgoals in information transitions. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 105–111, 2011.

[Zaslavsky et al., 2018] Noga Zaslavsky, Charles Kemp, Terry Regier, and Naftali Tishby. Efficient compression in color naming and its evolution. Proc. Natl. Acad. Sci. U. S. A., 115(31):7937–7942, July 2018.

[Ziebart et al., 2008] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
