Nash Q-Learning for General-Sum Stochastic Games · ... › presentations › presentation11.pdf
TRANSCRIPT
Nash Q-Learning for General-Sum Stochastic Games
Hu, J. and Wellman, M.P.
Lauri Lyly, [email protected]
Stochastic general-sum games
● Stochasticity: the environment is in part formed by the other agents
  – Non-deterministic, non-cooperative setting with no binding agreements
● Arbitrary relation between the agents' rewards
  – Extends the previous topic, zero-sum games
Nash Equilibrium
● Best-response joint strategy: each agent's strategy is a best response to the others'
● Study limited to stationary strategies (policies)
● Rewards of the other agents are observed; their strategies are not
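As a concrete sketch, a pure-strategy Nash equilibrium of a two-player game is a joint action from which neither player benefits by deviating unilaterally. The payoff matrices below are hypothetical illustrations, not from the paper:

```python
import numpy as np

# Hypothetical 2x2 coordination game (not from the paper): A[i, j] is
# player 1's reward and B[i, j] player 2's when player 1 plays row i
# and player 2 plays column j.
A = np.array([[2, 0], [0, 1]])
B = np.array([[2, 0], [0, 1]])

def is_pure_nash(A, B, i, j):
    # (i, j) is a best-response joint strategy: neither player can
    # improve its own payoff by a unilateral deviation.
    return A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max()

equilibria = [(i, j) for i in range(2) for j in range(2)
              if is_pure_nash(A, B, i, j)]
print(equilibria)  # → [(0, 0), (1, 1)]
```

Note that this game has two equilibria with different payoffs, which is exactly the selection problem general-sum games introduce over the zero-sum case.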
Nash Q-Values
Stage game
● A one-period game, as opposed to a stochastic (multi-period) game
● Mainly used in the convergence proof
● Nash equilibrium is defined for the stage game, where M is a “payoff function”
Update rule
● The same update rule is used for the agent itself and for its conjectures of the other agents' Q-functions
● Q-functions can be initialized, for example, to 0
● Asynchronous updating: only the entries pertaining to the current state are updated
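The update rule above can be sketched for a two-agent game as follows. This is a minimal illustration: the stage-game solver only searches pure-strategy equilibria (the full algorithm also handles mixed strategies), and all table shapes, rewards, and parameter values are made up:

```python
import numpy as np

def pure_nash_payoffs(M1, M2):
    # Stage-game solver restricted to pure strategies (a simplification).
    for i in range(M1.shape[0]):
        for j in range(M1.shape[1]):
            if M1[i, j] >= M1[:, j].max() and M2[i, j] >= M2[i, :].max():
                return M1[i, j], M2[i, j]
    raise ValueError("no pure-strategy equilibrium in this stage game")

def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha, beta):
    # Asynchronous update: only the entry for the current (s, a1, a2) moves.
    # Agent 1 keeps Q2 as its conjecture of agent 2's Q-function and updates
    # it with the same rule, using agent 2's observed reward r2.
    v1, v2 = pure_nash_payoffs(Q1[s_next], Q2[s_next])
    Q1[s][a1, a2] = (1 - alpha) * Q1[s][a1, a2] + alpha * (r1 + beta * v1)
    Q2[s][a1, a2] = (1 - alpha) * Q2[s][a1, a2] + alpha * (r2 + beta * v2)

# Q-functions initialized to 0: one 2x2 joint-action table per state.
Q1 = {s: np.zeros((2, 2)) for s in range(2)}
Q2 = {s: np.zeros((2, 2)) for s in range(2)}
nash_q_update(Q1, Q2, s=0, a1=0, a2=1, r1=5.0, r2=3.0,
              s_next=1, alpha=0.5, beta=0.9)
print(Q1[0][0, 1], Q2[0][0, 1])  # → 2.5 1.5
```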
The Nash Q-learning algorithm
Convergence proof requirements
● Assumption 1: every state-action tuple is visited infinitely often
● Assumption 2: the learning rate α(t) satisfies:
  – The sum Σₜ α(t) goes to infinity
  – The squared sum Σₜ α(t)² does not (it stays finite)
  – α(t) = 0 if the element being updated does not correspond to the current state-action tuple (asynchronous updating)
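A quick numerical illustration (not a proof) that the standard choice α(t) = 1/t meets both summability conditions: its partial sums grow without bound, while the partial sums of its squares stay bounded by π²/6:

```python
import math

alpha = lambda t: 1.0 / t  # a standard choice satisfying Assumption 2

s1 = sum(alpha(t) for t in range(1, 100_001))       # grows like log(t): unbounded
s2 = sum(alpha(t) ** 2 for t in range(1, 100_001))  # bounded, approaches pi^2/6

print(round(s1, 2), round(s2, 4))
```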
Proof basis and result
● The Q-learning process is updated by a pseudo-contraction operator P(t) in the usual form:
  Q(t+1) = (1 − α(t)) Q(t) + α(t) [P(t) Q(t)]
  Contraction: the values approach the optimal Q
● Links the stage games to the stochastic game
● Goal: show that NashQ is a pseudo-contraction operator
● Actually a real contraction operator under restricted conditions:
  – Games with special types of Nash equilibrium points
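The role of the contraction property can be seen with a toy fixed-point iteration. Here P is a made-up affine contraction standing in for the operator, not the NashQ operator itself; repeated application converges to the unique fixed point, which is what the proof establishes for the true Q-values:

```python
def P(Q):
    # Toy affine contraction with modulus 0.9; its unique fixed point
    # solves Q* = 0.9 * Q* + 1, i.e. Q* = 10.
    return 0.9 * Q + 1.0

Q = 0.0
for _ in range(200):
    Q = P(Q)  # each application shrinks the distance to Q* by a factor 0.9
print(round(Q, 2))  # → 10.0
```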
Different Nash equilibria
● Global optimal point, using stage-game notation
● Saddle point
● All equilibria chosen for the update must be of the same type
Stage games compared
● Global optimal point: (Up, Left)
● Saddle point: (Down, Right)
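The two equilibrium types can be checked mechanically. The matrices below are illustrative stand-ins for the slide's games, not the paper's exact payoffs; rows are player 1's actions (Up, Down) and columns player 2's (Left, Right):

```python
import numpy as np

def is_nash(A, B, i, j):
    return A[i, j] >= A[:, j].max() and B[i, j] >= B[i, :].max()

def is_global_optimal(A, B, i, j):
    # A Nash point where every player does at least as well as at ANY
    # joint action of the game.
    return is_nash(A, B, i, j) and A[i, j] >= A.max() and B[i, j] >= B.max()

def is_saddle(A, B, i, j):
    # A Nash point where a unilateral deviation by the opponent never
    # lowers a player's payoff.
    return (is_nash(A, B, i, j)
            and A[i, :].min() >= A[i, j]
            and B[:, j].min() >= B[i, j])

A1 = np.array([[4, 0], [0, 2]]); B1 = A1.copy()  # coordination-style game
A2 = np.array([[3, 1], [4, 2]]); B2 = -A2        # zero-sum game

print(is_global_optimal(A1, B1, 0, 0))  # (Up, Left): → True
print(is_saddle(A2, B2, 1, 1))          # (Down, Right): → True
```

In the first game (Up, Left) is not a saddle point, and in the second (Down, Right) is not globally optimal, so mixing the two selection rules across updates would break the same-type requirement above.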
Experimentation framework
● Two grid-world games
● Motivation for grid games: state-specific actions, qualitative transitions, immediate and long-term rewards
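A minimal two-agent grid-game environment in that spirit (the layout, goals, and reward values are illustrative, not the paper's exact games):

```python
class GridGame:
    """Two agents on a 3x3 grid, each heading for its own goal corner.
    Actions: 0=up, 1=down, 2=left, 3=right. If both agents target the
    same cell they bounce back (a qualitative transition), and reaching
    a goal pays an immediate reward, so short-term moves interact with
    long-term payoffs."""
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self):
        self.pos = [(2, 0), (2, 2)]    # bottom-left and bottom-right corners
        self.goals = [(0, 2), (0, 0)]  # opposite top corners

    def step(self, a1, a2):
        targets = []
        for p, a in zip(self.pos, (a1, a2)):
            r, c = p[0] + self.MOVES[a][0], p[1] + self.MOVES[a][1]
            targets.append((r, c) if 0 <= r < 3 and 0 <= c < 3 else p)
        if targets[0] == targets[1]:   # collision: both agents stay put
            targets = list(self.pos)
        self.pos = targets
        rewards = [100 if p == g else 0 for p, g in zip(self.pos, self.goals)]
        return tuple(self.pos), rewards

g = GridGame()
state, rewards = g.step(0, 0)  # both move up, no conflict
print(state, rewards)          # → ((1, 0), (1, 2)) [0, 0]
```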
Learning process
● Violates Assumption 3, the monotonic selection of global optima or saddle points
● Still converges in most cases regardless of the selection
● Offline and online learning are rated separately
[Slides 14–16 contain only figures; no transcript text.]
Conclusions
● No current method provides performance guarantees for general-sum stochastic games
● Works as a starting point
  – There are other promising variants
  – The Nash equilibrium concept itself can be refined
References
● [7] Hu, J. and Wellman, M.P. (2003). Nash Q-Learning for General-Sum Stochastic Games. Journal of Machine Learning Research, 4:1039–1069.