Extensive Games
Raunak Jain
7/29/2019
Concentrating on moves rather than on whole strategies is the basic difference between best-response and reinforcement learning methods, and games in extensive form, due to their size, can only be solved as a reinforcement learning problem. In the setup of the paper, the game is played as follows (the exploratory myopic strategy rule): the player uses a valuation method where he/she evaluates a strategy by averaging the values received from playing that strategy in the rounds before the current one, and then revises the valuation after playing the action with the highest valuation. So in each round he myopically chooses the strategy with the best valuation at that round, which is not necessarily the maximizing strategy for the repeated game.
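The valuation rule described above can be sketched in code. This is my own minimal rendering under simplifying assumptions (a single learning player, a fixed stage-game payoff per action, exploration with a small probability epsilon); the function and parameter names are mine, not the paper's:

```python
import random

def play_repeated_game(payoff, n_actions, rounds, epsilon=0.1, seed=0):
    """Exploratory myopic play: keep a running-average valuation per action,
    play the action with the highest current valuation, and explore a random
    action with probability epsilon so every action keeps getting sampled."""
    rng = random.Random(seed)
    totals = [0.0] * n_actions   # cumulative payoff received from each action
    counts = [0] * n_actions     # number of times each action was played
    for _ in range(rounds):
        if rng.random() < epsilon:          # exploratory move
            a = rng.randrange(n_actions)
        else:                               # myopic move: best current valuation
            vals = [totals[i] / counts[i] if counts[i] else 0.0
                    for i in range(n_actions)]
            a = max(range(n_actions), key=lambda i: vals[i])
        r = payoff(a, rng)                  # stage-game payoff this round
        totals[a] += r                      # revise the valuation by averaging
        counts[a] += 1
    return [totals[i] / counts[i] if counts[i] else 0.0 for i in range(n_actions)]
```

For example, with two actions where action 1 always pays 1 and action 0 always pays 0, the valuations converge to roughly [0, 1], and the myopic choice settles on action 1; the epsilon exploration is what lets the player discover action 1 in the first place, since both valuations start tied at 0.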
Given these updating rules, the following can be said about the convergence of the payoffs of the learning/non-learning players in the long run.
a) A learning player who plays in a scenario where only he learns earns, after some time point, a payoff which is at least (1 - epsilon) times his minimax payoff in the stage game. So the rule in a way guarantees the player a payoff regardless of what the other players do; in this scenario the learning player learns to play the stage game itself, not against what the opponents might play.
b) In the case where all players learn, the payoff to each player is greater than what he/she would have got in case (a). Under the earlier assumptions, with every player learning simultaneously, the whole game converges to the subgame perfect Nash equilibrium.
In win-lose games, where the payoffs are either 1 or 0 (the player either wins or loses), the exploratory myopic strategy rule can be eased into a purely myopic one, where moves without the highest valuation are not played at all. The averaging process is also modified to consider only the last round rather than the whole history (memoryless revision). In such a case a player who can win the stage game always wins in the long run if he/she follows the rules. No assumption is made on the strategies of the other players, and since their strategies might still depend on the history, the process generated cannot be modeled as a Markov process.
In the 0-1 game example, the learning process described, which deletes the strategy with the least payoff, is not the same as assigning probability 0 to the moves without the highest valuation.
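The memoryless win-lose variant can be sketched as follows. This is again my own toy rendering: I assume a deterministic stage game encoded as a payoff list (1 = win, 0 = lose) and break valuation ties by lowest index, neither of which the paper necessarily does:

```python
def memoryless_myopic(payoff, init_vals, rounds):
    """Win-lose play with memoryless revision: an action's valuation is just
    the payoff (0 or 1) it earned the last time it was played; each round the
    player plays only a highest-valued action (ties -> lowest index)."""
    vals = list(init_vals)
    outcomes = []
    for _ in range(rounds):
        a = max(range(len(vals)), key=lambda i: vals[i])
        r = payoff[a]        # 1 = win, 0 = lose in the stage game
        vals[a] = r          # memoryless: overwrite, no averaging over history
        outcomes.append(r)
    return outcomes

# With a winning action available and positive initial valuations, each losing
# action is tried at most once (its valuation drops to 0 and it is never
# revisited), after which the player wins forever.
print(memoryless_myopic([0, 0, 1], [0.5, 0.5, 0.5], 8))  # [0, 0, 1, 1, 1, 1, 1, 1]
```

This illustrates the "always wins in the long run" claim in the simplest possible setting: under memoryless revision a losing move poisons its own valuation immediately, so the positive initial valuations eventually funnel play to the winning move.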
Qs. If the player assumes that the other players play with a certain fixed strategy, and he is aware of the payoffs, doesn't the process then become a Markov process?
Since the learning process assumes nothing about the other players and only considers what she/he gets out of a move, it is a kind of low-information-requirement strategy; features like the similarity of moves also make it efficient to use on a computer.
Theorem 1 considers a win-lose situation: given a positive initial valuation for every strategy i in the super-strategy set, if the game has a strategy which guarantees the player a win, then under myopic play with memoryless revision there is a time after which the player wins forever.
In the example where player a evaluates the initial payoffs to L and R as 0, it is shown that player a cannot reach the winning strategy (the one that gives him the winning payoff) with probability 1, and can only reach it with probability less than 1.
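The role of the initial valuations in the L/R example can be replayed with a tiny simulation. This is a hypothetical reconstruction, not the paper's example: I use deterministic lowest-index tie-breaking, so the zero-valuation case gets stuck with certainty, whereas with randomized tie-breaking the winning move would be reached only with some probability:

```python
def play(vals, payoff, rounds=6):
    """Myopic play with memoryless revision: pick the highest-valued move
    (ties go to the lowest index), then replace that move's valuation with
    the payoff just received (win-lose payoffs: 0 or 1)."""
    vals = list(vals)
    moves = []
    for _ in range(rounds):
        a = max(range(len(vals)), key=lambda i: vals[i])
        vals[a] = payoff[a]         # overwrite: only the last round counts
        moves.append("LR"[a])
    return "".join(moves)

payoff = [0, 1]                     # L always loses, R always wins
print(play([0.0, 0.0], payoff))     # "LLLLLL": zero valuations lock in L
print(play([0.5, 0.5], payoff))     # "LRRRRR": positive valuations reach R
```

With both initial valuations at 0, playing L yields payoff 0, which leaves L's valuation tied with R's, so the tie-break keeps selecting L and R is never tried; with positive initial valuations, one losing trial of L drops it below R and play switches to R forever, which is the mechanism behind Theorem 1.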
Qs. Is giving positive initial valuations to the actions enough to guarantee convergence to a winning strategy, or are there cases where even positive initial valuations can lead to some strategies being ruled out?