

Extensive Games - Raunak Jain

The difference between concentrating on moves rather than on strategies is the basic difference between best-response and reinforcement learning methods, and games in extensive form, due to their size, can realistically only be solved as a reinforcement learning problem.

In the setup of the paper, the game is played as follows (namely, the exploratory myopic strategy rule): the player uses a valuation method in which he/she evaluates a strategy by averaging the payoffs received from playing that strategy in the rounds before the current one, plays the action with the highest valuation, and then revises the valuations. So in each round he myopically chooses the strategy with the best valuation at that round, which is not necessarily the maximizing strategy for the repeated game.
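A minimal sketch of this rule in Python, with the extensive game collapsed to a single decision for brevity. The function name, the payoff(move) interface and the exploration rate epsilon are my own assumptions, not notation from the paper:

    import random

    def exploratory_myopic_rule(moves, payoff, rounds=10000, epsilon=0.05):
        # Valuation of a move = average payoff received on the rounds it
        # was played; unplayed moves are valued at 0.
        total = {m: 0.0 for m in moves}
        count = {m: 0 for m in moves}

        def valuation(m):
            return total[m] / count[m] if count[m] else 0.0

        for _ in range(rounds):
            if random.random() < epsilon:
                move = random.choice(moves)       # occasional exploration
            else:
                move = max(moves, key=valuation)  # myopic: best current valuation
            u = payoff(move)                      # play and observe this round
            total[move] += u                      # revise the valuation afterwards
            count[move] += 1
        return {m: valuation(m) for m in moves}

For example, exploratory_myopic_rule(["L", "R"], lambda m: 1.0 if m == "R" else 0.0) ends up valuing R near 1 and playing it in almost every round.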

Now, considering these updating rules, we can say the following about the convergence of the payoffs of the learning/non-learning players in the long run.

a) The learning player, in a scenario where only he learns, earns after some time point a payoff of at least (1 - epsilon) times his minimax payoff in the stage game (a formalization of this bound follows the list). Therefore, in a way, the rule guarantees a payoff to the player under consideration regardless of what the other players do, so in this scenario the learning player learns to play the stage game itself and not against what the opponents might actually play.

b) In the case where all players learn, the payoff to each player is greater than what he/she might have got in case (a). If the game is played under the earlier assumptions, with every player learning simultaneously, then the whole game converges to the subgame perfect Nash equilibrium.
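One plausible formalization of the bound in (a), with notation I am assuming rather than quoting from the paper (u_i^t for player i's realized payoff in round t, v_i for his minimax payoff in the stage game, epsilon the exploration rate):

    \liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} u_i^t \;\ge\; (1 - \epsilon)\, v_i \quad \text{whatever strategies the other players use.}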

In the case of win-lose games, where the payoffs are either 1 or 0, that is, the player either wins or loses, the exploratory myopic strategy rule can be eased into a purely myopic one, where the moves without the highest valuation are not considered for play at all. The averaging process is also modified to consider only the last round rather than the whole history (memoryless revision). In such a case, the player who can win the stage game always wins in the long run if he/she follows these rules. No assumption is made on the strategies of the other players, and since their strategies might still depend on the history, the process generated cannot be modeled as a Markov process.
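A toy sketch of this memoryless rule, again collapsing the game to a single decision node and replacing the opponents by a fixed set of winning moves (my simplification; in the paper the other players' strategies are arbitrary). The uniform tie-breaking and the positive initial valuation of 0.5 are also my assumptions:

    import random

    def memoryless_myopic(moves, winning, rounds=30, initial=0.5, seed=0):
        rng = random.Random(seed)
        value = {m: initial for m in moves}   # positive initial valuations
        outcomes = []
        for _ in range(rounds):
            best = max(value.values())
            move = rng.choice([m for m in moves if value[m] == best])
            payoff = 1 if move in winning else 0
            value[move] = payoff              # memoryless: only the last round counts
            outcomes.append(payoff)
        return outcomes

    print(memoryless_myopic(["L", "R"], winning={"R"}))

Once the winning move is played, its valuation sticks at 1 while any losing move played drops to 0, so from that point on the player wins in every round.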

In the example of the 0-1 game, the learning process mentioned there, which deletes the strategy with the least payoff, is not the same as the method of allotting probability 0 to the moves that do not have the highest valuation.

Qs. If the player assumes that the other players play with a certain strategy, and he is aware of the payoffs, then doesn't it become a Markov process?

Since the learning process does not assume anything about the other players and considers only what he/she gets out of a given move, it is a kind of low-information-requirement strategy; moreover, features such as the similarity of moves make it efficient to run on a computer.

Theorem 1 considers a win-lose situation: given a positive initial valuation of a strategy i in a superstrategy set, and given that the game has a strategy which guarantees the player a win, then, under the myopic strategy rule with memoryless revision, there is a time after which the player wins forever. In the example where player a evaluates the initial payoffs of L and R as 0, it is shown that player a cannot reach the winning strategy, the one that gives him the winning payoff, with probability 1; he reaches it only with some probability strictly less than 1.
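A toy reconstruction of this failure mode, not the paper's actual example: a two-node win-lose tree in which only the path R followed by r wins, with memoryless revision along the played path. The tree, the tie-breaking convention (the player switches only to a strictly better move) and the initial valuations are all my assumptions:

    import random

    def run(initial, rounds=100, seed=None):
        # Hypothetical tree: root move L loses immediately; R leads to a
        # second node where r wins (payoff 1) and l loses (payoff 0).
        rng = random.Random(seed)
        value = dict(initial)
        current = {"root": rng.choice(["L", "R"]), "node2": rng.choice(["l", "r"])}

        def choose(node, options):
            # assumed tie-breaking: keep the current move unless another
            # move has a strictly higher valuation
            best = max(value[m] for m in options)
            if value[current[node]] < best:
                current[node] = rng.choice([m for m in options if value[m] == best])
            return current[node]

        wins = 0
        for _ in range(rounds):
            path = [choose("root", ["L", "R"])]
            if path[0] == "R":
                path.append(choose("node2", ["l", "r"]))
            payoff = 1 if path == ["R", "r"] else 0
            for m in path:                 # memoryless revision along the path
                value[m] = payoff
            wins += payoff
        return wins

    zero = {"L": 0, "R": 0, "l": 0, "r": 0}
    trials = [run(zero, seed=s) for s in range(2000)]
    print(sum(w > 0 for w in trials) / len(trials))   # about 0.25

With all-zero initial valuations, every losing path is absorbing under this convention, so the winning path is found only if the very first random choices already point to it (probability 1/4 in this toy tree); the exact probability in the paper's example depends on its own game and tie-breaking rule.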

Qs. Is giving positive initial valuations to the actions enough to guarantee convergence to a winning strategy, or are there cases where even positive initial valuations can lead to some strategies being ruled out?