An Introduction to Contextual Bandits Algorithm
Ph.D. Candidate: Qing Wang, 2016, Florida International University
Outline
§ Introduction
§ Motivation
§ Context-free Bandit Algorithms
§ Contextual Bandit Algorithms
§ Our Work
§ Ensemble Contextual Bandits for Personalized Recommendation
§ Personalized Recommendation via Parameter-Free Contextual Bandits
§ Future Work
§ Q&A
What is Personalized Recommendation?
§ Personalized recommendation helps users find interesting items based on the individual interests of each user.
§ Ultimate Goal: maximize user engagement.
What is the Cold Start Problem?
§ We do not have enough observations for new items or new users.
§ How can we predict the preferences of users if we do not have data?
§ Many practical issues with offline data:
§ Historical user log data is biased.
§ User interest may change over time.
Approach: Multi-armed Bandit Algorithm
§ A gambler walks into a casino.
§ A row of slot machines provides random rewards.
Objective: Maximize the sum of rewards (money)!
Example: News Personalization
§ Recommend news based on users' interests.
§ Goal: Maximize users' click-through rate (CTR).
[1] Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
Example: News Personalization
§ There are a bunch of articles in the news pool.
§ Users arrive sequentially, ready to be served.
[1] Zhou Li, "News personalization with multi-armed bandits."
Example: News Personalization
§ At each time, we want to select one article for the user.
[Diagram: the MAB selects article 1 from the news articles and asks the user: Like it?]
Example: News Personalization
§ Goal: maximize CTR.
[Diagram: the MAB shows article 1; the user's response: Like it? Not really!]
Example: News Personalization
§ Update the model with the user's feedback.
[Diagram: article 1 is disliked ("Not really!"), and the feedback flows back to the MAB.]
Example: News Personalization
§ Update the model once given the feedback.
[Diagram: the MAB now shows article 2; the user's response: Like it? Yeah!]
Example: News Personalization
§ Update the model once given the feedback.
[Diagram: the feedback on article 2 ("Yeah!") flows back to the MAB.]
How about articles 3, 4, 5, …?
Multi-armed Bandit Definition
§ The MAB problem is a classical paradigm in machine learning in which an online algorithm chooses from a set of strategies in a sequence of trials so as to maximize the total payoff of the chosen strategies. [1]
[1] http://research.microsoft.com/en-us/projects/bandits/
Application: Clinical Trial
§ Two treatments with unknown effectiveness.
[1] Einstein, A., B. Podolsky, and N. Rosen, 1935, "Can quantum-mechanical description of physical reality be considered complete?", Phys. Rev. 47, 777-780.
Web Advertising
§ Where to place the ad?
[1] Tang, L., R. Rosales, A. Singh, et al. "Automatic ad format selection via contextual bandits." Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2013: 1587-1594.
Playing Golf with Multiple Balls
[1] Dumitriu, Ioana, Prasad Tetali, and Peter Winkler. "On playing golf with two balls." SIAM Journal on Discrete Mathematics 16.4 (2003): 604-615.
Multi-Agent System
§ K agents tracking N (N > K) targets.
[1] Le Ny, Jerome, Munther Dahleh, and Eric Feron. "Multi-agent task assignment in the bandit framework." Decision and Control, 2006. 45th IEEE Conference on. IEEE, 2006.
Some Jargon Terms [1]
§ Arm: one idea/strategy
§ Bandit: a group of ideas (strategies)
§ Pull/Play/Trial: one chance to try your strategy
§ Reward: the unit of success we measure after each pull
§ Regret: performance metric
[1] Bandit Algorithms for Website Optimization: Developing, Deploying, and Debugging. John Myles White, O'Reilly Media, 2012.
K-Armed Bandit
[1] CS246: Mining Massive Data Sets 2015, Stanford University
§ Each arm a:
§ Wins (reward = 1) with fixed (unknown) probability μ_a
§ Loses (reward = 0) with fixed (unknown) probability 1 − μ_a
§ All draws are independent given μ_1, …, μ_k
§ How should we pull arms to maximize total reward? (Estimate each arm's probability of winning, μ_a.)
Model of K-Armed Bandit
§ Set of k choices (arms)
§ Each choice a is associated with an unknown probability distribution P_a supported in [0, 1]
§ We play the game for T rounds
§ In each round t:
§ We pick some arm j
§ We obtain a random sample X_t from P_j (the reward is independent of previous draws)
§ Goal: maximize Σ_{t=1}^T X_t (without knowing μ_a)
§ However, every time we pull some arm a we get to learn a bit about μ_a.
Performance Metric: Regret
§ Let μ_a be the mean of P_a
§ Payoff/reward of the best arm: μ* = max{μ_a : a = 1, …, k}
§ Let i_1, …, i_T be the sequence of arms pulled
§ Instantaneous regret at time t: r_t = μ* − μ_{i_t}
§ Total regret: R_T = Σ_{t=1}^T r_t
§ Typical goal: an arm allocation strategy that guarantees R_T / T → 0 as T → ∞
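The regret bookkeeping above can be sketched in a few lines of Python. Note this is an illustrative example only: the true means and the pull sequence are invented, and in practice μ_a is unknown to the algorithm.

```python
# Hypothetical example: total regret of a fixed pull sequence,
# computed with the (normally unknown) true arm means.
mu = [0.3, 0.5, 0.7]                         # true means mu_a (assumed for illustration)
mu_star = max(mu)                            # best-arm payoff mu*
pulls = [0, 2, 1, 2, 2]                      # i_1, ..., i_T: arms pulled in rounds 1..T
instant = [mu_star - mu[i] for i in pulls]   # instantaneous regret r_t = mu* - mu_{i_t}
R_T = sum(instant)                           # total regret R_T = sum_t r_t
avg = R_T / len(pulls)                       # R_T / T, which a good strategy drives to 0
```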
Allocation Strategies
§ If we knew the payoffs, which arm should we pull?
§ Best arm: μ* = max{μ_a : a = 1, …, k}
§ What if we only care about estimating the payoffs μ_a?
§ Pick each of the k arms equally often: T/k times each
§ Estimate: μ̂_a = (k/T) Σ_{j=1}^{T/k} X_{a,j}
§ Total regret: R_T = (T/k) Σ_{a=1}^{k} (μ* − μ_a)
Exploitation vs. Exploration
§ Tradeoff:
§ With only exploitation (making decisions based on historical data), you will have bad estimates for the "best" items.
§ With only exploration (gathering data about arm payoffs), you will have low user engagement.
Algorithms for the Exploration & Exploitation Tradeoff
§ Context-free:
1. ε-greedy algorithm [1]
2. UCB1 [2]
§ Contextual:
1. EXP3, EXP4
2. Thompson Sampling [3]
3. LinUCB [4]
[1] Wynn, P. "On the convergence and stability of the epsilon algorithm." SIAM Journal on Numerical Analysis, 1966, 3(1): 91-122.
[2] Auer, P., N. Cesa-Bianchi, and P. Fischer. "Finite-time analysis of the multi-armed bandit problem." Machine Learning, 2002, 47(2-3): 235-256.
[3] Agrawal, S., and N. Goyal. "Analysis of Thompson sampling for the multi-armed bandit problem." arXiv preprint arXiv:1111.1797, 2011.
[4] Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
ε-Greedy Algorithm
§ It tries to be fair to the two opposing goals of exploration (with probability ε) and exploitation (with probability 1 − ε) by using a simple mechanism: flipping a coin.
[Diagram: in round t, with probability ε explore (each arm chosen with probability ε/k); with probability 1 − ε exploit (choose the best arm a*).]
§ For t = 1:T:
§ Set ε_t = O(1/t)
§ With probability ε_t: explore by picking an arm chosen uniformly at random
§ With probability 1 − ε_t: exploit by picking the arm with the highest empirical mean payoff
§ Theorem [Auer et al. '02]: for a suitable choice of ε_t, the total regret grows only logarithmically in T.
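The ε_t-greedy loop above can be sketched as follows. This is a minimal illustration, not the authors' code: the Bernoulli arms, their means, the seed, and the particular schedule ε_t = min(1, k/t) are all assumptions for the example.

```python
import random

def epsilon_greedy(T, true_means, seed=0):
    """Sketch of the decaying epsilon-greedy loop from the slide.

    Arms are assumed Bernoulli with the given (normally unknown) means."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    means = [0.0] * k                  # empirical mean payoff per arm
    total = 0.0
    for t in range(1, T + 1):
        eps_t = min(1.0, k / t)        # eps_t = O(1/t): one common choice
        if rng.random() < eps_t:       # explore: arm chosen uniformly at random
            a = rng.randrange(k)
        else:                          # exploit: highest empirical mean payoff
            a = max(range(k), key=lambda i: means[i])
        reward = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]   # running-mean update
        total += reward
    return means, total

means, total = epsilon_greedy(10000, [0.2, 0.5, 0.8])
```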
Issues with the ε-Greedy Algorithm
§ Not elegant: the algorithm explicitly distinguishes between exploration and exploitation.
§ More importantly: exploration makes suboptimal choices (since it picks any arm equally likely).
§ Idea: when exploring/exploiting we need to compare arms.
Example: Comparing Arms
§ Suppose we have done some experiments:
§ Arm 1: 1001110001
§ Arm 2: 1
§ Arm 3: 1101001111
§ Mean arm values: Arm 1: 5/10, Arm 2: 1, Arm 3: 7/10
§ Which arm would you choose next?
§ Idea: look not only at the mean but also at the confidence!
Confidence Intervals
§ A confidence interval is a range of values within which we are sure the mean lies with a certain probability.
§ For example, we could believe μ_a is within [0.2, 0.5] with probability 0.95.
§ If we have tried an action less often, our estimated reward is less accurate, so the confidence interval is larger.
§ The interval shrinks as we get more information (try the action more often).
Confidence-Based Selection
§ Assume we know the confidence intervals.
§ Then, instead of trying the action with the highest mean, we can try the action with the highest upper bound on its confidence interval.
Confidence Intervals vs. Number of Pulls
The confidence interval becomes smaller as the number of pulls increases.
[1] Jean-Yves Audibert and Remi Munos. Introduction to Bandits: Algorithms and Theory. ICML 2011, Bellevue (WA), USA.
Calculating Confidence Bounds
§ Suppose we fix arm a:
§ Let r_{a,1}, …, r_{a,m} be the payoffs of arm a in the first m trials
§ r_{a,1}, …, r_{a,m} are i.i.d., taking values in [0, 1]
§ Our estimate: μ̂_{a,m} = (1/m) Σ_{j=1}^m r_{a,j}
§ We want to find b such that, with high probability, |μ_a − μ̂_{a,m}| ≤ b (and we want b to be as small as possible)
§ Goal: bound P(|μ_a − μ̂_{a,m}| ≥ b)
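Hoeffding's inequality gives P(|μ_a − μ̂_{a,m}| ≥ b) ≤ 2·exp(−2mb²) for i.i.d. rewards in [0, 1]; setting the right side equal to δ and solving gives b = sqrt(ln(2/δ) / (2m)). A small sketch of that radius, with the sample sizes chosen only for illustration:

```python
import math

def hoeffding_radius(m, delta):
    """Confidence radius b with P(|mu - mu_hat| >= b) <= delta,
    from Hoeffding's inequality for i.i.d. rewards in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

# The interval shrinks as the arm is pulled more often (m grows):
b10 = hoeffding_radius(10, 0.05)      # about 0.43 after 10 pulls
b1000 = hoeffding_radius(1000, 0.05)  # about 0.04 after 1000 pulls
```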
UCB1 Algorithm
§ UCB1 (upper confidence bound) algorithm; the bonus term comes from Hoeffding's inequality:
§ Let μ̂_1 = … = μ̂_k = 0 and m_1 = … = m_k = 0
§ μ̂_a is our estimate of the payoff of arm a
§ m_a is the number of pulls of arm a so far
§ For t = 1:T:
§ For each arm a, calculate UCB(a) = μ̂_a + α·sqrt(2 ln t / m_a)
§ Pick arm j = argmax_a UCB(a)
§ Pull arm j and observe y_t
§ m_j = m_j + 1 and μ̂_j = (1/m_j)(y_t + (m_j − 1) μ̂_j)
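The UCB1 loop above can be sketched as follows. The Bernoulli arms, their means, and the seed are assumptions for the example; the initial round-robin pass simply avoids dividing by m_a = 0.

```python
import math
import random

def ucb1(T, true_means, alpha=1.0, seed=0):
    """Sketch of UCB1 as on the slide, on assumed Bernoulli arms."""
    rng = random.Random(seed)
    k = len(true_means)
    mu_hat = [0.0] * k
    m = [0] * k
    for t in range(1, T + 1):
        if t <= k:
            j = t - 1          # pull each arm once so every m_a > 0
        else:
            j = max(range(k),
                    key=lambda a: mu_hat[a] + alpha * math.sqrt(2 * math.log(t) / m[a]))
        y = 1.0 if rng.random() < true_means[j] else 0.0
        m[j] += 1
        mu_hat[j] = (y + (m[j] - 1) * mu_hat[j]) / m[j]   # incremental mean, as on the slide
    return mu_hat, m

mu_hat, m = ucb1(5000, [0.2, 0.5, 0.8])
```

With 5000 rounds, the best arm ends up with the lion's share of the pulls, while suboptimal arms are pulled only on the order of ln T / Δ_a² times.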
UCB1 Algorithm: Discussion
§ The confidence interval grows with the total number of actions t we have taken.
§ But it shrinks with the number of times m_a we have tried arm a.
§ This ensures each arm is tried infinitely often, but still balances exploration and exploitation.
§ α plays the role of δ: α = 1 + sqrt(ln(2/δ)/2)
UCB1 Algorithm Performance
§ Theorem [Auer et al. 2002]:
§ Suppose the optimal mean payoff is μ* = max_a μ_a, and for each arm let Δ_a = μ* − μ_a.
§ Then the expected total regret after T rounds is O(Σ_{a: Δ_a > 0} ln T / Δ_a).
§ So, we get logarithmic regret, and R_T / T → 0 as T → ∞.
Contextual Bandits
§ A contextual bandit algorithm, in round t:
§ Observes the user u_t and a set A of arms together with their features x_{t,a} (context)
§ Based on payoffs from previous trials, chooses an arm a ∈ A and receives payoff r_{t,a}
§ Improves its arm selection strategy with each observation (x_{t,a}, a, r_{t,a})
LinUCB Algorithm [1]
[1] Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
LinUCB Algorithm
§ The expected reward of each arm is modeled as a linear function of the context.
§ Payoff of arm a: E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ*_a
§ The goal is to minimize regret, defined as the difference between the expected reward of the best arms and the expected reward of the selected arms:
R(T) ≝ E[Σ_{t=1}^T r_{t,a*_t}] − E[Σ_{t=1}^T r_{t,a_t}]
§ x_{t,a} is a d-dimensional feature vector
§ θ*_a is the unknown coefficient vector we aim to learn
LinUCB Algorithm
§ E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ*_a. How do we estimate θ_a?
§ The ridge-regularized linear regression solution for θ_a is
θ̂_a = argmin_θ { Σ_{m ∈ D_a} (x_{t,a}^T θ − b_a(m))² + ‖θ‖² },
which gives: θ̂_a = (D_a^T D_a + I_d)^{-1} D_a^T b_a
§ D_a is an m × d matrix of the m training inputs x_{t,a}
§ b_a is an m-dimensional vector of responses to a (click/no-click)
LinUCB Algorithm
§ Using techniques similar to those used for UCB:
|x_{t,a}^T θ̂_a − E[r_{t,a} | x_{t,a}]| ≤ α·sqrt(x_{t,a}^T (D_a^T D_a + I_d)^{-1} x_{t,a})
§ For a given context, we estimate the reward and the confidence interval, and choose:
a_t ≝ argmax_{a ∈ A_t} ( x_{t,a}^T θ̂_a + α·sqrt(x_{t,a}^T (D_a^T D_a + I_d)^{-1} x_{t,a}) ),
with α = 1 + sqrt(ln(2/δ)/2)
§ The first term is the estimated μ_a; the second is the confidence interval.
LinUCB Algorithm
§ Notation: A_a ≝ D_a^T D_a + I_d
§ Initialization, for each arm a:
§ A_a = I_d  // identity matrix, d × d
§ b_a = [0]_d  // vector of zeros
§ Online algorithm, for t = [1:T]:
§ Observe features for all arms a: x_{t,a} ∈ R^d
§ For each arm a:
§ θ_a = A_a^{-1} b_a  // regression coefficients
§ p_{t,a} = x_{t,a}^T θ_a + α·sqrt(x_{t,a}^T A_a^{-1} x_{t,a})
§ Choose arm a_t = argmax_a p_{t,a}
§ A_{a_t} = A_{a_t} + x_{t,a_t} [x_{t,a_t}]^T  // update A for the chosen arm a_t
§ b_{a_t} = b_{a_t} + r_t x_{t,a_t}  // update b for the chosen arm a_t
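The pseudocode above translates almost line by line into a minimal NumPy sketch. The class name, the per-round matrix inversion, and the toy contexts below are illustrative choices, not the paper's implementation.

```python
import numpy as np

class LinUCB:
    """Minimal sketch of the slide's LinUCB with one linear model per arm."""
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # A_a = I_d
        self.b = [np.zeros(d) for _ in range(n_arms)]  # b_a = 0

    def choose(self, xs):
        """xs: one context vector x_{t,a} per arm; returns the argmax arm."""
        scores = []
        for a, x in enumerate(xs):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                  # regression coefficients
            p = x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(p)
        return int(np.argmax(scores))

    def update(self, a, x, r):
        self.A[a] += np.outer(x, x)                    # A_a += x x^T
        self.b[a] += r * x                             # b_a += r x
```

Inverting A_a each round is what makes the method at most cubic in the feature dimension; a practical implementation would maintain A_a^{-1} incrementally instead.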
LinUCB: Discussion
§ LinUCB's computational complexity is:
§ Linear in the number of arms, and
§ At most cubic in the number of features.
§ LinUCB works well for a dynamic arm set (arms come and go).
§ For example, in news article recommendation, editors add/remove articles to/from a pool.
Differences between UCB1 and LinUCB
§ UCB1 directly estimates μ_a through experimentation (without any knowledge about arm a).
§ LinUCB estimates μ_a by regression: μ_a = x_{t,a}^T θ*_a.
§ The hope is that we will be able to learn faster by considering the context x_a (user, ad) of arm a.
§ θ*_a is the unknown coefficient vector we aim to learn.
Thompson Sampling
§ A simple, natural Bayesian heuristic:
§ Maintain a belief (distribution) over the unknown parameters.
§ Each time, pull an arm a and observe a reward r.
§ Initialize the priors using the belief distribution.
§ For t = 1:T:
§ Sample a random variable X from each arm's belief distribution
§ Select the arm with the largest X
§ Observe the result of the selected arm
§ Update the prior belief distribution of the selected arm
[1] Agrawal, S., and N. Goyal. "Analysis of Thompson sampling for the multi-armed bandit problem." arXiv preprint arXiv:1111.1797, 2011.
Simple Example
§ Coin toss: x ~ Bernoulli(θ)
§ Let's assume that θ ~ Beta(α₁, α₂), i.e., the prior is P(θ) ∝ θ^{α₁−1}(1 − θ)^{α₂−1}
§ Posterior: P(θ | X) = P(X | θ)P(θ) / ∫ P(X | θ')P(θ') dθ'
§ The prior is conjugate: the posterior is again a Beta distribution!
Thompson Sampling Using the Beta Belief Distribution
§ Theorem [Emilie et al. 2012]
§ Initially assume arm i has prior Beta(1, 1) on μ_i
§ S_i = # "Successes", F_i = # "Failures"; the posterior on μ_i is Beta(S_i + 1, F_i + 1)
Thompson Sampling Using the Beta Belief Distribution
§ Initialization: Arm 1: Beta(1, 1), Arm 2: Beta(1, 1), Arm 3: Beta(1, 1)
§ For each round:
§ Sample a random variable X from each arm's Beta distribution (e.g., X = 0.7, 0.2, 0.4)
§ Select the arm with the largest X (here, Arm 1)
§ Observe the result of the selected arm (Success!)
§ Update the Beta distribution of the selected arm: Arm 1 becomes Beta(2, 1), the others stay at Beta(1, 1)
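The Beta-Bernoulli walkthrough above can be sketched in a few lines. The arm means, seed, and horizon are assumptions for the example; the algorithm itself sees only the success/failure counts.

```python
import random

def thompson_step(successes, failures, rng):
    """One round of Beta-Bernoulli Thompson Sampling: each arm i has
    posterior Beta(S_i + 1, F_i + 1) under a flat Beta(1, 1) prior."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])  # arm with largest X

rng = random.Random(42)
S = [0, 0, 0]                      # successes per arm; all arms start at Beta(1, 1)
F = [0, 0, 0]                      # failures per arm
true_means = [0.1, 0.2, 0.9]       # illustrative only; unknown to the algorithm
for _ in range(2000):
    i = thompson_step(S, F, rng)
    if rng.random() < true_means[i]:
        S[i] += 1                  # success: the Beta's first parameter grows
    else:
        F[i] += 1                  # failure: the second parameter grows
```

Over time the posterior of the best arm concentrates and it receives almost all of the pulls.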
Our Research 1: Ensemble Contextual Bandits for Personalized Recommendation
[1] Tang, Liang, et al. "Ensemble contextual bandits for personalized recommendation." Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 2014.
Problem Statement
§ Problem setting: we have many different recommendation models (or policies):
§ Different CTR prediction algorithms.
§ Different exploration-exploitation algorithms.
§ Different parameter choices.
§ No data for model validation.
§ Problem statement: how can we build an ensemble model that is close to the best model in the cold-start situation?
How to Ensemble?
§ Classifier ensemble methods do not work in this setting:
§ The recommendation decision is NOT purely based on the predicted CTR.
§ Each individual model only tells us which item to recommend.
Ensemble Method
§ Our method: allocate recommendation chances to individual models.
§ Problem:
§ Better models should get more chances.
§ We do not know in advance which model is good or bad.
§ Ideal solution: allocate all chances to the best one.
Current Practice: Online Evaluation (or A/B Testing)
§ Let π1, π2, …, πm be the individual models.
§ Deploy π1, π2, …, πm into the online system at the same time.
§ Dispatch a small percentage of user traffic to each model.
§ After a period, choose the model with the best CTR as the production model.
If we have too many models, this will hurt the performance of the online system.
Our Idea 1 (HyperTS)
§ The CTR of model πi is an unknown random variable, Ri.
§ Goal: maximize (1/N) Σ_{t=1}^N r_t, the CTR of our ensemble model, where r_t is a random number drawn from R_{s(t)}, s(t) = 1, 2, …, or m. For each t = 1, …, N, we decide s(t).
§ Solution: Bernoulli Thompson Sampling (flat prior: Beta(1, 1)).
§ π1, π2, …, πm are the bandit arms.
§ No tricky parameters.
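HyperTS treats each candidate model as an arm of a Bernoulli bandit and runs Thompson Sampling over models. A minimal sketch, assuming hypothetical per-model CTRs for the simulated feedback (the class name and the simulation loop are illustrative, not the paper's code):

```python
import random

class HyperTS:
    """Sketch of HyperTS: Bernoulli Thompson Sampling over m candidate
    recommendation models, with a flat Beta(1, 1) prior on each model's CTR."""
    def __init__(self, m, seed=0):
        self.S = [0] * m             # observed clicks per model
        self.F = [0] * m             # observed non-clicks per model
        self.rng = random.Random(seed)

    def select_model(self):
        draws = [self.rng.betavariate(s + 1, f + 1)
                 for s, f in zip(self.S, self.F)]
        return max(range(len(draws)), key=lambda i: draws[i])

    def feedback(self, i, clicked):
        if clicked:
            self.S[i] += 1
        else:
            self.F[i] += 1

# Hypothetical usage: the model CTRs below are assumed for illustration only.
ctrs = [0.03, 0.06, 0.10]
hts = HyperTS(len(ctrs), seed=1)
for _ in range(5000):
    i = hts.select_model()                    # layer 1: pick a model pi_k
    clicked = hts.rng.random() < ctrs[i]      # its recommendation gets feedback r_t
    hts.feedback(i, clicked)
```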
An Example of HyperTS
§ In memory, we keep the estimated CTRs R1, R2, …, Rm for π1, π2, …, πm.
§ A user visits: HyperTS selects a candidate model, πk.
§ πk recommends item A to the user, given the context features x_t.
§ HyperTS updates the estimate of Rk based on the feedback r_t.
Two-Layer Decision
§ Layer 1: Bernoulli Thompson Sampling selects a model πk among π1, π2, …, πm.
§ Layer 2: the selected model πk recommends an item (A, B, C, …).
Our Idea 2 (HyperTSFB)
§ Limitation of the previous idea: for each recommendation, the user feedback is used by only one individual model (e.g., πk).
§ Motivation: can we update all of R1, R2, …, Rm with every user feedback? (Share every user feedback with every individual model.)
§ Assume each model can output the probability of recommending any item given x_t.
§ E.g., for a deterministic recommendation, it is 1 or 0.
§ For a user visit x_t:
§ πk is selected to perform the recommendation (k = 1, 2, …, or m).
§ Item A is recommended by πk given x_t.
§ Receive the user feedback (click or no click), r_t.
§ Ask every model π1, π2, …, πm: what is the probability of recommending A given x_t?
Estimate the CTRs of π1, π2, …, πm via importance sampling.
Experimental Setup
§ Experimental data:
§ Yahoo! Today News data logs (randomly displayed).
§ KDD Cup 2012 online advertising dataset.
§ Evaluation methods:
§ Yahoo! Today News: replay (see Lihong Li et al.'s WSDM 2011 paper).
§ KDD Cup 2012 data: simulation with a logistic regression model.
Comparative Methods
§ CTR prediction algorithm: logistic regression.
§ Exploitation-exploration algorithms: Random, ε-greedy, LinUCB, Softmax, Epoch-greedy, Thompson sampling.
§ HyperTS and HyperTSFB.
Results for Yahoo! News Data
§ Every 100,000 impressions are aggregated into a bucket.
Results for Yahoo! News Data (Cont.)
Conclusions
§ The performance of the baseline exploitation-exploration algorithms is very sensitive to the parameter setting.
§ In a cold-start situation, there is not enough data to tune the parameters.
§ HyperTS and HyperTSFB can come close to the optimal baseline algorithm (with no guarantee of beating the optimal one), even when some bad individual models are included.
§ For contextual Thompson sampling, the performance depends on the choice of the prior distribution for the logistic regression.
§ For online Bayesian learning, the posterior distribution approximation is not accurate (we cannot store the past data).
Our Research 2: Personalized Recommendation via Parameter-Free Contextual Bandits
[1] Tang, Liang, et al. "Personalized recommendation via parameter-free contextual bandits." Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2015.
How to Balance the Tradeoff
§ Performance is mainly determined by the tradeoff. Existing algorithms find the tradeoff through user input parameters and data characteristics (e.g., the variance of the estimated reward).
§ Existing algorithms are all parameter-sensitive.
[Figure: an algorithm performs well when its parameter is good and badly when its parameter is bad.]
Chicken-and-Egg Problem for Existing Bandit Algorithms
§ Why do we use bandit algorithms?
§ To solve the cold-start problem (not enough data for estimating user preferences).
§ How do we find the best input parameters?
§ Tune the parameters online or offline.
But if you already have the data or the online traffic to tune the parameters, why do you need bandit algorithms?
Our Work
§ Parameter-free: it finds the tradeoff from data characteristics automatically.
§ Robust: existing algorithms can perform very badly if the input parameter is not appropriate.
Solution
§ Thompson Sampling:
§ Randomly select a model coefficient vector from the posterior distribution and find the "best" item.
§ The prior is the input parameter for computing the posterior.
§ Non-Bayesian Thompson Sampling (our solution):
§ Randomly select a bootstrap sample, find the MLE of the model coefficients, and find the "best" item.
§ Bootstrapping has no input parameter.
Bootstrap Bandit Algorithm
Input: a feature vector x of the context.
Algorithm:
if each article has sufficient observations then {
    for each article i = 1, …, k:
        i.  Di ← randomly sample nk impressions of article i with replacement  // generate a bootstrap sample
        ii. θi ← MLE coefficients on Di  // model estimation on the bootstrap sample
    select the article i* = argmax f(x, θi), i = 1, …, k, to show
} else {
    randomly select an article that does not have sufficient observations to show
}
f(x, θi) is the prediction function.
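The bootstrap selection step can be sketched as below. This is an illustrative skeleton only: `predict` and `fit_mle` are hypothetical stand-ins for the prediction function f(x, θ) and the MLE fit, and the toy data in the usage are invented.

```python
import random

def bootstrap_select(x, data, n_min, predict, fit_mle, rng):
    """Sketch of one bootstrap-bandit decision.

    data[i] holds the (features, reward) observations of article i;
    predict(x, theta) and fit_mle(sample) are hypothetical stand-ins."""
    cold = [i for i, obs in enumerate(data) if len(obs) < n_min]
    if cold:
        # some article lacks sufficient observations: explore one of them
        return rng.choice(cold)
    best_i, best_score = 0, float("-inf")
    for i, obs in enumerate(data):
        boot = [rng.choice(obs) for _ in range(len(obs))]  # sample with replacement
        theta = fit_mle(boot)                              # MLE on the bootstrap sample
        score = predict(x, theta)                          # f(x, theta_i)
        if score > best_score:
            best_i, best_score = i, score
    return best_i

# Hypothetical usage: theta is simply the mean reward, and predict returns it.
mean_reward = lambda obs: sum(r for _, r in obs) / len(obs)
rng = random.Random(7)
data = [[(None, 0.0)] * 10, [(None, 1.0)] * 10]   # article 1 is always clicked
picked = bootstrap_select(None, data, 5, lambda x, th: th, mean_reward, rng)
```

Resampling injects the randomness that the Bayesian posterior draw provides in standard Thompson sampling, with no prior to choose.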
Online Bootstrap Bandits
§ Why online bootstrap?
§ It is inefficient to generate a bootstrap sample for each recommendation.
§ How to bootstrap online?
§ Keep the coefficients estimated from each bootstrap sample in memory.
§ No need to keep the bootstrap samples themselves in memory.
§ When a new data point arrives, incrementally update the estimated coefficients for each bootstrap sample [1].
[1] N. C. Oza and S. Russell. Online bagging and boosting. In IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages 2340-2345, 2005.
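The online-bagging trick of Oza & Russell replaces explicit resampling: each kept replicate absorbs a new example a Poisson(1)-distributed number of times, which mimics sampling with replacement as the stream grows. A sketch, where `update_fn` is a hypothetical incremental estimator update and the running-mean usage is invented for illustration:

```python
import math
import random

def poisson1(rng):
    """Draw from Poisson(1) by inversion (Knuth's multiplicative method)."""
    L = math.exp(-1.0)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def online_bootstrap_update(coeffs, update_fn, example, rng):
    """Online bagging: instead of storing bootstrap samples, update each
    kept coefficient vector Poisson(1) times with the new example."""
    for b in range(len(coeffs)):
        for _ in range(poisson1(rng)):
            coeffs[b] = update_fn(coeffs[b], example)
    return coeffs

# Hypothetical usage: each replicate keeps a running mean of rewards.
def running_mean(state, x):
    m, n = state
    return (m + (x - m) / (n + 1), n + 1)

rng = random.Random(0)
replicates = [(0.0, 0)] * 5          # five bootstrap replicates, (mean, count)
for x in [1.0, 0.0, 1.0, 1.0]:
    replicates = online_bootstrap_update(replicates, running_mean, x, rng)
```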
Experiment Data
§ Two public datasets:
§ News recommendation data (Yahoo! Today News):
§ News displayed on the Yahoo! front page from Oct. 2nd, 2011 to Oct. 16th, 2011.
§ 28,041,015 user visit events.
§ A 136-dimensional feature vector for each event.
§ Online advertising data (KDD Cup 2012, Track 2):
§ The dataset was collected by a search engine and published for KDD Cup 2012.
§ 1 million user visit events.
§ A 1,070,866-dimensional context feature vector.
Offline Evaluation Metric and Methods
§ Setup: overall CTR (average reward of a trial).
§ Evaluation methods:
§ The experiment on Yahoo! Today News is evaluated with the replay method [1].
§ The reward on the KDD Cup 2012 ad data is simulated with a weight vector for each ad [2].
[1] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM, pages 297-306, 2011.
[2] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In NIPS, pages 2249-2257, 2011.
Experimental Methods
§ Our method:
§ Bootstrap(B), where B is the number of bootstrap samples.
§ Baselines:
§ Random: randomly selects an arm to pull.
§ Exploit: only considers exploitation, without exploration.
§ ε-greedy(ε): ε is the probability of exploration.
§ LinUCB(α): pulls the arm with the largest score, as defined by the parameter α.
§ TS(q0): Thompson sampling with logistic regression, where q0⁻¹ is the prior variance and 0 is the prior mean.
§ TSNR(q0): similar to TS(q0), but the logistic regression is not regularized by the prior.
Experiment (Yahoo! News Data)
§ All numbers are relative to the random model.
Experiment (Ads, KDD Cup '12)
§ All numbers are relative to the random model.
CTR over Time Buckets (Yahoo! News Data)
CTR over Time Buckets (KDD Cup Ads Data)
Efficiency
§ Time cost for different bootstrap sample sizes.
Summary of Experiments
§ For solving the contextual bandit problem, the ε-greedy and LinUCB algorithms can achieve optimal performance, but the input parameters that control exploration need to be tuned carefully.
§ The probability matching strategies depend heavily on the selection of the prior.
§ Our proposed algorithm is a safe choice for building predictive models for contextual bandit problems in the cold-start scenario.
Conclusion
§ We propose a non-Bayesian Thompson Sampling method to solve the personalized recommendation problem.
§ We give both theoretical and empirical analyses showing that the performance of Thompson sampling depends on the choice of the prior.
§ We conduct extensive experiments on real datasets to demonstrate the efficacy of the proposed method and other contextual bandit algorithms.
Future Work
§ MAB with similarity information
§ MAB in a changing environment
§ Explore-exploit tradeoff in mechanism design
§ Explore-exploit learning with limited resources
§ Risk vs. reward tradeoff in MAB
[1] http://research.microsoft.com/en-us/projects/bandits/
Question and Answer
Thanks!