An Introduction to Contextual Bandits Algorithm
Ph.D. Candidate: Qing Wang, 2016, Florida International University
Outline
§ Introduction
§ Motivation
§ Context-free Bandit Algorithms
§ Contextual Bandit Algorithms
§ Our Work
§ Ensemble Contextual Bandits for Personalized Recommendation
§ Personalized Recommendation via Parameter-Free Contextual Bandits
§ Future Work
§ Q&A
What is Personalized Recommendation?
§ Personalized recommendation helps users find interesting items based on the individual interests of each user.
§ Ultimate Goal: maximize user engagement.
What is the Cold Start Problem?
§ We do not have enough observations for new items or new users.
§ How can we predict the preferences of users if we do not have data?
§ Many practical issues with offline data:
§ Historical user log data is biased.
§ User interest may change over time.
Approach: Multi-armed Bandit Algorithm
§ A gambler walks into a casino.
§ A row of slot machines provides random rewards.
Objective: Maximize the sum of rewards (money)!
Example: News Personalization
§ Recommend news based on users' interests.
§ Goal: Maximize users' click-through rate (CTR).
[1] Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
Example: News Personalization
§ There are a bunch of articles in the news pool.
§ Users arrive sequentially, ready to be served.
[1] Zhou Li, "News personalization with multi-armed bandits."
Example: News Personalization
§ At each time, we want to select one article for the user.
[Diagram: the MAB selects article 1 from the news articles and asks the user: Like it?]
Example: News Personalization
§ Goal: maximize CTR.
[Diagram: the MAB shows article 1; the user's response: Like it? Not really!]
Example: News Personalization
§ Update the model with the user's feedback.
[Diagram: article 1 is disliked ("Not really!"), and the feedback flows back to the MAB.]
Example: News Personalization
§ Update the model once given the feedback.
[Diagram: the MAB now shows article 2; the user's response: Like it? Yeah!]
Example: News Personalization
§ Update the model once given the feedback.
[Diagram: the feedback on article 2 ("Yeah!") flows back to the MAB.]
How about articles 3, 4, 5, …?
Multi-armed Bandit Definition
§ The MAB problem is a classical paradigm in machine learning in which an online algorithm chooses from a set of strategies in a sequence of trials so as to maximize the total payoff of the chosen strategies. [1]
[1] http://research.microsoft.com/en-us/projects/bandits/
Application: Clinical Trial
§ Two treatments with unknown effectiveness.
[1] Einstein, A., B. Podolsky, and N. Rosen, 1935, "Can quantum-mechanical description of physical reality be considered complete?", Phys. Rev. 47, 777-780.
Web Advertising
§ Where to place the ad?
[1] Tang, L., R. Rosales, A. Singh, et al. "Automatic ad format selection via contextual bandits." Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2013: 1587-1594.
Playing Golf with Multiple Balls
[1] Dumitriu, Ioana, Prasad Tetali, and Peter Winkler. "On playing golf with two balls." SIAM Journal on Discrete Mathematics 16.4 (2003): 604-615.
Multi-Agent System
§ K agents tracking N (N > K) targets.
[1] Le Ny, Jerome, Munther Dahleh, and Eric Feron. "Multi-agent task assignment in the bandit framework." Decision and Control, 2006. 45th IEEE Conference on. IEEE, 2006.
Some Jargon Terms [1]
§ Arm: one idea/strategy
§ Bandit: a group of ideas (strategies)
§ Pull/Play/Trial: one chance to try your strategy
§ Reward: the unit of success we measure after each pull
§ Regret: performance metric
[1] Bandit Algorithms for Website Optimization: Developing, Deploying, and Debugging. John Myles White, O'Reilly Media, 2012.
K-Armed Bandit
[1] CS246: Mining Massive Data Sets 2015, Stanford University
§ Each arm a:
§ Wins (reward = 1) with fixed (unknown) probability μ_a
§ Loses (reward = 0) with fixed (unknown) probability 1 − μ_a
§ All draws are independent given μ_1, …, μ_k
§ How should we pull arms to maximize total reward? (Estimate each arm's probability of winning, μ_a.)
Model of K-Armed Bandit
§ Set of k choices (arms)
§ Each choice a is associated with an unknown probability distribution P_a supported in [0, 1]
§ We play the game for T rounds
§ In each round t:
§ We pick some arm j
§ We obtain a random sample X_t from P_j (the reward is independent of previous draws)
§ Goal: maximize Σ_{t=1}^T X_t (without knowing μ_a)
§ However, every time we pull some arm a we get to learn a bit about μ_a.
Performance Metric: Regret
§ Let μ_a be the mean of P_a
§ Payoff/reward of the best arm: μ* = max{μ_a : a = 1, …, k}
§ Let i_1, …, i_T be the sequence of arms pulled
§ Instantaneous regret at time t: r_t = μ* − μ_{i_t}
§ Total regret: R_T = Σ_{t=1}^T r_t
§ Typical goal: an arm allocation strategy that guarantees R_T / T → 0 as T → ∞
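The regret bookkeeping above can be sketched in a few lines of Python. Note this is an illustrative example only: the true means and the pull sequence are invented, and in practice μ_a is unknown to the algorithm.

```python
# Hypothetical example: total regret of a fixed pull sequence,
# computed with the (normally unknown) true arm means.
mu = [0.3, 0.5, 0.7]                         # true means mu_a (assumed for illustration)
mu_star = max(mu)                            # best-arm payoff mu*
pulls = [0, 2, 1, 2, 2]                      # i_1, ..., i_T: arms pulled in rounds 1..T
instant = [mu_star - mu[i] for i in pulls]   # instantaneous regret r_t = mu* - mu_{i_t}
R_T = sum(instant)                           # total regret R_T = sum_t r_t
avg = R_T / len(pulls)                       # R_T / T, which a good strategy drives to 0
```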
Allocation Strategies
§ If we knew the payoffs, which arm should we pull?
§ Best arm: μ* = max{μ_a : a = 1, …, k}
§ What if we only care about estimating the payoffs μ_a?
§ Pick each of the k arms equally often: T/k times each
§ Estimate: μ̂_a = (k/T) Σ_{j=1}^{T/k} X_{a,j}
§ Total regret: R_T = (T/k) Σ_{a=1}^{k} (μ* − μ_a)
Exploitation vs. Exploration
§ Tradeoff:
§ With only exploitation (making decisions based on historical data), you will have bad estimates for the "best" items.
§ With only exploration (gathering data about arm payoffs), you will have low user engagement.
Algorithms for the Exploration & Exploitation Tradeoff
§ Context-free:
1. ε-greedy algorithm [1]
2. UCB1 [2]
§ Contextual:
1. EXP3, EXP4
2. Thompson Sampling [3]
3. LinUCB [4]
[1] Wynn, P. "On the convergence and stability of the epsilon algorithm." SIAM Journal on Numerical Analysis, 1966, 3(1): 91-122.
[2] Auer, P., N. Cesa-Bianchi, and P. Fischer. "Finite-time analysis of the multi-armed bandit problem." Machine Learning, 2002, 47(2-3): 235-256.
[3] Agrawal, S., and N. Goyal. "Analysis of Thompson sampling for the multi-armed bandit problem." arXiv preprint arXiv:1111.1797, 2011.
[4] Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
ε-Greedy Algorithm
§ It tries to be fair to the two opposing goals of exploration (with probability ε) and exploitation (with probability 1 − ε) by using a simple mechanism: flipping a coin.
[Diagram: in round t, with probability ε explore (each arm chosen with probability ε/k); with probability 1 − ε exploit (choose the best arm a*).]
§ For t = 1:T:
§ Set ε_t = O(1/t)
§ With probability ε_t: explore by picking an arm chosen uniformly at random
§ With probability 1 − ε_t: exploit by picking the arm with the highest empirical mean payoff
§ Theorem [Auer et al. '02]: for a suitable choice of ε_t, the total regret grows only logarithmically in T.
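The ε_t-greedy loop above can be sketched as follows. This is a minimal illustration, not the authors' code: the Bernoulli arms, their means, the seed, and the particular schedule ε_t = min(1, k/t) are all assumptions for the example.

```python
import random

def epsilon_greedy(T, true_means, seed=0):
    """Sketch of the decaying epsilon-greedy loop from the slide.

    Arms are assumed Bernoulli with the given (normally unknown) means."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    means = [0.0] * k                  # empirical mean payoff per arm
    total = 0.0
    for t in range(1, T + 1):
        eps_t = min(1.0, k / t)        # eps_t = O(1/t): one common choice
        if rng.random() < eps_t:       # explore: arm chosen uniformly at random
            a = rng.randrange(k)
        else:                          # exploit: highest empirical mean payoff
            a = max(range(k), key=lambda i: means[i])
        reward = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]   # running-mean update
        total += reward
    return means, total

means, total = epsilon_greedy(10000, [0.2, 0.5, 0.8])
```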
Issues with the ε-Greedy Algorithm
§ Not elegant: the algorithm explicitly distinguishes between exploration and exploitation.
§ More importantly: exploration makes suboptimal choices (since it picks any arm equally likely).
§ Idea: when exploring/exploiting we need to compare arms.
Example: Comparing Arms
§ Suppose we have done some experiments:
§ Arm 1: 1001110001
§ Arm 2: 1
§ Arm 3: 1101001111
§ Mean arm values: Arm 1: 5/10, Arm 2: 1, Arm 3: 7/10
§ Which arm would you choose next?
§ Idea: look not only at the mean but also at the confidence!
Confidence Intervals
§ A confidence interval is a range of values within which we are sure the mean lies with a certain probability.
§ For example, we could believe μ_a is within [0.2, 0.5] with probability 0.95.
§ If we have tried an action less often, our estimated reward is less accurate, so the confidence interval is larger.
§ The interval shrinks as we get more information (try the action more often).
Confidence-Based Selection
§ Assume we know the confidence intervals.
§ Then, instead of trying the action with the highest mean, we can try the action with the highest upper bound on its confidence interval.
Confidence Intervals vs. Number of Pulls
The confidence interval becomes smaller as the number of pulls increases.
[1] Jean-Yves Audibert and Remi Munos. Introduction to Bandits: Algorithms and Theory. ICML 2011, Bellevue (WA), USA.
Calculating Confidence Bounds
§ Suppose we fix arm a:
§ Let r_{a,1}, …, r_{a,m} be the payoffs of arm a in the first m trials
§ r_{a,1}, …, r_{a,m} are i.i.d., taking values in [0, 1]
§ Our estimate: μ̂_{a,m} = (1/m) Σ_{j=1}^m r_{a,j}
§ We want to find b such that, with high probability, |μ_a − μ̂_{a,m}| ≤ b (and we want b to be as small as possible)
§ Goal: bound P(|μ_a − μ̂_{a,m}| ≥ b)
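Hoeffding's inequality gives P(|μ_a − μ̂_{a,m}| ≥ b) ≤ 2·exp(−2mb²) for i.i.d. rewards in [0, 1]; setting the right side equal to δ and solving gives b = sqrt(ln(2/δ) / (2m)). A small sketch of that radius, with the sample sizes chosen only for illustration:

```python
import math

def hoeffding_radius(m, delta):
    """Confidence radius b with P(|mu - mu_hat| >= b) <= delta,
    from Hoeffding's inequality for i.i.d. rewards in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

# The interval shrinks as the arm is pulled more often (m grows):
b10 = hoeffding_radius(10, 0.05)      # about 0.43 after 10 pulls
b1000 = hoeffding_radius(1000, 0.05)  # about 0.04 after 1000 pulls
```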
UCB1 Algorithm
§ UCB1 (upper confidence bound) algorithm; the bonus term comes from Hoeffding's inequality:
§ Let μ̂_1 = … = μ̂_k = 0 and m_1 = … = m_k = 0
§ μ̂_a is our estimate of the payoff of arm a
§ m_a is the number of pulls of arm a so far
§ For t = 1:T:
§ For each arm a, calculate UCB(a) = μ̂_a + α·sqrt(2 ln t / m_a)
§ Pick arm j = argmax_a UCB(a)
§ Pull arm j and observe y_t
§ m_j = m_j + 1 and μ̂_j = (1/m_j)(y_t + (m_j − 1) μ̂_j)
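The UCB1 loop above can be sketched as follows. The Bernoulli arms, their means, and the seed are assumptions for the example; the initial round-robin pass simply avoids dividing by m_a = 0.

```python
import math
import random

def ucb1(T, true_means, alpha=1.0, seed=0):
    """Sketch of UCB1 as on the slide, on assumed Bernoulli arms."""
    rng = random.Random(seed)
    k = len(true_means)
    mu_hat = [0.0] * k
    m = [0] * k
    for t in range(1, T + 1):
        if t <= k:
            j = t - 1          # pull each arm once so every m_a > 0
        else:
            j = max(range(k),
                    key=lambda a: mu_hat[a] + alpha * math.sqrt(2 * math.log(t) / m[a]))
        y = 1.0 if rng.random() < true_means[j] else 0.0
        m[j] += 1
        mu_hat[j] = (y + (m[j] - 1) * mu_hat[j]) / m[j]   # incremental mean, as on the slide
    return mu_hat, m

mu_hat, m = ucb1(5000, [0.2, 0.5, 0.8])
```

With 5000 rounds, the best arm ends up with the lion's share of the pulls, while suboptimal arms are pulled only on the order of ln T / Δ_a² times.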
UCB1 Algorithm: Discussion
§ The confidence interval grows with the total number of actions t we have taken.
§ But it shrinks with the number of times m_a we have tried arm a.
§ This ensures each arm is tried infinitely often, but still balances exploration and exploitation.
§ α plays the role of δ: α = 1 + sqrt(ln(2/δ)/2)
UCB1 Algorithm Performance
§ Theorem [Auer et al. 2002]:
§ Suppose the optimal mean payoff is μ* = max_a μ_a, and for each arm let Δ_a = μ* − μ_a.
§ Then the expected total regret after T rounds is O(Σ_{a: Δ_a > 0} ln T / Δ_a).
§ So, we get logarithmic regret, and R_T / T → 0 as T → ∞.
Contextual Bandits
§ A contextual bandit algorithm, in round t:
§ Observes the user u_t and a set A of arms together with their features x_{t,a} (context)
§ Based on payoffs from previous trials, chooses an arm a ∈ A and receives payoff r_{t,a}
§ Improves its arm selection strategy with each observation (x_{t,a}, a, r_{t,a})
LinUCB Algorithm [1]
[1] Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." Proceedings of the 19th International Conference on World Wide Web. ACM, 2010.
LinUCB Algorithm
§ The expected reward of each arm is modeled as a linear function of the context.
§ Payoff of arm a: E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ*_a
§ The goal is to minimize regret, defined as the difference between the expected reward of the best arms and the expected reward of the selected arms:
R(T) ≝ E[Σ_{t=1}^T r_{t,a*_t}] − E[Σ_{t=1}^T r_{t,a_t}]
§ x_{t,a} is a d-dimensional feature vector
§ θ*_a is the unknown coefficient vector we aim to learn
LinUCB Algorithm
§ E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ*_a. How do we estimate θ_a?
§ The ridge-regularized linear regression solution for θ_a is
θ̂_a = argmin_θ { Σ_{m ∈ D_a} (x_{t,a}^T θ − b_a(m))² + ‖θ‖² },
which gives: θ̂_a = (D_a^T D_a + I_d)^{-1} D_a^T b_a
§ D_a is an m × d matrix of the m training inputs x_{t,a}
§ b_a is an m-dimensional vector of responses to a (click/no-click)
LinUCB Algorithm
§ Using techniques similar to those used for UCB:
|x_{t,a}^T θ̂_a − E[r_{t,a} | x_{t,a}]| ≤ α·sqrt(x_{t,a}^T (D_a^T D_a + I_d)^{-1} x_{t,a})
§ For a given context, we estimate the reward and the confidence interval, and choose:
a_t ≝ argmax_{a ∈ A_t} ( x_{t,a}^T θ̂_a + α·sqrt(x_{t,a}^T (D_a^T D_a + I_d)^{-1} x_{t,a}) ),
with α = 1 + sqrt(ln(2/δ)/2)
§ The first term is the estimated μ_a; the second is the confidence interval.
LinUCB Algorithm
§ Notation: A_a ≝ D_a^T D_a + I_d
§ Initialization, for each arm a:
§ A_a = I_d  // identity matrix, d × d
§ b_a = [0]_d  // vector of zeros
§ Online algorithm, for t = [1:T]:
§ Observe features for all arms a: x_{t,a} ∈ R^d
§ For each arm a:
§ θ_a = A_a^{-1} b_a  // regression coefficients
§ p_{t,a} = x_{t,a}^T θ_a + α·sqrt(x_{t,a}^T A_a^{-1} x_{t,a})
§ Choose arm a_t = argmax_a p_{t,a}
§ A_{a_t} = A_{a_t} + x_{t,a_t} [x_{t,a_t}]^T  // update A for the chosen arm a_t
§ b_{a_t} = b_{a_t} + r_t x_{t,a_t}  // update b for the chosen arm a_t
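The pseudocode above translates almost line by line into a minimal NumPy sketch. The class name, the per-round matrix inversion, and the toy contexts below are illustrative choices, not the paper's implementation.

```python
import numpy as np

class LinUCB:
    """Minimal sketch of the slide's LinUCB with one linear model per arm."""
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # A_a = I_d
        self.b = [np.zeros(d) for _ in range(n_arms)]  # b_a = 0

    def choose(self, xs):
        """xs: one context vector x_{t,a} per arm; returns the argmax arm."""
        scores = []
        for a, x in enumerate(xs):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                  # regression coefficients
            p = x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(p)
        return int(np.argmax(scores))

    def update(self, a, x, r):
        self.A[a] += np.outer(x, x)                    # A_a += x x^T
        self.b[a] += r * x                             # b_a += r x
```

Inverting A_a each round is what makes the method at most cubic in the feature dimension; a practical implementation would maintain A_a^{-1} incrementally instead.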
LinUCB: Discussion
§ LinUCB's computational complexity is:
§ Linear in the number of arms, and
§ At most cubic in the number of features.
§ LinUCB works well for a dynamic arm set (arms come and go).
§ For example, in news article recommendation, editors add/remove articles to/from a pool.
Differences between UCB1 and LinUCB
§ UCB1 directly estimates μ_a through experimentation (without any knowledge about arm a).
§ LinUCB estimates μ_a by regression: μ_a = x_{t,a}^T θ*_a.
§ The hope is that we will be able to learn faster by considering the context x_a (user, ad) of arm a.
§ θ*_a is the unknown coefficient vector we aim to learn.
Thompson Sampling
§ A simple, natural Bayesian heuristic:
§ Maintain a belief (distribution) over the unknown parameters.
§ Each time, pull an arm a and observe a reward r.
§ Initialize the priors using the belief distribution.
§ For t = 1:T:
§ Sample a random variable X from each arm's belief distribution
§ Select the arm with the largest X
§ Observe the result of the selected arm
§ Update the prior belief distribution of the selected arm
[1] Agrawal, S., and N. Goyal. "Analysis of Thompson sampling for the multi-armed bandit problem." arXiv preprint arXiv:1111.1797, 2011.
Simple Example
§ Coin toss: x ~ Bernoulli(θ)
§ Let's assume that θ ~ Beta(α₁, α₂), i.e., the prior is P(θ) ∝ θ^{α₁−1}(1 − θ)^{α₂−1}
§ Posterior: P(θ | X) = P(X | θ)P(θ) / ∫ P(X | θ')P(θ') dθ'
§ The prior is conjugate: the posterior is again a Beta distribution!
Thompson Sampling Using the Beta Belief Distribution
§ Theorem [Emilie et al. 2012]
§ Initially assume arm i has prior Beta(1, 1) on μ_i
§ S_i = # "Successes", F_i = # "Failures"; the posterior on μ_i is Beta(S_i + 1, F_i + 1)
Thompson Sampling Using the Beta Belief Distribution
§ Initialization: Arm 1: Beta(1, 1), Arm 2: Beta(1, 1), Arm 3: Beta(1, 1)
§ For each round:
§ Sample a random variable X from each arm's Beta distribution (e.g., X = 0.7, 0.2, 0.4)
§ Select the arm with the largest X (here, Arm 1)
§ Observe the result of the selected arm (Success!)
§ Update the Beta distribution of the selected arm: Arm 1 becomes Beta(2, 1), the others stay at Beta(1, 1)
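The Beta-Bernoulli walkthrough above can be sketched in a few lines. The arm means, seed, and horizon are assumptions for the example; the algorithm itself sees only the success/failure counts.

```python
import random

def thompson_step(successes, failures, rng):
    """One round of Beta-Bernoulli Thompson Sampling: each arm i has
    posterior Beta(S_i + 1, F_i + 1) under a flat Beta(1, 1) prior."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])  # arm with largest X

rng = random.Random(42)
S = [0, 0, 0]                      # successes per arm; all arms start at Beta(1, 1)
F = [0, 0, 0]                      # failures per arm
true_means = [0.1, 0.2, 0.9]       # illustrative only; unknown to the algorithm
for _ in range(2000):
    i = thompson_step(S, F, rng)
    if rng.random() < true_means[i]:
        S[i] += 1                  # success: the Beta's first parameter grows
    else:
        F[i] += 1                  # failure: the second parameter grows
```

Over time the posterior of the best arm concentrates and it receives almost all of the pulls.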
Our Research 1: Ensemble Contextual Bandits for Personalized Recommendation
[1] Tang, Liang, et al. "Ensemble contextual bandits for personalized recommendation." Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 2014.
Problem Statement
§ Problem setting: we have many different recommendation models (or policies):
§ Different CTR prediction algorithms.
§ Different exploration-exploitation algorithms.
§ Different parameter choices.
§ No data for model validation.
§ Problem statement: how can we build an ensemble model that is close to the best model in the cold-start situation?
How to Ensemble?
§ Classifier ensemble methods do not work in this setting:
§ The recommendation decision is NOT purely based on the predicted CTR.
§ Each individual model only tells us which item to recommend.
Ensemble Method
§ Our method: allocate recommendation chances to individual models.
§ Problem:
§ Better models should get more chances.
§ We do not know in advance which model is good or bad.
§ Ideal solution: allocate all chances to the best one.
Current Practice: Online Evaluation (or A/B Testing)
§ Let π1, π2, …, πm be the individual models.
§ Deploy π1, π2, …, πm into the online system at the same time.
§ Dispatch a small percentage of user traffic to each model.
§ After a period, choose the model with the best CTR as the production model.
If we have too many models, this will hurt the performance of the online system.
Our Idea 1 (HyperTS)
§ The CTR of model πi is an unknown random variable, Ri.
§ Goal: maximize (1/N) Σ_{t=1}^N r_t, the CTR of our ensemble model, where r_t is a random number drawn from R_{s(t)}, s(t) = 1, 2, …, or m. For each t = 1, …, N, we decide s(t).
§ Solution: Bernoulli Thompson Sampling (flat prior: Beta(1, 1)).
§ π1, π2, …, πm are the bandit arms.
§ No tricky parameters.
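HyperTS treats each candidate model as an arm of a Bernoulli bandit and runs Thompson Sampling over models. A minimal sketch, assuming hypothetical per-model CTRs for the simulated feedback (the class name and the simulation loop are illustrative, not the paper's code):

```python
import random

class HyperTS:
    """Sketch of HyperTS: Bernoulli Thompson Sampling over m candidate
    recommendation models, with a flat Beta(1, 1) prior on each model's CTR."""
    def __init__(self, m, seed=0):
        self.S = [0] * m             # observed clicks per model
        self.F = [0] * m             # observed non-clicks per model
        self.rng = random.Random(seed)

    def select_model(self):
        draws = [self.rng.betavariate(s + 1, f + 1)
                 for s, f in zip(self.S, self.F)]
        return max(range(len(draws)), key=lambda i: draws[i])

    def feedback(self, i, clicked):
        if clicked:
            self.S[i] += 1
        else:
            self.F[i] += 1

# Hypothetical usage: the model CTRs below are assumed for illustration only.
ctrs = [0.03, 0.06, 0.10]
hts = HyperTS(len(ctrs), seed=1)
for _ in range(5000):
    i = hts.select_model()                    # layer 1: pick a model pi_k
    clicked = hts.rng.random() < ctrs[i]      # its recommendation gets feedback r_t
    hts.feedback(i, clicked)
```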
An Example of HyperTS
§ In memory, we keep the estimated CTRs R1, R2, …, Rm for π1, π2, …, πm.
§ A user visits: HyperTS selects a candidate model, πk.
§ πk recommends item A to the user, given the context features x_t.
§ HyperTS updates the estimate of Rk based on the feedback r_t.
Two-Layer Decision
§ Layer 1: Bernoulli Thompson Sampling selects a model πk among π1, π2, …, πm.
§ Layer 2: the selected model πk recommends an item (A, B, C, …).
Our Idea 2 (HyperTSFB)
§ Limitation of the previous idea: for each recommendation, the user feedback is used by only one individual model (e.g., πk).
§ Motivation: can we update all of R1, R2, …, Rm with every user feedback? (Share every user feedback with every individual model.)
§ Assume each model can output the probability of recommending any item given x_t.
§ E.g., for a deterministic recommendation, it is 1 or 0.
§ For a user visit x_t:
§ πk is selected to perform the recommendation (k = 1, 2, …, or m).
§ Item A is recommended by πk given x_t.
§ Receive the user feedback (click or no click), r_t.
§ Ask every model π1, π2, …, πm: what is the probability of recommending A given x_t?
Estimate the CTRs of π1, π2, …, πm via importance sampling.
Experimental Setup
§ Experimental data:
§ Yahoo! Today News data logs (randomly displayed).
§ KDD Cup 2012 online advertising dataset.
§ Evaluation methods:
§ Yahoo! Today News: replay (see Lihong Li et al.'s WSDM 2011 paper).
§ KDD Cup 2012 data: simulation with a logistic regression model.
Comparative Methods
§ CTR prediction algorithm: logistic regression.
§ Exploitation-exploration algorithms: Random, ε-greedy, LinUCB, Softmax, Epoch-greedy, Thompson sampling.
§ HyperTS and HyperTSFB.
Results for Yahoo! News Data
§ Every 100,000 impressions are aggregated into a bucket.
Results for Yahoo! News Data (Cont.)
Conclusions
§ The performance of the baseline exploitation-exploration algorithms is very sensitive to the parameter setting.
§ In a cold-start situation, there is not enough data to tune the parameters.
§ HyperTS and HyperTSFB can come close to the optimal baseline algorithm (with no guarantee of beating the optimal one), even when some bad individual models are included.
§ For contextual Thompson sampling, the performance depends on the choice of the prior distribution for the logistic regression.
§ For online Bayesian learning, the posterior distribution approximation is not accurate (we cannot store the past data).
Our Research 2: Personalized Recommendation via Parameter-Free Contextual Bandits
[1] Tang, Liang, et al. "Personalized recommendation via parameter-free contextual bandits." Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2015.
How to Balance the Tradeoff
§ Performance is mainly determined by the tradeoff. Existing algorithms find the tradeoff through user input parameters and data characteristics (e.g., the variance of the estimated reward).
§ Existing algorithms are all parameter-sensitive.
[Figure: an algorithm performs well when its parameter is good and badly when its parameter is bad.]
Chicken-and-Egg Problem for Existing Bandit Algorithms
§ Why do we use bandit algorithms?
§ To solve the cold-start problem (not enough data for estimating user preferences).
§ How do we find the best input parameters?
§ Tune the parameters online or offline.
But if you already have the data or the online traffic to tune the parameters, why do you need bandit algorithms?
Our Work
§ Parameter-free: it finds the tradeoff from data characteristics automatically.
§ Robust: existing algorithms can perform very badly if the input parameter is not appropriate.
Solution
§ Thompson Sampling:
§ Randomly select a model coefficient vector from the posterior distribution and find the "best" item.
§ The prior is the input parameter for computing the posterior.
§ Non-Bayesian Thompson Sampling (our solution):
§ Randomly select a bootstrap sample, find the MLE of the model coefficients, and find the "best" item.
§ Bootstrapping has no input parameter.
Bootstrap Bandit Algorithm
Input: a feature vector x of the context.
Algorithm:
if each article has sufficient observations then {
    for each article i = 1, …, k:
        i.  Di ← randomly sample nk impressions of article i with replacement  // generate a bootstrap sample
        ii. θi ← MLE coefficients on Di  // model estimation on the bootstrap sample
    select the article i* = argmax f(x, θi), i = 1, …, k, to show
} else {
    randomly select an article that does not have sufficient observations to show
}
f(x, θi) is the prediction function.
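The bootstrap selection step can be sketched as below. This is an illustrative skeleton only: `predict` and `fit_mle` are hypothetical stand-ins for the prediction function f(x, θ) and the MLE fit, and the toy data in the usage are invented.

```python
import random

def bootstrap_select(x, data, n_min, predict, fit_mle, rng):
    """Sketch of one bootstrap-bandit decision.

    data[i] holds the (features, reward) observations of article i;
    predict(x, theta) and fit_mle(sample) are hypothetical stand-ins."""
    cold = [i for i, obs in enumerate(data) if len(obs) < n_min]
    if cold:
        # some article lacks sufficient observations: explore one of them
        return rng.choice(cold)
    best_i, best_score = 0, float("-inf")
    for i, obs in enumerate(data):
        boot = [rng.choice(obs) for _ in range(len(obs))]  # sample with replacement
        theta = fit_mle(boot)                              # MLE on the bootstrap sample
        score = predict(x, theta)                          # f(x, theta_i)
        if score > best_score:
            best_i, best_score = i, score
    return best_i

# Hypothetical usage: theta is simply the mean reward, and predict returns it.
mean_reward = lambda obs: sum(r for _, r in obs) / len(obs)
rng = random.Random(7)
data = [[(None, 0.0)] * 10, [(None, 1.0)] * 10]   # article 1 is always clicked
picked = bootstrap_select(None, data, 5, lambda x, th: th, mean_reward, rng)
```

Resampling injects the randomness that the Bayesian posterior draw provides in standard Thompson sampling, with no prior to choose.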
Online Bootstrap Bandits
§ Why online bootstrap?
§ It is inefficient to generate a bootstrap sample for each recommendation.
§ How to bootstrap online?
§ Keep the coefficients estimated from each bootstrap sample in memory.
§ No need to keep the bootstrap samples themselves in memory.
§ When a new data point arrives, incrementally update the estimated coefficients for each bootstrap sample [1].
[1] N. C. Oza and S. Russell. Online bagging and boosting. In IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages 2340-2345, 2005.
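The online-bagging trick of Oza & Russell replaces explicit resampling: each kept replicate absorbs a new example a Poisson(1)-distributed number of times, which mimics sampling with replacement as the stream grows. A sketch, where `update_fn` is a hypothetical incremental estimator update and the running-mean usage is invented for illustration:

```python
import math
import random

def poisson1(rng):
    """Draw from Poisson(1) by inversion (Knuth's multiplicative method)."""
    L = math.exp(-1.0)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def online_bootstrap_update(coeffs, update_fn, example, rng):
    """Online bagging: instead of storing bootstrap samples, update each
    kept coefficient vector Poisson(1) times with the new example."""
    for b in range(len(coeffs)):
        for _ in range(poisson1(rng)):
            coeffs[b] = update_fn(coeffs[b], example)
    return coeffs

# Hypothetical usage: each replicate keeps a running mean of rewards.
def running_mean(state, x):
    m, n = state
    return (m + (x - m) / (n + 1), n + 1)

rng = random.Random(0)
replicates = [(0.0, 0)] * 5          # five bootstrap replicates, (mean, count)
for x in [1.0, 0.0, 1.0, 1.0]:
    replicates = online_bootstrap_update(replicates, running_mean, x, rng)
```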
Experiment Data
§ Two public datasets:
§ News recommendation data (Yahoo! Today News):
§ News displayed on the Yahoo! front page from Oct. 2nd, 2011 to Oct. 16th, 2011.
§ 28,041,015 user visit events.
§ A 136-dimensional feature vector for each event.
§ Online advertising data (KDD Cup 2012, Track 2):
§ The dataset was collected by a search engine and published for KDD Cup 2012.
§ 1 million user visit events.
§ A 1,070,866-dimensional context feature vector.
Offline Evaluation Metric and Methods
§ Setup: overall CTR (average reward of a trial).
§ Evaluation methods:
§ The experiment on Yahoo! Today News is evaluated with the replay method [1].
§ The reward on the KDD Cup 2012 ad data is simulated with a weight vector for each ad [2].
[1] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM, pages 297-306, 2011.
[2] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In NIPS, pages 2249-2257, 2011.
Experimental Methods
§ Our method:
§ Bootstrap(B), where B is the number of bootstrap samples.
§ Baselines:
§ Random: randomly selects an arm to pull.
§ Exploit: only considers exploitation, without exploration.
§ ε-greedy(ε): ε is the probability of exploration.
§ LinUCB(α): pulls the arm with the largest score, as defined by the parameter α.
§ TS(q0): Thompson sampling with logistic regression, where q0⁻¹ is the prior variance and 0 is the prior mean.
§ TSNR(q0): similar to TS(q0), but the logistic regression is not regularized by the prior.
Experiment (Yahoo! News Data)
§ All numbers are relative to the random model.
Experiment (Ads, KDD Cup '12)
§ All numbers are relative to the random model.
CTR over Time Buckets (Yahoo! News Data)
CTR over Time Buckets (KDD Cup Ads Data)
Efficiency
§ Time cost for different bootstrap sample sizes.
Summary of Experiments
§ For solving the contextual bandit problem, the ε-greedy and LinUCB algorithms can achieve optimal performance, but the input parameters that control exploration need to be tuned carefully.
§ The probability matching strategies depend heavily on the selection of the prior.
§ Our proposed algorithm is a safe choice for building predictive models for contextual bandit problems in the cold-start scenario.
Conclusion
§ We propose a non-Bayesian Thompson Sampling method to solve the personalized recommendation problem.
§ We give both theoretical and empirical analyses showing that the performance of Thompson sampling depends on the choice of the prior.
§ We conduct extensive experiments on real datasets to demonstrate the efficacy of the proposed method and other contextual bandit algorithms.
Future Work
§ MAB with similarity information
§ MAB in a changing environment
§ Explore-exploit tradeoff in mechanism design
§ Explore-exploit learning with limited resources
§ Risk vs. reward tradeoff in MAB
[1] http://research.microsoft.com/en-us/projects/bandits/
Question and Answer
Thanks!