Thompson sampling for web optimisation
29 Jan 2016
David S. Leslie
Plan
• Contextual bandits on the web
• Thompson sampling in bandits
• Selecting multiple adverts
Optimising a web server
Contextual bandits . . .
• Receive state signal xt
• Select at from a finite set of actions A
• Rewards stationary over time, but depend on both xt and at

rt = r(xt, at) + εt
. . . on the web
Natural solution method
rt = r(xt, at) + εt

• For each a ∈ A estimate the function r(·, a) of x using some statistical procedure
• When xt is presented, calculate r̂t(xt, a), or more fully the posterior p(r(xt, a) | Ht), for each a and select an action

Objective
Maximise average reward, minimise regret, select “correct” actions eventually
Simple bandits
• No state signal in the simplest case: just two actions, L and R
• Rewards stationary over time and depend only on at

rt = r(at) + εt

• Estimate r(L) and r(R) using very simple statistics
• On trial t, calculate p(r(a) | Ht) for each a and select an action
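This loop can be sketched for Bernoulli (click/no-click) rewards with Beta posteriors; the reward model and rates here are illustrative assumptions, since the slides leave the statistical model unspecified:

```python
import random

# Beta(1, 1) priors over r(L) and r(R); posteriors stay Beta under
# Bernoulli rewards, so "very simple statistics" = two counters per arm.
posterior = {"L": [1, 1], "R": [1, 1]}  # [alpha, beta] parameters
true_rate = {"L": 0.4, "R": 0.6}        # hypothetical click rates

def select_action():
    # Thompson sampling: draw one value per arm from its posterior,
    # play the arm with the largest draw.
    samples = {a: random.betavariate(s, f) for a, (s, f) in posterior.items()}
    return max(samples, key=samples.get)

random.seed(0)
for t in range(5000):
    a = select_action()
    reward = 1 if random.random() < true_rate[a] else 0
    posterior[a][0] += reward
    posterior[a][1] += 1 - reward

# The better arm R should dominate the play counts.
plays = {a: s + f - 2 for a, (s, f) in posterior.items()}
print(plays["R"] > plays["L"])
```

Note that exploration falls away automatically as the posteriors concentrate; no schedule needs tuning.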
Solution methods
Full Bayesian decision theory (Gittins indices etc)
• Beautiful optimality theory
• Action selected optimises the true objective
• Marginalises over all possible future outcomes
• Impossible to use in all but the simplest settings
Alternative approach
Heuristics to balance exploration and exploitation. Often involve randomisation
Undirected action selection
Select based purely on expected values r̂t(a)

Greedy: Action at maximises r̂t(a)

ε-greedy: Select the greedy action with prob 1 − ε, otherwise explore a random action

Softmax: P(at = a | Ht) ∝ exp{r̂t(a)/τ}
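The three undirected rules can be sketched as follows; the point estimates r̂t(a) are hypothetical numbers:

```python
import math
import random

r_hat = {"L": 0.52, "R": 0.48}  # hypothetical point estimates r̂_t(a)

def greedy(r_hat):
    # Always play the action with the highest point estimate.
    return max(r_hat, key=r_hat.get)

def epsilon_greedy(r_hat, eps=0.1):
    if random.random() < eps:
        return random.choice(list(r_hat))   # explore uniformly at random
    return greedy(r_hat)                    # otherwise exploit

def softmax(r_hat, tau=0.1):
    # P(a) ∝ exp(r̂(a) / τ); smaller τ moves the rule closer to greedy.
    weights = {a: math.exp(v / tau) for a, v in r_hat.items()}
    total = sum(weights.values())
    u, acc = random.random() * total, 0.0
    for a, w in weights.items():
        acc += w
        if u <= acc:
            return a
    return a

random.seed(1)
print(greedy(r_hat))
```

All three rules look only at the means r̂t(a), which is exactly the limitation the next slide illustrates.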
Spot the difference!
[Figure: two panels, each showing posterior densities p(r | H) over r ∈ [0, 30] for the red and blue actions]
Solid lines are posterior densities of the expected reward for the red/blue actions. Dashed lines are the means of these distributions. Undirected methods treat the left and right panels identically.
Myopic action selection
Give up on full optimality. Heuristics, usually using more than just r̂t(a), to explore ‘sensibly’

Optimism in the face of uncertainty: create a confidence interval for each action, select the action with the highest “top” of CI.

Thompson sampling: sample a value from the posterior for each action, select the action with the highest sample.

Main idea
CI and posterior both narrow as more data have been observed for that action: exploration is more likely for less-visited actions.
Thompson sampling properties
Posteriors over action values → Thompson sampling → Probabilistic action selection
P(at = a | Ht) = P(r(a) is maximal | Ht)

Proof idea:
• Let Qt(a) ∼ p(r(a) | Ht)
• {at = a} = {Qt(a) > Qt(b) ∀ b ≠ a}
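The identity can be checked numerically: the frequency with which each action wins the Thompson draw estimates P(at = a | Ht), which by construction is the posterior probability that r(a) is maximal. The Gaussian posteriors and their parameters here are illustrative assumptions:

```python
import random

# Hypothetical Gaussian posteriors p(r(a) | H_t): (mean, sd) per action.
posteriors = {"a": (0.5, 0.20), "b": (0.6, 0.05), "c": (0.4, 0.30)}

random.seed(2)
counts = {a: 0 for a in posteriors}
n = 200_000
for _ in range(n):
    # One Thompson draw per action; the selected action is the argmax.
    draws = {a: random.gauss(mu, sd) for a, (mu, sd) in posteriors.items()}
    counts[max(draws, key=draws.get)] += 1

# Selection frequencies estimate P(a_t = a | H_t) = P(r(a) maximal | H_t).
freqs = {a: c / n for a, c in counts.items()}
print(max(freqs, key=freqs.get))
```

Action b, with the highest posterior mean, is selected most often, but a and c still receive a share proportional to their posterior chance of being best.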
Suboptimal actions with high uncertainty are selected withlarger probability than those with low uncertainty
[Figure: two panels of posterior densities p(r | H) over r ∈ [0, 30] for the red and blue actions]
Fixed posteriors for unplayed actions ⇒ infinite exploration

Proof idea:
• Suppose L is played only finitely often ⇒
• posterior for r(L) freezes
• R played infinitely often, and posterior for r(R) converges
• so sampled values for R converge to r(R)
• So prob of playing L is bounded below
• So ∑t P(at = L | Ht) = ∞, contradicting the assumption (Borel–Cantelli)
Asymptotic average reward is maxa r(a)

Proof idea:
• Infinite exploration ⇒ posteriors converge to r(a)
• For all large t, sampled values for a are close to r(a) with high probability
• ∀ε > 0, prob of selecting the best is larger than 1 − ε for large t
• Coupling argument ⇒ average reward converges to maxa r(a)
Theory
May, Korda, Lee and DL, JMLR 2012

Theorem
In contextual bandit problems with stationary reward functions r(x, a), if Thompson sampling is used then

∑_{t=1}^T r(xt, at) / ∑_{t=1}^T maxa r(xt, a) → 1 as T → ∞

(In English: the average reward is as good as it could be)

Cleverer theory: finite-time regret properties, in more restricted settings (see Korda, Agrawal and others)
A problem
• Let Qt(a) ∼ p(r(a) | Ht) be the sampled value for action a
• Decompose as Qt(a) = r̂t(a) + exploratory bonus
• Thompson sampling can give negative exploratory bonuses?!
[Figure: two panels of posterior densities p(r | H) over r ∈ [0, 30]; in one panel the optimal action has high posterior variance]
Reduced probability of selecting high variance optimal actions
Optimistic Bayesian Sampling
May, Korda, Lee and DL, JMLR 2012

• Let Qt(a) ∼ p(r(a) | Ht) be the sampled value for action a
• Set Qt^OBS(a) = max{Qt(a), r̂t(a)}
• Select the action to maximise Qt^OBS
All proofs go through as before
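The OBS tweak is one line on top of ordinary Thompson sampling. Gaussian posteriors and the particular means and variances here are illustrative assumptions:

```python
import random

# Hypothetical posterior (mean, sd) per action; "left" has high variance.
posterior = {"left": (0.50, 0.15), "right": (0.55, 0.02)}

def obs_select(posterior, rng):
    q_obs = {}
    for a, (mean, sd) in posterior.items():
        q = rng.gauss(mean, sd)          # Thompson draw Q_t(a)
        q_obs[a] = max(q, mean)          # OBS: never below the point estimate
    return max(q_obs, key=q_obs.get)

rng = random.Random(3)
picks = [obs_select(posterior, rng) for _ in range(10_000)]
# The exploratory bonus max{Q_t(a), r̂_t(a)} − r̂_t(a) is now non-negative,
# so high-variance actions are never penalised by unlucky draws.
print(picks.count("right") > picks.count("left"))
```

The only change from plain Thompson sampling is the `max(q, mean)` line, which removes the negative exploratory bonus.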
Emergent software
with Barry Porter and Matthew Grieves

[Component diagram of the web server:
• App (interface), implemented by WebServer. Main method: opens a server socket and accepts client connections, each of which is passed to a request handler.
• RequestHandler (interface), implemented by RequestHandler (thread pool implementation) and RequestHandlerPT (thread per client implementation). Takes a client socket, applies a concurrency approach, and passes the socket on to the HTTP handler.
• HTTPHandler (interface), implemented by HTTPHandler (without caching or compression), HTTPHandlerCMP (with compression), HTTPHandlerCH (with caching) and HTTPHandlerCHCMP (with caching and compression). Takes a client socket, parses HTTP request headers and formulates a response.
• Compressor (interface), implemented by GZip and Deflate.
• Cache (interface), implemented by Cache, CacheLFU, CacheLRU, CacheFS, CacheMRU and CacheRR.]
Emergent software
with Barry Porter and Matthew Grieves

• Each component of the server can be provided by several implementations: 42 different valid configurations
• Configurations perform well under different traffic scenarios
• Learn to use the best configuration

Framework:
Every 10 seconds, try a configuration, observe performance

Uh oh: trying each configuration only once takes 7 minutes...
Regression model
similar approach to Scott (2010)

Each component corresponds to a factor variable:

ResponseTime ∼ RequestHandler + HTTPHandler + Compressor + Cache

A configuration conf corresponds to a binary vector xconf. The expected response time for deploying conf is given by

xconf β

where β is unknown.

Only 11 regression coefficients
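A toy dummy coding of xconf that reproduces the 11 coefficients; the exact coding used in the paper is an assumption here, with component names taken from the diagram:

```python
# Factor levels for each component (names from the component diagram).
levels = {
    "RequestHandler": ["RequestHandler", "RequestHandlerPT"],
    "HTTPHandler": ["HTTPHandler", "HTTPHandlerCMP",
                    "HTTPHandlerCH", "HTTPHandlerCHCMP"],
    "Compressor": ["GZip", "Deflate"],
    "Cache": ["Cache", "CacheLFU", "CacheLRU",
              "CacheFS", "CacheMRU", "CacheRR"],
}

def encode(conf):
    """Treatment-code a configuration: intercept plus one 0/1 indicator
    per non-baseline level of each factor."""
    x = [1.0]  # intercept
    for factor, lvls in levels.items():
        for lvl in lvls[1:]:                # first level is the baseline
            x.append(1.0 if conf[factor] == lvl else 0.0)
    return x

conf = {"RequestHandler": "RequestHandlerPT", "HTTPHandler": "HTTPHandlerCMP",
        "Compressor": "Deflate", "Cache": "Cache"}
x = encode(conf)
print(len(x))  # intercept + 1 + 3 + 1 + 5 dummies = 11 coefficients
```

Under this coding the 42 configurations share 11 parameters, so observing one configuration informs the predictions for many others.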
Iterative decision-making
In each 10 second slot:
• Choose an action based on the fitted model
• Observe the outcome
• Add the observation to the pool of data
• Update the statistical model
Challenge
Need to manage explore–exploit, as in simple bandits
Thompson sampling
Thompson sampling implementation:
Use Bayesian linear regression. Then for each t:
• sample a βTh from the posterior at time t
• deploy the conf which maximises xconf βTh
That’s it!
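One slot's decision can be sketched with a conjugate Gaussian posterior β | data ∼ N(m, S) under known noise variance; the priors, the toy data, and the use of a performance score (e.g. reciprocal response time) as the reward are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_update(X, y, noise_var=1.0, prior_var=100.0):
    """Conjugate Bayesian linear regression with prior β ~ N(0, prior_var·I):
    returns the posterior mean m and covariance S of β."""
    d = X.shape[1]
    S_inv = np.eye(d) / prior_var + X.T @ X / noise_var
    S = np.linalg.inv(S_inv)
    m = S @ (X.T @ y) / noise_var
    return m, S

def thompson_choose(configs, m, S):
    """Sample β^Th ~ N(m, S) and deploy the configuration that
    maximises the sampled linear score x_conf β^Th."""
    beta = rng.multivariate_normal(m, S)
    return int(np.argmax(configs @ beta))

# Toy data: 3 one-hot configurations and a few past observations,
# with y a hypothetical performance score (higher is better).
configs = np.eye(3)
X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]], dtype=float)
y = np.array([0.20, 0.50, 0.12, 0.48])
m, S = posterior_update(X, y)
choice = thompson_choose(configs, m, S)
print(choice in (0, 1, 2))
```

Each 10-second slot then repeats: observe the outcome, append a row to (X, y), refit, and sample again.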
Initial results
Repeatedly requesting a small text file

“Loss” is the difference between the reciprocal of the optimal response time at that instant and the reciprocal of the actual response time
Changing request patterns
Low/High Text and Low/High Entropy
Different configurations are better for different request patterns
Changing request patterns
Alternating traffic characteristics

The request pattern alternates, switching every 10 iterations. Poor performance.
Using context
Coding the context

At the end of iteration t, categorise the traffic as HighEnt/LowEnt and as HighText/LowText.

Include Ent and Text as factors in the regression. Also the interactions Ent:Cache and Text:Compressor.
Performance under different traffic characteristics is learned
Using context
Decision-making

Thompson sampling implementation:

Use Bayesian linear regression. Then for each t:
• sample a βTh from the posterior at time t
• deploy the conf which maximises ((Entt−1, Textt−1) ? xconf) βTh

This makes the working assumption that (Entt, Textt) = (Entt−1, Textt−1)
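One plausible way to build the augmented feature vector, with context main effects plus the Ent:Cache and Text:Compressor interactions; the coding is a guess, since the slides do not spell it out:

```python
def encode_with_context(x_cache, x_comp, ent, text):
    """Append context factors and interactions to configuration dummies.
    ent/text are 0/1 indicators for HighEnt/HighText traffic observed in
    the previous slot (the working assumption that the pattern persists)."""
    x = [1.0, float(ent), float(text)]         # intercept + main effects
    x += x_cache + x_comp                      # configuration dummies
    x += [ent * c for c in x_cache]            # Ent:Cache interaction
    x += [text * c for c in x_comp]            # Text:Compressor interaction
    return x

# Toy dummies: 2 cache indicators, 1 compressor indicator.
x = encode_with_context([1.0, 0.0], [1.0], ent=1, text=0)
print(len(x))  # 3 + 3 + 2 + 1 = 9 features in this toy coding
```

The interaction columns let the fitted model give each cache and compressor a different effect under each traffic type, which is exactly what lets performance under different characteristics be learned.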
Using context
Results

The request pattern alternates, switching every 10 iterations. Good performance.
Conclusion
• Contextual bandits and Thompson sampling: simple and (provably and empirically) effective
• Optimistic Bayesian sampling: removes the negative exploratory bonus
• Extremely simple to deploy in more complicated settings
• Basic statistical approaches are a revelation to (some) ‘Data Scientists’
Backup slides
Copify
With G Malhotra, W Simm and R McVey

• Marketplace matching copywriting jobs with authors
• Copywriters select from the (ever-changing) available jobs
A Copify brief
The writer’s view
Copify’s challenge
The brief
Offer appropriate jobs to a writer when they log in

Main differentiating features:
Jobs: a relatively small amount of free text
Writers: history of jobs accepted/declined

Challenges include:
• only light computation is allowed
• zero to moderate data per writer
• each job is completed by only one writer
• a different set of available jobs on each login
Encoding a brief
Whenever a job arrives, it is coded into a regression vector x, consisting of:
• price
• reported topic category
• (SVD-compressed) ‘bag of semantic topics’ counts
Learning writer preferences
For each writer w , we know• which briefs they have been shown• which briefs they have accepted
Simple logistic regression to estimate writer ‘preferences’ β̂w and covariance Σw = var(β̂w). Updated each night for each writer.

If there are insufficient data (< 20 previous jobs), set β̂w and Σw to a globally-estimated version with inflated covariance
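The fallback rule fits in a few lines; the inflation factor is a made-up number for illustration, as the slides do not state one:

```python
import numpy as np

def writer_posterior(beta_hat_w, sigma_w, n_jobs,
                     beta_global, sigma_global, inflate=4.0, min_jobs=20):
    """Return the (mean, covariance) used for writer w: their own logistic
    fit if they have enough history, else the global estimate with an
    inflated covariance so the system keeps exploring for new writers."""
    if n_jobs < min_jobs:
        return beta_global, inflate * sigma_global
    return beta_hat_w, sigma_w

# A new writer with only 5 past jobs falls back to the global estimate.
beta_g, sigma_g = np.zeros(3), np.eye(3)
mean, cov = writer_posterior(None, None, n_jobs=5,
                             beta_global=beta_g, sigma_global=sigma_g)
print(np.allclose(cov, 4.0 * np.eye(3)))
```

Inflating the covariance widens the Thompson draws, so under-observed writers get more exploratory job orderings until their own history accumulates.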
Displaying jobs
On page load, there are jobs j = 1, . . . , J waiting to be accepted
Thompson sampling principle:
System selects job j with the probability that job j is the best
Implementation in regression framework:
• sample βTSw ∼ N(β̂w, Σw)
• select argmaxj xjβTSw

Optimistic version: replace xjβTSw with max{xjβTSw, xjβ̂w}
Displaying jobs

On page load, there are jobs j = 1, . . . , J waiting to be accepted

Thompson sampling principle:

System selects job j with the probability that job j is the best

Implementation in regression framework:
• sample βTSw ∼ N(β̂w, Σw)
• rank jobs according to xjβTSw

Optimistic version: replace xjβTSw with max{xjβTSw, xjβ̂w}
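Both the ranking step and the optimistic variant fit in a few lines; the job features and coefficients here are toy values:

```python
import numpy as np

rng = np.random.default_rng(7)

def rank_jobs(X_jobs, beta_hat, sigma, optimistic=False):
    """Thompson-sample β^TS ~ N(β̂, Σ) and rank jobs by sampled score;
    the optimistic variant floors each score at the point estimate."""
    beta_ts = rng.multivariate_normal(beta_hat, sigma)
    scores = X_jobs @ beta_ts
    if optimistic:
        scores = np.maximum(scores, X_jobs @ beta_hat)
    return np.argsort(-scores)      # best job first

X_jobs = np.array([[1.0, 0.2], [1.0, 0.9], [1.0, 0.5]])  # toy job features
beta_hat = np.array([0.0, 1.0])   # this writer strongly likes feature 2
sigma = 0.01 * np.eye(2)
order = rank_jobs(X_jobs, beta_hat, sigma, optimistic=True)
print(order[0])
```

Ranking rather than selecting a single job fits the interface: the writer sees the whole ordered list, and one posterior draw per page load keeps the computation light.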
Effectiveness
The new brief is ranked highly. It is for a blog post about fantasy football. This writer has completed many tasks to do with football. The editorial team also know the writer to be “football mad”.
Effectiveness
Hopefully some performance stats