
Thompson sampling for web optimisation

29 Jan 2016

David S. Leslie

Plan

• Contextual bandits on the web
• Thompson sampling in bandits
• Selecting multiple adverts
• Optimising a web server

Contextual bandits . . .

• Receive state signal $x_t$
• Select $a_t$ from a finite set of actions $A$
• Rewards stationary over time, but depend on both $x_t$ and $a_t$

$r_t = r(x_t, a_t) + \varepsilon_t$

. . . on the web

Natural solution method


• For each $a \in A$ estimate the function $r(\cdot, a)$ of $x$ using some statistical procedure
• When $x_t$ is presented, calculate the estimate $\hat r_t(x_t, a)$, or the posterior $p(r(x_t, a) \mid H_t)$, for each $a$ and select an action

Objective

Maximise average reward, minimise regret, select “correct” actions eventually


Simple bandits

Two actions, L and R, and no state signal $x_t$

• Finite set of actions $a \in A$
• Rewards stationary over time, and depend only on $a_t$

$r_t = r(a_t) + \varepsilon_t$

• Estimate $r(L)$ and $r(R)$ using very simple statistics
• On trial $t$, calculate $p(r(a) \mid H_t)$ for each $a$ and select an action

Solution methods

Full Bayesian decision theory (Gittins indices etc)

• Beautiful optimality theory
• Action selected optimises the true objective
• Marginalises over all possible future outcomes
• Impossible to use in all but the simplest settings

Alternative approach

Heuristics to balance exploration and exploitation. Often involve randomisation

Undirected action selection

Select based purely on expected values $\hat r_t(a)$

Greedy: action $a_t$ maximises $\hat r_t(a)$

ε-greedy: select the greedy action with probability $1 - \varepsilon$, otherwise explore a random action

Softmax: $P(a_t = a \mid H_t) \propto \exp\{\hat r_t(a)/\tau\}$
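The three undirected rules can be sketched in a few lines of Python. This is a minimal illustration, assuming the estimates are held in a plain list indexed by action and that "a random action" means a uniform draw over all actions; the function names are hypothetical.

```python
import math
import random


def greedy(estimates):
    """Index of the action with the largest estimated reward."""
    return max(range(len(estimates)), key=lambda a: estimates[a])


def epsilon_greedy(estimates, epsilon, rng=random):
    """Greedy action with probability 1 - epsilon, otherwise a uniform random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return greedy(estimates)


def softmax_probs(estimates, tau):
    """Selection probabilities proportional to exp(estimate / tau)."""
    top = max(estimates)  # subtracting the max leaves the ratios unchanged
    weights = [math.exp((r - top) / tau) for r in estimates]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_select(estimates, tau, rng=random):
    """Draw one action from the softmax distribution."""
    actions = range(len(estimates))
    return rng.choices(actions, weights=softmax_probs(estimates, tau))[0]
```

None of these functions takes a measure of uncertainty as input, which is exactly why they are called undirected.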

Spot the difference!

[Figure: two panels, each plotting the posterior density $p(r \mid H)$ against $r$ (from 0 to 30) for a red and a blue action.]

Solid lines are the posterior densities of the expected reward for the red and blue actions. Dashed lines are the means of these distributions. Undirected methods treat the left and right panels identically.

Myopic action selection

Give up on full optimality. Heuristics, usually using more than just $\hat r_t(a)$, to explore ‘sensibly’

Optimism in face of uncertainty: create confidence intervals for each action, select the action with the highest “top” of CI.

Thompson sampling: sample a value from the posterior for each action, select the action with the highest sample

Main idea
CI and posterior both narrow as more data have been observed for that action: exploration is more likely for less-visited actions.
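Both heuristics can be sketched as follows, assuming each action's posterior is summarised as a Normal (mean, standard deviation) pair. The posterior values, the interval width of 2 standard deviations and the function names are illustrative assumptions, not taken from the slides.

```python
import random


def optimistic_choice(posteriors, width=2.0):
    """Pick the action whose interval mean + width * sd has the highest top."""
    return max(posteriors, key=lambda a: posteriors[a][0] + width * posteriors[a][1])


def thompson_choice(posteriors, rng=random):
    """Draw one value per action from its posterior and pick the largest draw."""
    draws = {a: rng.gauss(mean, sd) for a, (mean, sd) in posteriors.items()}
    return max(draws, key=draws.get)


# Hypothetical posteriors: L has the lower mean but is far less certain.
posteriors = {"L": (14.0, 4.0), "R": (15.0, 1.0)}

rng = random.Random(0)
picks = [thompson_choice(posteriors, rng) for _ in range(10000)]
share_of_L = picks.count("L") / len(picks)
```

Here the optimistic rule always plays L because its interval reaches higher, while Thompson sampling plays L on a substantial fraction of trials, a fraction that shrinks as the posterior for L narrows.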

Thompson sampling properties

Posteriors over action values → Thompson sampling → Probabilistic action selection

$P(a_t = a \mid H_t) = P(r(a) \text{ is maximal} \mid H_t)$

Proof idea:
• Let $Q_t(a) \sim p(r(a) \mid H_t)$
• $\{a_t = a\} = \{Q_t(a) > Q_t(b) \ \forall b \neq a\}$

Suboptimal actions with high uncertainty are selected with larger probability than those with low uncertainty
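The selection probability $P(r(a) \text{ is maximal} \mid H_t)$ can be estimated by repeating the Thompson draw many times. The sketch below assumes Normal posteriors with made-up means and standard deviations; it compares a suboptimal action with low uncertainty against the same action with high uncertainty.

```python
import random


def selection_probabilities(posteriors, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(action has the largest sampled value)."""
    rng = random.Random(seed)
    wins = {a: 0 for a in posteriors}
    for _ in range(n_draws):
        draws = {a: rng.gauss(mean, sd) for a, (mean, sd) in posteriors.items()}
        wins[max(draws, key=draws.get)] += 1
    return {a: count / n_draws for a, count in wins.items()}


# Hypothetical posteriors as (mean, sd); only the sd of "other" differs.
narrow = selection_probabilities({"best": (15.0, 1.0), "other": (12.0, 1.0)})
wide = selection_probabilities({"best": (15.0, 1.0), "other": (12.0, 4.0)})
```

With these values `wide["other"]` comes out well above `narrow["other"]`, matching the statement that uncertain suboptimal actions are explored more often.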

[Figure: two panels, each plotting the posterior density $p(r \mid H)$ against $r$ (from 0 to 30).]




Fixed posteriors for unplayed actions ⇒ infinite exploration

Proof idea:
• Suppose L is only played finitely often. Then:
  • the posterior for $r(L)$ freezes
  • R is played infinitely often, and the posterior for $r(R)$ converges
  • so sampled values for R converge to $r(R)$
• So the probability of playing L is bounded below
• So $\sum_t P(a_t = L \mid H_t) = \infty$, which contradicts L being played only finitely often (Borel–Cantelli)




Asymptotic average reward is $\max_a r(a)$

Proof idea:
• Infinite exploration ⇒ posteriors converge to $r(a)$
• For all large $t$, sampled values for $a$ are close to $r(a)$ with high probability
• $\forall \varepsilon > 0$, the probability of selecting the best action is larger than $1 - \varepsilon$ for large $t$
• Coupling argument ⇒ average reward converges to $\max_a r(a)$

Theory
May, Korda, Lee and DL, JMLR 2012

Theorem
In contextual bandit problems with stationary reward functions $r(x, a)$, if Thompson sampling is used then

$$\lim_{T \to \infty} \frac{\sum_{t=1}^{T} r(x_t, a_t)}{\sum_{t=1}^{T} \max_a r(x_t, a)} = 1$$

(In English: The average reward is as good as it could be)

Cleverer theory: finite time regret properties, in more restricted settings (see Korda, Agrawal and others)
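The ratio in the theorem can be watched numerically in the simplest special case: a two-action bandit with no context. The sketch below is an illustration, not a check of the theorem. It assumes Gaussian noise with known standard deviation, an independent N(0, prior_sd²) prior on each action value, and made-up true rewards.

```python
import random


def run_thompson(true_means, rounds, noise_sd=1.0, prior_sd=10.0, seed=0):
    """Gaussian Thompson sampling on a simple bandit; returns the reward ratio."""
    rng = random.Random(seed)
    counts = {a: 0 for a in true_means}
    totals = {a: 0.0 for a in true_means}
    earned = 0.0
    for _ in range(rounds):
        draws = {}
        for a in true_means:
            # Conjugate Normal update with known noise variance.
            precision = 1.0 / prior_sd**2 + counts[a] / noise_sd**2
            mean = (totals[a] / noise_sd**2) / precision
            draws[a] = rng.gauss(mean, precision ** -0.5)
        action = max(draws, key=draws.get)
        reward = true_means[action] + rng.gauss(0.0, noise_sd)
        counts[action] += 1
        totals[action] += reward
        earned += true_means[action]  # expected reward r(a_t), as in the theorem
    return earned / (rounds * max(true_means.values()))


ratio = run_thompson({"L": 1.0, "R": 2.0}, rounds=5000)
```

The returned ratio is the numerator of the theorem divided by its denominator after 5000 rounds; longer runs should push it closer to 1.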

A problem

• Let $Q_t(a) \sim p(r(a) \mid H_t)$ be the sampled value for action $a$
• Decompose as $Q_t(a) = \hat r_t(a) + \text{Exploratory bonus}$
• Thompson sampling gives negative exploratory bonuses (?)

[Figure: two panels, each plotting the posterior density $p(r \mid H)$ against $r$ (from 0 to 30).]

Reduced probability of selecting high variance optimal actions

Optimistic Bayesian Sampling
May, Korda, Lee and DL, JMLR 2012

• Let $Q_t(a) \sim p(r(a) \mid H_t)$ be the sampled value for action $a$
• Set $Q_t^{OBS}(a) = \max\{Q_t(a), \hat r_t(a)\}$
• Select the action to maximise $Q_t^{OBS}$

All proofs go through as before
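A minimal sketch of the optimistic rule, assuming Normal posteriors given as (mean, standard deviation) pairs and taking the posterior mean as $\hat r_t(a)$; the function name and data layout are hypothetical.

```python
import random


def obs_choice(posteriors, rng=random):
    """Optimistic Bayesian sampling: score each action by max(draw, mean)."""
    scores = {}
    for action, (mean, sd) in posteriors.items():
        sample = rng.gauss(mean, sd)        # Q_t(a), an ordinary Thompson draw
        scores[action] = max(sample, mean)  # Q_t^OBS(a), never below the mean
    return max(scores, key=scores.get)
```

Because the score is never below the posterior mean, the exploratory bonus is never negative, so an unlucky low draw cannot push a high-variance action below its own estimate.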

Emergent software
with Barry Porter and Matthew Grieves

[Figure: component diagram of the web server. Each interface can be provided by several implementations.]

• App <interface>: WebServer. Main method: opens a server socket and accepts client connections, each of which is passed to a request handler.
• RequestHandler <interface>: RequestHandler, RequestHandlerPT (a thread pool implementation and a thread per client implementation). Takes a client socket, applies a concurrency approach, and passes the socket on to the HTTP handler.
• HTTPHandler <interface>: HTTPHandler (implementation without caching or compression), HTTPHandlerCMP (implementation with compression), HTTPHandlerCH (implementation with caching), HTTPHandlerCHCMP (implementation with caching and compression). Takes a client socket, parses HTTP request headers and formulates a response.
• Compressor <interface>: GZip, Deflate
• Cache <interface>: Cache, CacheLFU, CacheLRU, CacheFS, CacheMRU, CacheRR

Emergent software
with Barry Porter and Matthew Grieves

• Each component of the server can be provided by several implementations: 42 different valid configurations
• Configurations perform well under different traffic scenarios
• Learn to use the best configuration

Framework: every 10 seconds, try a configuration and observe its performance

Uh oh: trying each configuration only once takes 7 minutes. . .

Regression model
similar approach to Scott (2010)

Each component corresponds to a factor variable:

ResponseTime ∼ RequestHandler + HTTPHandler + Compressor + Cache

A configuration conf corresponds to a binary vector $x_{\mathrm{conf}}$.

The expected response time for deploying conf is given by

$x_{\mathrm{conf}}\beta$

where $\beta$ is unknown.

Only 11 regression coefficients
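The counts quoted on these slides can be reproduced by enumerating the component implementations named in the diagram. The sketch assumes that a compressor or cache is chosen only when the HTTP handler uses one, and it uses treatment (dummy) coding with the first level of each factor as baseline. That coding is an assumption that happens to give 11 coefficients; the slides do not state the exact encoding.

```python
from itertools import product

REQUEST_HANDLERS = ["RequestHandler", "RequestHandlerPT"]
HTTP_HANDLERS = ["HTTPHandler", "HTTPHandlerCMP", "HTTPHandlerCH", "HTTPHandlerCHCMP"]
COMPRESSORS = ["GZip", "Deflate"]
CACHES = ["Cache", "CacheLFU", "CacheLRU", "CacheFS", "CacheMRU", "CacheRR"]
USES_COMPRESSOR = {"HTTPHandlerCMP", "HTTPHandlerCHCMP"}
USES_CACHE = {"HTTPHandlerCH", "HTTPHandlerCHCMP"}
SLOT_SECONDS = 10  # one configuration is tried per 10 second slot


def valid_configurations():
    """All (request handler, HTTP handler, compressor, cache) combinations."""
    configs = []
    for rh, http in product(REQUEST_HANDLERS, HTTP_HANDLERS):
        compressors = COMPRESSORS if http in USES_COMPRESSOR else [None]
        caches = CACHES if http in USES_CACHE else [None]
        for cmp_, cache in product(compressors, caches):
            configs.append((rh, http, cmp_, cache))
    return configs


def encode(config):
    """Binary vector x_conf: intercept plus one dummy per non-baseline level."""
    rh, http, cmp_, cache = config
    x = [1]                                                     # intercept
    x += [int(rh == level) for level in REQUEST_HANDLERS[1:]]   # 1 dummy
    x += [int(http == level) for level in HTTP_HANDLERS[1:]]    # 3 dummies
    x += [int(cmp_ == level) for level in COMPRESSORS[1:]]      # 1 dummy
    x += [int(cache == level) for level in CACHES[1:]]          # 5 dummies
    return x


configs = valid_configurations()
minutes_for_one_pass = len(configs) * SLOT_SECONDS / 60
```

Under these assumptions the enumeration gives 42 configurations, each encoded by an 11-element vector, and one pass through all of them takes 7 minutes. Sharing 11 coefficients across 42 configurations is what lets the model learn about configurations it has not yet tried.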

Iterative decision-making

In each 10 second slot:

• Choose an action based on the fitted model
• Observe the outcome
• Add the observation to the pool of data
• Update the statistical model

Challenge

Need to manage explore–exploit, as in simple bandits

Thompson sampling

Thompson sampling implementation:

Use Bayesian linear regression. Then for each $t$:
• sample a $\beta^{Th}$ from the posterior at time $t$
• deploy the conf which maximises $x_{\mathrm{conf}}\beta^{Th}$

That’s it!
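The loop can be sketched without any external libraries. This is a sketch under stated assumptions, not the implementation used in the talk: it takes a Gaussian prior with identity-scaled precision, a known noise variance, and a response coded so that larger is better. The class and function names, the toy configurations and `true_beta` are all hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_lower(low, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i, row in enumerate(low):
        y.append((b[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y


def solve_upper(low, y):
    """Solve L^T x = y by back substitution."""
    n = len(low)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / low[i][i]
    return x


class BayesLinReg:
    """Conjugate Gaussian linear regression with known noise variance."""

    def __init__(self, dim, prior_precision=1.0, noise_var=1.0):
        self.precision = [[prior_precision if i == j else 0.0 for j in range(dim)]
                          for i in range(dim)]
        self.b = [0.0] * dim  # accumulates X^T y / noise_var
        self.noise_var = noise_var

    def update(self, x, y):
        """Add one observation (x, y) to the posterior."""
        for i, xi in enumerate(x):
            self.b[i] += xi * y / self.noise_var
            for j, xj in enumerate(x):
                self.precision[i][j] += xi * xj / self.noise_var

    def sample(self, rng):
        """Draw beta from the posterior N(mean, precision^-1)."""
        low = cholesky(self.precision)
        mean = solve_upper(low, solve_lower(low, self.b))
        z = [rng.gauss(0.0, 1.0) for _ in self.b]
        noise = solve_upper(low, z)  # has covariance precision^-1
        return [m + e for m, e in zip(mean, noise)]


def thompson_choose(model, configs, rng):
    """Sample beta once, then pick the x_conf with the largest x_conf . beta."""
    beta = model.sample(rng)
    return max(configs, key=lambda x: sum(xi * bi for xi, bi in zip(x, beta)))


rng = random.Random(1)
configs = [[1, 0, 0], [1, 1, 0], [1, 0, 1]]  # toy binary vectors
true_beta = [1.0, 0.5, -0.5]                 # hypothetical truth
model = BayesLinReg(dim=3)
for _ in range(200):
    x = thompson_choose(model, configs, rng)
    y = sum(xi * bi for xi, bi in zip(x, true_beta)) + rng.gauss(0.0, 0.3)
    model.update(x, y)
```

Each pass through the loop is one slot: choose, observe, add the observation, update. If the response is a time to be minimised, the same code applies after negating or otherwise transforming the observed value.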

Initial results
Repeatedly requesting a small text file

“Loss” is the difference between the reciprocal of the optimal response time at that instant, and the reciprocal of the actual response time
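Written out as code, the loss defined above is a one-line function. The argument names and the example numbers are hypothetical; the only requirement is that both times use the same units.

```python
def loss(optimal_response_time, actual_response_time):
    """Reciprocal of the optimal response time minus reciprocal of the actual one."""
    return 1.0 / optimal_response_time - 1.0 / actual_response_time


example = loss(10.0, 20.0)  # a deployment twice as slow as the best available
```

The loss is zero when the deployed configuration matches the optimal response time and grows as the deployed configuration gets slower.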

Changing request patterns
Low/High text and Low/High Entropy

Different configurations are better for different request patterns

Changing request patterns
Alternating traffic characteristics

The request pattern alternates, switching every 10 iterations. Poor performance.

Using context
Coding the context

At the end of iteration $t$, categorise the traffic as HighEnt/LowEnt and as HighText/LowText.

Include Ent and Text as factors in the regression
Also the interactions Ent:Cache and Text:Compressor

Performance under different traffic characteristics is learned

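Using context
Decision-making

Thompson sampling implementation:

Use Bayesian linear regression. Then for each $t$:
• sample a $\beta^{Th}$ from the posterior at time $t$
• deploy the conf which maximises $((\mathrm{Ent}_{t-1}, \mathrm{Text}_{t-1}) \star x_{\mathrm{conf}})\,\beta^{Th}$, where $\star$ stands for combining the context with $x_{\mathrm{conf}}$ into one regression vector (the Ent and Text factors plus the Ent:Cache and Text:Compressor interactions)

This makes the working assumption that $(\mathrm{Ent}_t, \mathrm{Text}_t) = (\mathrm{Ent}_{t-1}, \mathrm{Text}_{t-1})$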
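One way to build the contextual regression vector is sketched below. It assumes 0/1 coding for HighEnt and HighText and dummy-coded compressor and cache choices; the layout of the vector and all names are hypothetical, since the slides give only the list of factors and interactions.

```python
def context_features(high_ent, high_text, compressor_dummies, cache_dummies):
    """Context main effects followed by the two interaction blocks."""
    ent, text = int(high_ent), int(high_text)
    features = [ent, text]                               # Ent, Text
    features += [ent * c for c in cache_dummies]         # Ent:Cache
    features += [text * c for c in compressor_dummies]   # Text:Compressor
    return features


def choose_config(beta, configs, previous_context):
    """configs: list of (x_conf, compressor_dummies, cache_dummies) triples."""
    # Working assumption from the slide: the last observed context persists.
    high_ent, high_text = previous_context

    def value(config):
        x_conf, cmp_d, cache_d = config
        x = list(x_conf) + context_features(high_ent, high_text, cmp_d, cache_d)
        return sum(xi * bi for xi, bi in zip(x, beta))

    return max(configs, key=value)
```

Here `beta` would be a posterior draw $\beta^{Th}$ whose length matches the combined vector. The interaction terms are what allow the preferred cache or compressor to change with the traffic type.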

Using context
Results

The request pattern alternates, switching every 10 iterations. Good performance.

Conclusion

• Contextual bandits and Thompson sampling: simple and (provably and empirically) effective
• Optimistic Bayesian sampling: removes negative exploratory bonus
• Extremely simple to deploy in more complicated settings
• Basic statistical approaches are a revelation to (some) ‘Data Scientists’


Backup slides

Copify
With G Malhotra, W Simm and R McVey

• Marketplace matching copywriting jobs with authors
• Copywriters select from the (ever-changing) available jobs

A Copify brief

The writer’s view

Copify’s challenge

The brief
Offer appropriate jobs to a writer when they log in

Main differentiating features:

Jobs: a relatively small amount of free text
Writers: history of jobs accepted/declined

Challenges include:
• only light computation is allowed
• zero to moderate data per writer
• each job is completed by only one writer
• a different set of available jobs on each login

Encoding a brief

Whenever a job arrives, it is coded into a regression vector $x$, consisting of:
• price
• reported topic category
• (SVD compressed) ‘bag of semantic topics’ counts

Learning writer preferences

For each writer $w$, we know
• which briefs they have been shown
• which briefs they have accepted

Simple logistic regression to estimate writer ‘preferences’ $\hat\beta_w$ and covariance $\Sigma_w = \mathrm{var}(\hat\beta_w)$. Updated each night for each writer.

If there is insufficient data (< 20 previous jobs), set $\hat\beta_w$ and $\Sigma_w$ to a globally-estimated version with inflated covariance
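The fallback rule can be sketched as a small function. The threshold of 20 previous jobs comes from the slide; the inflation factor of 4 and the (beta_hat, covariance) data layout are hypothetical choices made for illustration.

```python
def writer_model(n_previous_jobs, writer_fit, global_fit, min_jobs=20, inflation=4.0):
    """Return the (beta_hat, covariance) pair to use for one writer.

    writer_fit and global_fit are (beta_hat, covariance) pairs, with the
    covariance stored as a list of rows.
    """
    if n_previous_jobs < min_jobs:
        beta, cov = global_fit
        # Inflate the global covariance so a new writer is explored more widely.
        return list(beta), [[inflation * v for v in row] for row in cov]
    return writer_fit
```

A larger covariance spreads the sampled preferences more widely, so a writer with little history sees a more varied set of jobs until their own fit takes over.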

Displaying jobs

On page load, there are jobs j = 1, . . . , J waiting to be accepted

Thompson sampling principle:

System selects job $j$ with the probability that job $j$ is the best

Implementation in regression framework:

• sample $\beta_w^{TS} \sim N(\hat\beta_w, \Sigma_w)$,
• select $\arg\max_j x_j \beta_w^{TS}$

Optimistic version: replace $x_j \beta_w^{TS}$ with $\max\{x_j \beta_w^{TS}, x_j \hat\beta_w\}$

Displaying jobs

On page load, there are jobs j = 1, . . . , J waiting to be accepted

Thompson sampling principle:

System selects job $j$ with the probability that job $j$ is the best

Implementation in regression framework:

• sample $\beta_w^{TS} \sim N(\hat\beta_w, \Sigma_w)$,
• rank jobs according to $x_j \beta_w^{TS}$

Optimistic version: replace $x_j \beta_w^{TS}$ with $\max\{x_j \beta_w^{TS}, x_j \hat\beta_w\}$
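The ranking step can be sketched in dependency-free Python, in keeping with the requirement that only light computation is allowed. The sketch assumes $\Sigma_w$ is positive definite and is passed as a list of rows; the helper names and the toy job vectors in the usage lines are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sample_mvn(mean, cov, rng):
    """One draw from N(mean, cov), computed as mean + L z with cov = L L^T."""
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + dot(row, z) for m, row in zip(mean, low)]


def rank_jobs(job_vectors, beta_hat, cov, rng, optimistic=False):
    """Job indices ordered from most to least preferred under one sampled beta."""
    beta_ts = sample_mvn(beta_hat, cov, rng)

    def score(x):
        sampled = dot(x, beta_ts)
        # Optimistic version: never score a job below its point estimate.
        return max(sampled, dot(x, beta_hat)) if optimistic else sampled

    order = range(len(job_vectors))
    return sorted(order, key=lambda j: score(job_vectors[j]), reverse=True)


jobs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy regression vectors x_j
ranking = rank_jobs(jobs, [1.0, 2.0], [[0.5, 0.0], [0.0, 0.5]], random.Random(0))
```

Taking the first index of the ranking gives the argmax selection from the earlier slide. A fresh sample on each page load means the ordering varies between logins, which is where the exploration comes from.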

Effectiveness

The new brief is ranked highly. It is for a blog post about fantasy football. This writer has completed many tasks to do with football. The editorial team also know the writer to be “football mad”.

Effectiveness

Hopefully some performance stats
