Contextual Bandit Exploration (transcript)
![Page 1: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/1.jpg)
Contextual Bandit Exploration
John Langford, Microsoft Research, NYC
Machine Learning the Future, March 13, 2017
![Page 2: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/2.jpg)
Reminder: Contextual Bandit Setting

For t = 1, . . . , T:
1. The world produces some context x ∈ X
2. The learner chooses an action a ∈ A
3. The world reacts with reward r_a ∈ [0, 1]

Goal: Learn a good policy for choosing actions given context.

What does learning mean? Efficiently competing with some large reference class of policies Π = {π : X → A}:

  Regret = max_{π∈Π} average_t ( r_{π(x)} − r_a )
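The loop above is easy to pin down in code. A minimal sketch of the protocol, where the context distribution, the reward draw, and the uniform policy are all toy stand-ins of our own:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_contextual_bandit(policy, T=100, n_actions=3, dim=5):
    """Run the contextual bandit protocol for T rounds; return average reward."""
    total = 0.0
    for t in range(T):
        x = rng.normal(size=dim)        # 1. world produces a context x
        a = policy(x, n_actions)        # 2. learner chooses an action a
        r = float(rng.uniform())        # 3. world reacts with reward r_a in [0, 1]
        total += r
    return total / T

# the simplest possible learner: a uniformly random policy
avg = run_contextual_bandit(lambda x, k: int(rng.integers(k)))
```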
![Page 4: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/4.jpg)
What is exploration?

Exploration = choosing not-obviously-best actions to gather information for better performance in the future.

There are two kinds:
1. Deterministic. Choose action A, then B, then C, then A, then B, ...
2. Randomized. Choose random actions according to some distribution over actions.

We discuss Randomized here:
1. There are no good deterministic exploration algorithms in this setting.
2. Randomization supports off-policy evaluation.
3. Randomization is robust to delayed updates, which are very common in practice.
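The off-policy-evaluation point is why the logged action probability matters: once actions are randomized, inverse propensity scoring (IPS) gives an unbiased value estimate for any other policy from logged (x, a, r, p) tuples. A minimal sketch, with a toy logging setup of our own invention:

```python
import numpy as np

def ips_value(logged, target_policy):
    """IPS estimate of target_policy's value from logged (x, a, r, p)
    tuples, where p is the probability the logging policy assigned to
    the action it actually took."""
    total = 0.0
    for x, a, r, p in logged:
        if target_policy(x) == a:
            total += r / p
    return total / len(logged)

# toy world: 2 actions, action 0 always pays 1, action 1 pays 0;
# logging policy is uniform, so p = 1/2 for every logged action
rng = np.random.default_rng(0)
logged = []
for _ in range(10000):
    a = int(rng.integers(2))
    logged.append((None, a, float(a == 0), 0.5))

est = ips_value(logged, lambda x: 0)   # evaluate "always choose action 0"
# the true value of this policy is 1.0; IPS concentrates around it
```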
![Page 12: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/12.jpg)
Explore-τ then Follow the Leader (Explore-τ)

Initially, h = ∅. For the first τ rounds:
1. Observe x.
2. Choose a uniformly at random.
3. Observe r, and add (x, a, r, 1/|A|) to h.

For the next T − τ rounds, use the empirical best policy.

Suppose all examples are drawn from a fixed distribution D(x, r⃗).

Theorem: For all D, Π, Explore-τ has regret O( τ/T + √(|A| ln|Π| / τ) ) with high probability.

Proof: After τ rounds, a large-deviation bound gives

  | V̂(π) − E_{(x,r⃗)∼D}[ r_{π(x)} ] | ≤ √( |A| ln(|Π|/δ) / τ )

so the regret is bounded by

  τ/T + ((T − τ)/T) · √( |A| ln(|Π|/δ) / τ ) = O( (|A| ln|Π| / T)^{1/3} )

at the best τ.
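The algorithm can be sketched end to end in a toy world of our own: contexts are one of a few discrete types and rewards are Bernoulli with hypothetical per-(context, action) means. Policies here are just lookup tables from context type to action:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, n_act, tau = 3, 4, 2000
means = rng.uniform(size=(n_ctx, n_act))      # hypothetical reward means

# First τ rounds: act uniformly at random, record (x, a, r, 1/|A|)
h = []
for t in range(tau):
    x = int(rng.integers(n_ctx))
    a = int(rng.integers(n_act))
    r = float(rng.uniform() < means[x, a])
    h.append((x, a, r, 1.0 / n_act))

# Empirical best: importance-weighted value of each (context, action) pair
value = np.zeros((n_ctx, n_act))
count = np.zeros(n_ctx)
for x, a, r, p in h:
    value[x, a] += r / p
    count[x] += 1
policy = (value / np.maximum(count, 1)[:, None]).argmax(axis=1)
# For the remaining T − τ rounds the learner would just play policy[x],
# with no further updates.
```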
![Page 13: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/13.jpg)
Explore-τ summary

1. + Easiest approach: offline prerecorded exploration can feed into any learning algorithm.
2. - Doesn't adapt when the world changes.
3. - Underexploration is common. A clinical-trial problem: no new information after the initial exploration.
4. - Overexploration is common: it explores obviously suboptimal choices.

Can we do better?
![Page 15: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/15.jpg)
ε-Greedy

1. Observe x.
2. With probability 1 − ε:
   1. Choose the learned action a.
   2. Observe r, and learn with (x, a, r, 1 − ε).
3. With probability ε:
   1. Choose a uniformly random other action a.
   2. Observe r, and learn with (x, a, r, ε/(|A| − 1)).

Theorem: ε-Greedy has regret O( ε + √(|A| ln|Π| / (Tε)) ).

For the optimal ε? O( (|A| ln|Π| / T)^{1/3} )
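One round of the action-selection step above can be sketched as follows; the function name and the stand-in learner are our own, and the returned probability is exactly the importance weight the slide attaches to each example:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_step(x, learned_action, n_actions, epsilon):
    """One ε-Greedy round: exploit the learned action with probability
    1 − ε, otherwise explore uniformly among the OTHER actions.
    Returns (action, propensity) for importance-weighted learning."""
    if rng.uniform() >= epsilon:
        return learned_action, 1.0 - epsilon
    others = [b for b in range(n_actions) if b != learned_action]
    a = others[int(rng.integers(len(others)))]
    return a, epsilon / (n_actions - 1)
```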
![Page 20: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/20.jpg)
ε-Greedy summary

1. - Harder approach: needs an online learning algorithm.
2. + Adapts when the world changes.
3. - Overexploration is common: bad possibilities keep being explored.
4. + Can be made adaptive: Epoch-Greedy = ε-Greedy with an adaptive ε.

Can we do better?
![Page 22: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/22.jpg)
Better 1: Bagging Thompson Sampling

Maintain a Bayesian posterior over policies. On each round, sample a policy from the posterior and act with that policy.

Problem: Posteriors are intractable.
Solution: Treat the bootstrap as the posterior.
Problem: Too much variance.
Solution: Use bagging to compute the probability of each action.

Bagging Thompson Sampling. For each t = 1, 2, . . . :
1. Observe x.
2. Let p(a|x) = Pr_π(π(x) = a).
3. Choose a ∼ p(a|x).
4. Observe reward r.
5. Update each π Poisson(1)-many times with (x, a, r, p(a|x)).
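One round of the procedure can be sketched as below. The bagged learner is a hypothetical stand-in (a per-action value table with small random tie-breaking noise, so members of the bag can disagree as fresh bootstraps would); the Poisson(1) update count is the online-bootstrap trick from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

class BaggedPolicy:
    """Toy stand-in for one bagged learner (ignores the context x)."""
    def __init__(self, n_actions):
        self.v = rng.normal(scale=1e-3, size=n_actions)  # tie-breaking noise
        self.n = np.ones(n_actions)
    def act(self, x):
        return int(self.v.argmax())
    def update(self, x, a, r, p):
        # incremental mean of importance-weighted rewards r/p
        self.v[a] += (r / p - self.v[a]) / self.n[a]
        self.n[a] += 1

def bagging_thompson_round(bag, x, n_actions):
    # p(a|x) = fraction of bagged policies that would choose a
    votes = np.bincount([pi.act(x) for pi in bag], minlength=n_actions)
    p = votes / len(bag)
    a = int(rng.choice(n_actions, p=p))
    r = float(rng.uniform())                 # stand-in reward from the world
    for pi in bag:                           # Poisson(1)-many updates per policy
        for _ in range(rng.poisson(1.0)):
            pi.update(x, a, r, p[a])
    return a, p[a]
```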
![Page 28: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/28.jpg)
What does it mean?

1. + Avoids unnecessary exploration.
2. + Known to work well empirically, sometimes.
3. - Heuristic, with relatively weak theoretical guarantees.
4. - Underexplores sometimes.
![Page 29: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/29.jpg)
Better 2: A Cover Algorithm

Let Q_1 = the uniform distribution. For t = 1, . . . , T:
1. The world produces some context x ∈ X.
2. Draw π ∼ Q_t.
3. The learner chooses an action a ∈ A using π(x).
4. The world reacts with reward r_a ∈ [0, 1].
5. Update Q_{t+1}.
![Page 34: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/34.jpg)
What is a good Q_t?

Exploration: Q_t allows discovery of good policies.
Exploitation: Q_t is small on bad policies.

At time t, define:

  Empirical regret: Regret_t(π) = max_{π′} V̂(π′) − V̂(π)
  Minimum probability: µ = √( ln|Π| / (t|A|) )
  Regret certainty: b_π = Regret_t(π) / (100µ)

Find a distribution Q over policies π satisfying

  Σ_{π∈Π} Q(π) b_π ≤ 2|A|     (estimated regret is small)

  ∀π ∈ Π: (1/t) Σ_{τ=1}^{t} [ 1 / Q^µ(π(x_τ)|x_τ) ] ≤ 2|A| + b_π     (estimated variance is small for good π)
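The Q^µ appearing in the variance constraint is Q with its action probabilities smoothed toward uniform so that every action has probability at least µ. A small sketch of that smoothing step; the exact mixing form is our assumption, following the Taming the Monster paper:

```python
import numpy as np

def smoothed_action_probs(q_probs, mu):
    """Mix action probabilities with the uniform distribution so each of
    the k actions has probability at least mu (requires mu <= 1/k)."""
    q = np.asarray(q_probs, dtype=float)
    k = len(q)
    return (1.0 - k * mu) * q + mu

out = smoothed_action_probs([1.0, 0.0, 0.0], 0.1)
```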
![Page 35: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/35.jpg)
How do you find Q_t?

By reduction to an ArgMax Oracle (AMO).

Definition: Given a set of policies Π and data (x_1, v_1), . . . , (x_t, v_t), the AMO returns

  argmax_{π∈Π} Σ_{τ=1}^{t} v_τ(π(x_τ))

Use v_t(a) = Value_t(a) + 100µ / Q^µ(a|x_t) to get the worst constraint violator.

A mixture of successive constraint violators = Q_t. Very fun to prove!
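The oracle itself is just a cost-sensitive argmax over the policy class. A minimal sketch, with the (tiny) policy class represented as plain Python callables of our own:

```python
def argmax_oracle(policies, data):
    """ArgMax Oracle: given data (x_1, v_1), ..., (x_t, v_t), where each
    v_tau maps actions to values, return the policy in the class that
    maximizes the total value of the actions it would have chosen."""
    return max(policies, key=lambda pi: sum(v[pi(x)] for x, v in data))

# toy policy class over 2 actions: the two constant policies
policies = [lambda x: 0, lambda x: 1]
data = [(None, [0.0, 1.0]), (None, [0.2, 0.9])]
best = argmax_oracle(policies, data)   # action 1 totals 1.9 vs 0.2
```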
![Page 38: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/38.jpg)
Theorem: Optimal in all ways

Regret: O( √(|A| ln|Π| / T) )
Calls to the cost-sensitive classification oracle: O(T^{0.5}) (< T!)
Lower bound: Ω(T^{0.5}) calls to the oracle
Running time: O(T^{1.5})
![Page 39: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/39.jpg)
Trying it out

Change rcv1 CCAT-or-not to be classes 1 and 2.

vw --cbify 2 rcv1.train.multiclass.vw -c --epsilon 0.1
  Progressive 0/1 loss: 0.156
vw --cbify 2 rcv1.train.multiclass.vw -c --first 20000
  Progressive 0/1 loss: 0.082
vw --cbify 2 rcv1.train.multiclass.vw -c --bag 16 -b 22
  Progressive 0/1 loss: 0.059
vw --cbify 2 rcv1.train.multiclass.vw -c --cover 1
  Progressive 0/1 loss: 0.053

        ε-greedy  Initial  Bagging  LinUCB  Online Cover  Supervised
Loss    0.148     0.081    0.059    0.128   0.053         0.051
Time    17s       2.6s     275s     60h     12s           5.3s
![Page 41: Contextual Bandit Exploration](https://reader031.vdocuments.site/reader031/viewer/2022012504/617eef73c31a8f4d554ca3ae/html5/thumbnails/41.jpg)
Bibliography

Tau-first: Unclear first use.
ε-Greedy: Unclear first use.
Online Bag: N. Oza and S. Russell, Online Bagging and Boosting, AISTATS 2001.
Epoch: J. Langford and T. Zhang, The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits, NIPS 2007.
Thompson: W. R. Thompson, On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, Biometrika, 25(3-4):285-294, 1933.
Cover/Bag: A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, R. Schapire, Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits, ICML 2014.
Bootstrap: D. Eckles and M. Kaptein, Thompson Sampling with Online Bootstrap, arXiv:1410.4009.