Bandit Problems Methodology and Theory Model Combining Numerical Studies Conclusion
Treatment Allocations Based on Multi-Armed Bandit Strategies
Wei Qian and Yuhong Yang
Applied Economics and Statistics, University of Delaware
School of Statistics, University of Minnesota

Innovative Statistics and Machine Learning for Precision Medicine
September 15, 2017
1 Bandit Problems
2 Methodology and Theory
3 Model Combining
4 Numerical Studies
5 Conclusion
Standard Multi-Armed Bandit Problem
There is a wall of slot machines.
Each machine has a certain winning probability of paying $1.
Chances of winning are unknown to the game player.
At each time, one and only one machine can be played, and the immediate result is observed.
Goal: maximize the total number of wins over N plays.
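As a rough illustration of the setup above, here is a minimal epsilon-greedy simulation. The win probabilities, horizon, and exploration rate are hypothetical choices for illustration, not values from the talk.

```python
import random

def play_epsilon_greedy(probs, n_plays, eps=0.1, seed=0):
    """Play a wall of slot machines with unknown win probabilities `probs`."""
    rng = random.Random(seed)
    wins = [0] * len(probs)   # observed wins per machine
    pulls = [0] * len(probs)  # plays per machine
    total = 0
    for _ in range(n_plays):
        if rng.random() < eps or 0 in pulls:
            arm = rng.randrange(len(probs))  # explore a random machine
        else:
            # exploit: play the machine with the best empirical win rate
            arm = max(range(len(probs)), key=lambda i: wins[i] / pulls[i])
        reward = 1 if rng.random() < probs[arm] else 0
        wins[arm] += reward
        pulls[arm] += 1
        total += reward
    return total

# e.g. three machines whose win rates are unknown to the player
total_wins = play_epsilon_greedy([0.2, 0.5, 0.7], n_plays=5000)
```

The player never sees the probabilities, only the $1/$0 outcomes of the arms actually pulled.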
Exploration-Exploitation Tradeoff
Exploration: pull each arm enough times to learn the true reward probabilities.
Exploitation: use the existing information and play the “best” arm.
Motivation: Ethical Clinical Studies
Slot machines: different treatments to a certain disease
Survival probability: unknown to the doctor
Goal: sequentially assign treatments to patients to maximize the survival rate
A Real Example: ECMO Trial
ECMO for treating newborns with persistent pulmonary hypertension?
Ethical dilemma of using a conventional randomized controlled trial
– current patients versus future patients
– two hats on a participating doctor
A solution is response-adaptive design. L.J. Wei's randomized version of the play-the-winner rule was used in a study.

The ECMO trial has generated a lot of discussion. See, e.g., two Statistical Science papers in 1989 and 1991.
Motivation: Online Services
Web applications are generating massive data streams.
Online recommendation systems
– recommend articles to online newspaper readers
– recommend products to customers of online retailers
Motivation: Bandit Problem For Online Services
Slot machines: multiple articles
Each internet visit: one and only one article delivered
Clicking probability: unknown to the internet company
Goal: sequentially choose an article for internet users to maximize the total number of clicks or the click-through rate (CTR)
Bandit Problem With Covariates
Standard bandit problem assumes constant winning probabilities.
In practice, winning probability can be dependent on covariates.
Personalized medical service: treatment effects (e.g., survival probability) can be associated with patients' prognostic factors.
Personalized Web Service
Personalized online advertising and article recommendation: an internet user's interest in an ad or an article can be associated with some user information.
Multi-Armed Bandit with Covariate (MABC) for Precision Medicine
An example scenario:
A few FDA-approved drugs are available on the market for treating a certain disease.

Currently, doctors may choose among the available drugs based on limited information and a reading of scattered publications, if any.
Why not use the MABC framework for better medical practice?
Two-Armed Bandit Problem with Covariates
Two treatments (news articles): A and B
Patient (user) covariate x ∈ [0, 1]
Recovering (clicking) probability: fA(x), fB(x)
[Figure: clicking probabilities fA(x) and fB(x) over x ∈ [0, 1]]
Problem Setup: Two-Armed Bandit with Covariates
Problem Setup:
Given a bandit problem with two arms: treatments A and B
Unknown recovering probabilities given covariate x ∈ [0, 1]^d: fA(x), fB(x)
Covariates Xn, i.i.d. from continuous distribution PX
At each time n,
1 observe patient covariate Xn ∼ PX;
2 based on previous observations and Xn, apply a sequential allocation algorithm to choose the treatment In ∈ {A, B};
3 observe the result YIn,n ∼ Bernoulli(fIn(Xn)): recover, YIn,n = 1; otherwise, YIn,n = 0.
Question: how to design the sequential allocation algorithm?
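The three steps above can be sketched as a simulation loop. The curves fA and fB, the uniform covariate distribution, and the trivial alternating allocation rule below are illustrative assumptions standing in for any real choices.

```python
import random

def run_bandit(fA, fB, allocate, N, seed=1):
    """Run the two-armed-with-covariates loop for N patients."""
    rng = random.Random(seed)
    history = []                      # (x, arm, outcome) triples
    for n in range(1, N + 1):
        x = rng.random()              # X_n ~ P_X, here uniform on [0, 1]
        arm = allocate(x, history)    # choose I_n in {"A", "B"}
        p = fA(x) if arm == "A" else fB(x)
        y = 1 if rng.random() < p else 0  # Y_{I_n,n} ~ Bernoulli(f_{I_n}(X_n))
        history.append((x, arm, y))
    return history

# a trivial placeholder rule: alternate arms regardless of x
hist = run_bandit(lambda x: 0.6, lambda x: 0.4,
                  lambda x, h: "A" if len(h) % 2 == 0 else "B", N=100)
```

Any sequential allocation algorithm plugs in as `allocate`, which sees only the covariate and the past history.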
A Measure of Performance: Regret
Given patient covariate x,
– "optimal" strategy: give the treatment I*(x) := argmax_{i ∈ {A,B}} fi(x)
– "optimal" recovering probability: f*(x) := max_{i ∈ {A,B}} fi(x)

Suppose at time n, the patient covariate Xn is observed.
– "optimal" choice: I*(Xn)
– the algorithm chooses treatment In

regret_n = f*(Xn) − fIn(Xn)

To measure the overall performance, consider the cumulative regret

RN := Σ_{n=1}^{N} (f*(Xn) − fIn(Xn))

An algorithm is strongly consistent if RN = o(N) almost surely.
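The cumulative regret defined above is straightforward to compute for a given run. The constant curves fA and fB below are hypothetical, chosen so the answer is easy to check by hand.

```python
def cumulative_regret(xs, arms, fA, fB):
    """R_N = sum over rounds of f*(X_n) - f_{I_n}(X_n)."""
    R = 0.0
    for x, arm in zip(xs, arms):
        f_opt = max(fA(x), fB(x))              # f*(x)
        f_chosen = fA(x) if arm == "A" else fB(x)
        R += f_opt - f_chosen                  # per-round regret, always >= 0
    return R

fA = lambda x: 0.6   # constant curves keep the example checkable
fB = lambda x: 0.4
# always playing A is optimal here, so its cumulative regret is 0
assert cumulative_regret([0.5] * 10, ["A"] * 10, fA, fB) == 0.0
```

Always playing B instead would pay 0.2 per round, i.e. regret growing linearly in N, which is exactly what consistency rules out.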
Model Assumptions of fA and fB
Parametric framework
– Woodroofe, 1979; Auer, 2002; Li et al., 2010; Goldenshluger and Zeevi, 2009, 2013; Bastani and Bayati, 2016
– Linear models

Nonparametric framework
– Yang and Zhu, 2002; Rigollet and Zeevi, 2010; Perchet and Rigollet, 2013
Algorithms
Two articles A and B with clicking probabilities fA(x) and fB(x)
1 Deliver each article an equal number of times (e.g., each is delivered n0 = 20 times): I1 = A, I2 = B, ..., I_{2n0−1} = A, I_{2n0} = B.
2 For the next internet visit (n = 2n0 + 1), observe the internet user covariate Xn.
3 Estimate fA and fB using previous data to obtain fA,n and fB,n.
4 Find the more promising option: i_n = argmax_{i ∈ {A,B}} fi,n(Xn). Deliver an article with the randomization scheme:

In = i_n with probability 1 − πn; In = i with probability πn, for i ≠ i_n.

Observe the result YIn,n.
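The randomization scheme in step 4 can be sketched as a single function. The estimates fA_hat, fB_hat and the value of π_n below are hypothetical stand-ins for whatever regression method and exploration schedule are in use.

```python
import random

def choose_arm(x, fA_hat, fB_hat, pi_n, rng):
    """Play the estimated-best arm w.p. 1 - pi_n, the other arm w.p. pi_n."""
    best = "A" if fA_hat(x) >= fB_hat(x) else "B"
    other = "B" if best == "A" else "A"
    return best if rng.random() >= pi_n else other

rng = random.Random(0)
picks = [choose_arm(0.3, lambda x: 0.8, lambda x: 0.2, 0.2, rng)
         for _ in range(1000)]
# roughly 80% of picks should be the better-looking arm A
frac_A = picks.count("A") / len(picks)
```

Keeping π_n > 0 forces continued exploration of the apparently worse arm, which is what lets the estimates keep improving everywhere.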
Kernel Estimation
Given article A, at each time point n, define

JA,n = {j : Ij = A, 1 ≤ j ≤ n − 1}

Nadaraya-Watson estimator of fA(x):

fA,n(x) = Σ_{j ∈ JA,n} YA,j K((x − Xj)/hn) / Σ_{j ∈ JA,n} K((x − Xj)/hn)

kernel function K(u): R^d → R; bandwidth hn

Epanechnikov quadratic kernel:

K(u) = (3/4)(1 − u²) I(‖u‖ ≤ 1)
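A minimal one-dimensional version of the Nadaraya-Watson estimator above, with the Epanechnikov kernel; the (x_j, y_j) pairs below are hypothetical data from the arm-A history J_{A,n}.

```python
def epanechnikov(u):
    """Epanechnikov quadratic kernel on the real line."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0

def nw_estimate(x, data, h):
    """Nadaraya-Watson estimate at x from (x_j, y_j) pairs with bandwidth h."""
    num = sum(y * epanechnikov((x - xj) / h) for xj, y in data)
    den = sum(epanechnikov((x - xj) / h) for xj, _ in data)
    return num / den if den > 0 else None  # None: no data in the window

data = [(0.50, 1), (0.55, 1), (0.60, 0), (0.95, 0)]
est = nw_estimate(0.52, data, h=0.15)  # the point at 0.95 gets zero weight
```

The estimate is just a kernel-weighted average of the observed outcomes near x, so observations far outside the bandwidth contribute nothing.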
A UCB-Type Kernel Estimator
Upper Confidence Bound (UCB) kernel estimator:

f̃A,n(x) = Σ_{j ∈ JA,n} YA,j K((x − Xj)/hn) / Σ_{j ∈ JA,n} K((x − Xj)/hn) + UA,n(x)

A "standard error" quantity:

UA,n(x) = c √( (log N) Σ_{j ∈ JA,n} K²((x − Xj)/hn) / [ Σ_{j ∈ JA,n} K((x − Xj)/hn) ]² )

Under the uniform kernel K(u) = I(‖u‖∞ ≤ 1), with NA,n(x) = Σ_{j ∈ JA,n} I(‖Xj − x‖∞ ≤ h),

UA,n(x) = c √( (log N) / NA,n(x) )
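A sketch of the uniform-kernel special case above, where the UCB quantity reduces to c·√(log N / N_{A,n}(x)). The constant c and the data points are illustrative assumptions.

```python
import math

def ucb_estimate(x, data, h, c, N):
    """Local mean plus c * sqrt(log N / N_local) under the uniform kernel."""
    inside = [(xj, y) for xj, y in data if abs(xj - x) <= h]
    if not inside:
        return float("inf")     # unexplored region: maximal optimism
    n_local = len(inside)       # N_{A,n}(x): points within the window
    mean = sum(y for _, y in inside) / n_local
    return mean + c * math.sqrt(math.log(N) / n_local)

data = [(0.50, 1), (0.55, 0), (0.60, 1)]
u = ucb_estimate(0.52, data, h=0.1, c=0.5, N=800)
```

The bonus term shrinks as more observations fall into the window around x, so the optimistic estimate converges to the local average exactly where data accumulate.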
Algorithm Illustration
Deliver each article 20 times. X1 = 0.93, article A
[Figure: clicking-probability plot at time n = 1, nA = 1, nB = 0]
Algorithm Illustration
Deliver each article 20 times. X2 = 0.88, article B
[Figure: clicking-probability plot at time n = 2, nA = 1, nB = 1]
Algorithm Illustration
Deliver each article 20 times.
[Figure: clicking-probability plot at time n = 40, nA = 20, nB = 20]
Algorithm Illustration
X41 = 0.52. Estimate fA(X41) and fB(X41) by kernel estimation.
[Figure: clicking-probability plot at time n = 40, nA = 20, nB = 20]
Algorithm Illustration
Estimate fA(X41)
[Figure: clicking-probability plot at time n = 40, nA = 20, nB = 20]
Algorithm Illustration
Estimate fA(X41): consider a window [X41 − h, X41 + h]. Similar information may give similar clicking probability.
[Figure: clicking-probability plot at time n = 40, nA = 20, nB = 20]
Algorithm Illustration
Estimate fA(X41): consider a window [X41 − h, X41 + h]. f̂A(X41) = 0
[Figure: clicking-probability plot at time n = 40, nA = 20, nB = 20]
Algorithm Illustration
Estimate fB(X41): consider a window [X41 − h, X41 + h]. f̂B(X41) = 0.7996
[Figure: clicking-probability plot at time n = 40, nA = 20, nB = 20]
Algorithm Illustration
Article B looks more promising: f̂A(X41) < f̂B(X41). With πn = 20%: P(I41 = B | H41) = 80%, P(I41 = A | H41) = 20%.
[Figure: clicking-probability plot at time n = 40, nA = 20, nB = 20]
Algorithm Illustration
Continue the process with decreasing hn and πn to the end.
[Figure: clicking-probability plot at time n = 800, nA = 349, nB = 451]
Challenges and Contributions
Partial information in bandit problem
Breakdown of i.i.d. assumptions: existing consistency results for kernel estimation under i.i.d. or weak-dependence assumptions do not apply.

Technical tools to develop new arguments
– martingale theories
– Hoeffding-type inequalities
– "chaining" methods

Strong consistency and finite-time analysis
Dimension reduction and model combination
Asymptotic Performance
Theorem (Qian and Yang, JMLR, 2016a)
If the fi (i ∈ {A,B}) are uniformly continuous, and hn and πn are chosen to satisfy hn → 0, πn → 0 and n hn^{2d} πn^4 / (log n)^3 → ∞, then the Nadaraya-Watson estimators are uniformly strongly consistent, that is, for each i ∈ {A,B},

sup_{x ∈ [0,1]^d} |fi,n(x) − fi(x)| → 0 a.s. as n → ∞.

Uniform strong consistency of the estimators implies that RN = o(N) almost surely. Equivalently,

(Σ_{n=1}^{N} YIn,n) / (Σ_{n=1}^{N} Y*n) → 1 a.s. as N → ∞
Finite-Time Regret Analysis
Modulus of continuity: ω(h; f) = sup_{‖x1 − x2‖ ≤ h} |f(x1) − f(x2)|

Hölder continuity: ω(h; fi) ≤ ρ h^κ (0 < κ ≤ 1)

Theorem (Qian and Yang, JMLR, 2016a)

There exists nδ ≪ N such that, with probability larger than 1 − 2δ,

RN < C1 nδ + Σ_{n=nδ}^{N} ( 2 max_{i ∈ {A,B}} ω(hn; fi) + √( C2 log(N) / (n hn^d πn) ) + πn ) + C3 √( N log(1/δ) ).

Upper bound of f*(Xn) − fIn(Xn)
– estimation bias: ω(hn; fi)
– estimation variance: C2 log(N)/(n hn^d πn)
– exploration price: πn
Finite-Time Regret Analysis (continued)

Upper bound of f*(Xn) − fIn(Xn)
– nonparametric estimation: bias-variance tradeoff
– bandit problem: exploration-exploitation tradeoff
Finite-Time Regret Upper Bounds
Under Hölder continuity, when using the kernel UCB-type estimator,

E[RN] < C N^{1 − 1/(2 + d/κ)} (log N)^c.

– Larger d and smaller κ give a larger power index.
– Matches the minimax rate of Perchet and Rigollet (2013) up to a logarithmic factor.
Adaptive performance (Qian and Yang, EJS, 2016b): the near-minimax rate can be achieved without knowing κ a priori (0 < c∗ ≤ κ ≤ 1).
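The power index in the bound above can be computed directly, confirming that larger d or smaller κ pushes the exponent toward 1 (a slower sublinear rate):

```python
def regret_exponent(d, kappa):
    """Power index 1 - 1/(2 + d/kappa) in the regret bound E[R_N] ~ N^index."""
    return 1 - 1 / (2 + d / kappa)

assert regret_exponent(1, 1) == 1 - 1 / 3               # d=1, Lipschitz: N^(2/3)
assert regret_exponent(5, 1) > regret_exponent(1, 1)    # larger d, worse rate
assert regret_exponent(1, 0.5) > regret_exponent(1, 1)  # rougher f, worse rate
```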
Model Combining
Different regression methods
– kernel estimation, histogram, K-nearest neighbors
– linear regression
Model combining: weighted average of different statistical models
AFTER (Yang, 2004): combines different forecasting procedures
Data-driven algorithm with robust performance
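The weighted-average idea can be sketched as sequential exponential weighting. This is a simplified stand-in, not AFTER itself (AFTER uses likelihood-based weights; the squared-error weighting and the rate `lam` here are assumptions): each forecaster's weight at time t depends only on its performance up to t−1.

```python
import numpy as np

def combine_forecasts(preds, y, lam=1.0):
    """Sequentially combine K forecasters by exponential weighting.

    preds: (T, K) array, preds[t, k] = forecaster k's prediction at time t.
    y:     (T,) array of observed outcomes.
    Returns the (T,) combined forecasts.
    """
    T, K = preds.shape
    loss = np.zeros(K)                           # cumulative squared error per forecaster
    out = np.empty(T)
    for t in range(T):
        w = np.exp(-lam * (loss - loss.min()))   # subtract min for numerical stability
        w /= w.sum()
        out[t] = w @ preds[t]                    # weighted average of current predictions
        loss += (preds[t] - y[t]) ** 2           # update losses after observing y[t]
    return out
```

With one accurate and one biased forecaster, the weights concentrate on the accurate one within a few rounds, which is the robustness the slide refers to.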
Model Combining – Illustration
[Figure: clicking probability as a function of the covariate x ∈ [0, 1], showing the curves fA(x) and fB(x)]
fA(x) = 0.7 e^{−30(x−0.2)²} + 0.7 e^{−30(x−0.8)²}
fB(x) = 0.65 − 0.3x
Time horizon N = 800, exploration probability πn = 1/log² n
Model Combining
1 Nadaraya-Watson estimation (h1 and h2)
2 Linear regression
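The simulation setup on this slide can be sketched as follows. This is a simplified stand-in for illustration only: it uses a running-mean estimate per arm rather than the combined Nadaraya-Watson/linear-regression models, and the forced-exploration rule with probability πn is the assumed randomized-allocation mechanism.

```python
import numpy as np

def fA(x):  # two-bump success probability from the slide
    return 0.7 * np.exp(-30 * (x - 0.2) ** 2) + 0.7 * np.exp(-30 * (x - 0.8) ** 2)

def fB(x):  # linear success probability from the slide
    return 0.65 - 0.3 * x

def run_randomized_allocation(N=800, seed=0):
    """Explore with probability pi_n = 1/log^2 n, otherwise pick the arm
    with the higher running mean reward. Returns per-round regret r_N."""
    rng = np.random.default_rng(seed)
    sums, counts = np.zeros(2), np.zeros(2)
    regret = 0.0
    for n in range(1, N + 1):
        x = rng.uniform()
        f = np.array([fA(x), fB(x)])           # true success probabilities at x
        pi_n = 1.0 / np.log(n + 1) ** 2        # forced-exploration probability
        if rng.uniform() < pi_n or counts.min() == 0:
            arm = int(rng.integers(2))         # explore uniformly
        else:
            arm = int(np.argmax(sums / counts))  # exploit running means
        sums[arm] += rng.uniform() < f[arm]    # Bernoulli reward
        counts[arm] += 1
        regret += f.max() - f[arm]             # regret vs. the oracle arm at x
    return regret / N
```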
Model Combining – Adaptive Performance
Per-round regret rn = Rn/n

[Figure: per-round regret rn versus n ∈ [0, 800] for the combined strategy, Nadaraya-Watson (h1), Nadaraya-Watson (h2), and linear regression]
Yahoo! Front Page Today Module Dataset
46 million internet visit events with user responses and five user covariates over ten days.
Contains a pool of about 10 editor-picked news articles.
The raw data file is about 8 GB per day.
Algorithms are implemented efficiently in C++.
Potentially adapted for online applications.
Evaluation Results
Algorithms evaluated by click-through rate (CTR).
– Complete random
– Naive simple average (no covariates)
– LinUCB (Chapelle and Li, 2011): Bayesian logistic regression based algorithm
– Model combining:
  Kernel estimation (h1 = n^{−1/6}, h2 = n^{−1/8}, h3 = n^{−1/10})
  Naive simple average
                      random   Naive   LinUCB   Combining
avg. normalized CTR    1.00    1.189    1.225     1.237
std. dev.                –     0.005    0.041     0.018
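Offline evaluation on such randomly logged data is commonly done with a replay estimator: an event counts only when the candidate policy picks the same article the logger displayed. This sketch is an assumption about the evaluation mechanics, not the authors' C++ implementation; `replay_ctr` and the event format are hypothetical.

```python
def replay_ctr(events, policy):
    """Replay evaluation on uniformly-random logged events.

    events: iterable of (covariates, displayed_arm, click), where the
            displayed arm was chosen uniformly at random in the log.
    policy: callable covariates -> arm.
    Returns the CTR over the events where the policy matched the log.
    """
    clicks = matched = 0
    for x, arm, click in events:
        if policy(x) == arm:       # only matched events are scored
            matched += 1
            clicks += click
    return clicks / matched if matched else 0.0
```

Because the logged arms are uniform, the matched subset is an unbiased sample of what the policy would have shown, which makes the resulting CTR comparable across algorithms as in the table above.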
Conclusion
Precision medicine demands "online" learning for optimal treatment results

Multi-armed bandits with covariates (MABC) provide a framework for designing effective treatment allocation rules that integrates learning from experimentation with maximizing the benefits to the patients along the process
Many theoretical and practical issues need to be addressed
Some References
Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002), "Finite-time analysis of the multiarmed bandit problem," Machine Learning, 47, 235-256.

Lai, T. L. and Robbins, H. (1985), "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics, 6, 4-22.

Perchet, V. and Rigollet, P. (2013), "The multi-armed bandit problem with covariates," The Annals of Statistics, 41, 693-721.

Qian, W. and Yang, Y. (2016a), "Kernel estimation and model combination in a bandit problem with covariates," Journal of Machine Learning Research, 17, 1-37.

Qian, W. and Yang, Y. (2016b), "Randomized allocation with arm elimination in a bandit problem with covariates," Electronic Journal of Statistics, 10, 242-270.

Robbins, H. (1952), "Some aspects of the sequential design of experiments," Bulletin of the American Mathematical Society, 58, 527-535.

Woodroofe, M. (1979), "A one-armed bandit problem with a concomitant variable," Journal of the American Statistical Association, 74, 799-806.

Yang, Y. (2004), "Combining forecasting procedures: some theoretical results," Econometric Theory, 20, 176-222.

Yang, Y. and Zhu, D. (2002), "Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates," The Annals of Statistics, 30, 100-121.

Yahoo! Academic Relations (2011), Yahoo! front page today module user click log dataset, version 1.0. (Available from http://webscope.sandbox.yahoo.com.)