TRANSCRIPT
Counterfactual Model for Online Systems
CS 7792 - Fall 2016
Thorsten Joachims
Department of Computer Science & Department of Information Science
Cornell University
Imbens & Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences, 2015. Chapters 1, 3, 12.
Interactive System Schematic

    Context x → System π0 → Action y for x
    Utility: U(π0)
News Recommender
• Context x: user
• Action y: portfolio of news articles
• Feedback δ(x, y): reading time in minutes

Ad Placement
• Context x: user and page
• Action y: ad that is placed
• Feedback δ(x, y): click / no-click

Search Engine
• Context x: query
• Action y: ranking
• Feedback δ(x, y): win/loss against baseline in interleaving
Log Data from Interactive Systems
• Data
      S = ((x1, y1, δ1), …, (xn, yn, δn))
  Partial-information (aka "contextual bandit") feedback
• Properties
  – Contexts xi drawn i.i.d. from unknown P(X)
  – Actions yi selected by existing system π0: X → Y
  – Feedback δi from unknown function δ: X × Y → ℝ

[Zadrozny et al., 2003] [Langford & Li] [Bottou et al., 2014]
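A log record is just the triple (xi, yi, δi); a minimal sketch of such a log in Python, with field names and values that are illustrative rather than from the lecture:

```python
from collections import namedtuple

# One contextual-bandit log record: the context x_i, the action y_i that
# the deployed system pi_0 chose, and the observed feedback delta_i.
LogEntry = namedtuple("LogEntry", ["context", "action", "feedback"])

# A tiny hypothetical log S from the news-recommender setting
# (feedback = reading time in minutes).
S = [
    LogEntry(context="user_7", action="portfolio_3", feedback=4.5),
    LogEntry(context="user_12", action="portfolio_1", feedback=0.0),
]

print(len(S), S[0].feedback)  # 2 4.5
```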
Goal
• Use interaction log data
      S = ((x1, y1, δ1), …, (xn, yn, δn))
  for evaluation of system π:
• Estimate online measures of some system π offline.
• System π can be different from the π0 that generated the log.
Evaluation: Outline
• Evaluating Online Metrics Offline
  – A/B Testing (on-policy)
  – Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
  – Estimation via reward prediction
• Approach 2: "Model the bias"
  – Counterfactual Model
  – Inverse propensity scoring (IPS) estimator
Online Performance Metrics
• Example metrics
  – CTR
  – Revenue
  – Time-to-success
  – Interleaving
  – Etc.
• Correct choice depends on the application and is not the focus of this lecture.
• This lecture: metric encoded as δ(x, y) [click/payoff/time for (x, y) pair]
System
• Definition [Deterministic Policy]:
  Function y = π(x) that picks action y for context x.
• Definition [Stochastic Policy]:
  Distribution π(y|x) that samples action y given context x.
(Figure: two example stochastic policies π1(Y|x) and π2(Y|x).)
System Performance

Definition [Utility of Policy]:
The expected reward / utility U(π) of policy π is
    U(π) = ∫∫ δ(x, y) π(y|x) P(x) dx dy
e.g., δ(x, y) = reading time of user x for portfolio y.
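For finite context and action spaces the integrals in U(π) become sums, so the utility can be computed exactly on a toy example; all probabilities and rewards below are hypothetical:

```python
# U(pi) = sum_x sum_y delta(x, y) * pi(y|x) * P(x) on a toy discrete domain.
P = {"x1": 0.5, "x2": 0.5}                    # context distribution P(X)
pi = {"x1": {"a": 0.9, "b": 0.1},             # stochastic policy pi(y|x)
      "x2": {"a": 0.2, "b": 0.8}}
delta = {("x1", "a"): 1.0, ("x1", "b"): 0.0,  # reward delta(x, y)
         ("x2", "a"): 0.0, ("x2", "b"): 1.0}

U = sum(P[x] * pi[x][y] * delta[(x, y)] for x in P for y in pi[x])
print(U)  # 0.5*0.9*1 + 0.5*0.8*1 ≈ 0.85
```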
Online Evaluation: A/B Testing

Given S = ((x1, y1, δ1), …, (xn, yn, δn)) collected under π0:
    Û(π0) = (1/n) Σi δi
A/B Testing:
    Deploy π1: draw x ∼ P(X), predict y ∼ π1(Y|x), get δ(x, y)
    Deploy π2: draw x ∼ P(X), predict y ∼ π2(Y|x), get δ(x, y)
    ⋮
    Deploy π|H|: draw x ∼ P(X), predict y ∼ π|H|(Y|x), get δ(x, y)
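On-policy A/B evaluation scores each deployed policy by the mean feedback on its own traffic slice; a sketch with hypothetical click data:

```python
# A/B estimate: U_hat(pi_k) = mean of the feedback delta observed on the
# traffic slice served by pi_k (click data below is made up).
def ab_estimate(deltas):
    return sum(deltas) / len(deltas)

clicks_pi1 = [1, 0, 0, 1, 1]  # feedback logged while pi_1 was deployed
clicks_pi2 = [0, 0, 1, 0, 0]  # feedback logged while pi_2 was deployed

print(ab_estimate(clicks_pi1), ab_estimate(clicks_pi2))  # 0.6 0.2
```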
Pros and Cons of A/B Testing
• Pros
  – User-centric measure
  – No need for manual ratings
  – No user/expert mismatch
• Cons
  – Requires interactive experimental control
  – Risk of fielding a bad or buggy πi
  – Number of A/B tests limited
  – Long turnaround time
Evaluating Online Metrics Offline
• Online: On-policy A/B Test
  – Draw S1 from π1, S2 from π2, …, S7 from π7: each candidate policy must be fielded to collect its own sample.
• Offline: Off-policy Counterfactual Estimates
  – Draw a single S from π0 and reuse the same log to evaluate all candidate policies.
(Figure: the logged samples re-weighted for each of π1 … π7.)
Evaluation: Outline
• Evaluating Online Metrics Offline
  – A/B Testing (on-policy)
  – Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
  – Estimation via reward prediction
• Approach 2: "Model the bias"
  – Counterfactual Model
  – Inverse propensity scoring (IPS) estimator
Approach 1: Reward Predictor
• Idea:
  – Use S = ((x1, y1, δ1), …, (xn, yn, δn)) from π0 to estimate a reward predictor δ̂(x, y)
• Deterministic π: simulated A/B testing with predicted δ̂(x, y)
  – For the actions y'i = π(xi) of the new policy π, generate the predicted log
        S' = ((x1, y'1, δ̂(x1, y'1)), …, (xn, y'n, δ̂(xn, y'n)))
  – Estimate the performance of π via
        Û_RP(π) = (1/n) Σi δ̂(xi, y'i)
• Stochastic π:
        Û_RP(π) = (1/n) Σi Σy δ̂(xi, y) π(y|xi)
Regression for Reward Prediction

Learn δ̂: X × Y → ℝ
1. Represent via features Φ(x, y)
2. Learn a regression based on Φ(x, y) from S collected under π0
3. Predict δ̂(x, y') for the actions y' = π(x) of the new policy π
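Steps 1–3 can be sketched with an ordinary least-squares fit; the features Φ, the "true" weights, and the new-policy actions below are synthetic placeholders, not the lecture's experiment:

```python
import numpy as np

# "Model the world": fit delta_hat(x, y) = Phi(x, y) @ w on the logged
# data, then average predictions over the new policy's actions.
rng = np.random.default_rng(0)

Phi_logged = rng.normal(size=(100, 3))    # Phi(x_i, y_i) for actions of pi_0
true_w = np.array([1.0, -2.0, 0.5])
delta_logged = Phi_logged @ true_w        # observed feedback (noiseless here)

# Step 2: least-squares regression on the log collected under pi_0
w_hat, *_ = np.linalg.lstsq(Phi_logged, delta_logged, rcond=None)

# Step 3: predict rewards for the new policy's actions y'_i = pi(x_i)
Phi_new = rng.normal(size=(100, 3))       # Phi(x_i, y'_i)
U_rp = float((Phi_new @ w_hat).mean())    # U_hat_RP(pi)

print(bool(np.max(np.abs(w_hat - true_w)) < 1e-8))  # True: noiseless fit
```

With noiseless synthetic feedback the regression recovers the weights exactly; on real logs the quality of Û_RP(π) depends entirely on how well δ̂ extrapolates to actions π0 rarely took.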
News Recommender: Experiment Setup
• Context x: user profile
• Action y: ranking
  – Pick from 7 candidates to place into 3 slots
• Reward δ: "revenue"
  – Complicated hidden function
• Logging policy π0: non-uniform randomized logging system
  – Plackett-Luce "explore around current production ranker"
News Recommender: Results
RP is inaccurate even with more training and logged data
Problems of Reward Predictor
• Modeling bias
  – Choice of features and model
• Selection bias
  – π0's actions are over-represented in S
        Û_RP(π) = (1/n) Σi δ̂(xi, π(xi))
• Can be unreliable and biased
Evaluation: Outline
• Evaluating Online Metrics Offline
  – A/B Testing (on-policy)
  – Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
  – Estimation via reward prediction
• Approach 2: "Model the bias"
  – Counterfactual Model
  – Inverse propensity scoring (IPS) estimator
Approach 2: "Model the Bias"
• Idea:
  Fix the mismatch between the distribution π0(Y|x) that generated the data and the distribution π(Y|x) we aim to evaluate:
      U(π0) = ∫∫ δ(x, y) π0(y|x) P(x) dx dy
  i.e., re-weight the logged data so that π0(y|x) is replaced by π(y|x).
Counterfactual Model
• Example: Treating Heart Attacks
  – Treatments Y: bypass / stent / drugs
  – Chosen treatment for patient xi: yi
  – Outcomes δi: 5-year survival (0 / 1)
  – Which treatment is best?
(Figure: table of observed 0/1 outcomes for patients i = 1, …, n.)
Counterfactual Model
• Search analogy: placing a vertical
  – Treatment: Pos 1 / Pos 2 / Pos 3
  – Outcome: click / no click on SERP
(Figure: the same outcome table, with treatments relabeled as SERP positions.)
Counterfactual Model
• Example: Treating Heart Attacks
  – Which treatment is best?
    • Everybody drugs? Everybody stent? Everybody bypass?
    • Naive averages over the patients who actually received each treatment: drugs 3/4, stent 2/3, bypass 2/4. Really?
(Figure: table of observed 0/1 outcomes for patients i = 1, …, n.)
Treatment Effects
• Average Treatment Effect of treatment y:
      U(y) = (1/n) Σi δ(xi, y)
• Example (n = 11):
  – U(bypass) = 4/11
  – U(stent) = 6/11
  – U(drugs) = 3/11
(Figure: per patient, the factual outcome under the assigned treatment and the counterfactual outcomes under the alternatives.)
Assignment Mechanism
• Probabilistic Treatment Assignment
  – For patient i: π0(Yi = y | xi)
  – Selection bias
• Inverse Propensity Score Estimator
      Û(y) = (1/n) Σi 1{yi = y} / pi · δ(xi, yi)
  – Propensity: pi = π0(Yi = yi | xi)
  – Unbiased: E[Û(y)] = U(y), if π0(Yi = y | xi) > 0 for all i
• Example
      Û(drugs) = (1/11) · (1/0.8 + 1/0.7 + 1/0.8 + 0/0.1) ≈ 0.36 < 0.75
(Figure: outcome table annotated with the assignment probabilities π0(Yi = y | xi).)
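The drugs computation can be reproduced directly; the outcomes and propensities are the ones from the slide's example, while the helper function and field layout are ours:

```python
# IPS estimate of U(y) for a single treatment y:
# U_hat(y) = (1/n) * sum_i 1{y_i = y} / p_i * delta(x_i, y_i)
def ips_treatment_estimate(log, n, treatment):
    """log: (assigned_treatment, outcome, propensity) per patient."""
    return sum(delta / p for (y, delta, p) in log if y == treatment) / n

# The four patients who received drugs (outcomes 1, 1, 1, 0) with their
# logged propensities; the other 7 of the n = 11 patients got stent/bypass.
log = [
    ("drugs", 1, 0.8),
    ("drugs", 1, 0.7),
    ("drugs", 1, 0.8),
    ("drugs", 0, 0.1),
]

u_drugs = ips_treatment_estimate(log, n=11, treatment="drugs")
print(round(u_drugs, 2))  # 0.36, well below the naive average 3/4 = 0.75
```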
Experimental vs. Observational
• Controlled Experiment
  – Assignment mechanism under our control
  – Propensities pi = π0(Yi = yi | xi) are known by design
  – Requirement: ∀y: π0(Yi = y | xi) > 0 (probabilistic)
• Observational Study
  – Assignment mechanism not under our control
  – Propensities pi need to be estimated
  – Estimate π̂0(yi | zi) ≈ π0(yi | xi) based on features zi
  – Requirement: π̂0(yi | zi) = π̂0(yi | δi, zi) (unconfounded)
Conditional Treatment Policies
• Policy (deterministic)
  – Context xi describing the patient
  – Pick treatment yi based on xi: yi = π(xi)
  – Example policy: π(A) = drugs, π(B) = stent, π(C) = bypass
• Average Treatment Effect
      U(π) = (1/n) Σi δ(xi, π(xi))
• IPS Estimator
      Û(π) = (1/n) Σi 1{yi = π(xi)} / pi · δ(xi, yi)
(Figure: outcome table with patient types A / B / C.)
Stochastic Treatment Policies
• Policy (stochastic)
  – Context xi describing the patient
  – Pick treatment y based on xi: π(Y|xi)
• Note
  – The assignment mechanism is a stochastic policy as well!
• Average Treatment Effect
      U(π) = (1/n) Σi Σy δ(xi, y) π(y|xi)
• IPS Estimator
      Û(π) = (1/n) Σi π(yi|xi) / pi · δ(xi, yi)
(Figure: outcome table with patient types A / B / C.)
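Both policy-level IPS estimators fit one function, since a deterministic policy is just the stochastic policy with π(y|x) = 1{y = π(x)}. A sketch; the logged patients, outcomes, and propensities are hypothetical:

```python
# IPS estimate for a policy pi:
# U_hat(pi) = (1/n) * sum_i pi(y_i|x_i) / p_i * delta(x_i, y_i)
def ips_policy_estimate(log, policy):
    """log: (x, y, delta, propensity); policy(y, x) returns pi(y|x)."""
    return sum(policy(y, x) * delta / p for (x, y, delta, p) in log) / len(log)

# Deterministic example policy from the slide: pi(A)=drugs, pi(B)=stent,
# pi(C)=bypass, encoded as the indicator pi(y|x) = 1{y = pi(x)}.
assignments = {"A": "drugs", "B": "stent", "C": "bypass"}
policy = lambda y, x: 1.0 if assignments[x] == y else 0.0

# Hypothetical log of 4 patients: (type, treatment, outcome, propensity).
log = [
    ("A", "drugs", 1, 0.5),
    ("B", "stent", 1, 0.25),
    ("C", "drugs", 1, 0.5),   # treatment disagrees with pi(C): weight 0
    ("A", "stent", 0, 0.5),
]
print(ips_policy_estimate(log, policy))  # (2.0 + 4.0 + 0 + 0) / 4 = 1.5
```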
Counterfactual Model = Logs

                   Medical               Search Engine   Ad Placement       Recommender
  Context xi       Diagnostics           Query           User + Page        User + Movie
  Treatment yi     Bypass/Stent/Drugs    Ranking         Placed Ad          Watched Movie
  Outcome δi       Survival              Click metric    Click / no Click   Star rating
  Propensities pi  controlled (*)        controlled      controlled         observational
  New policy π     FDA Guidelines        Ranker          Ad Placer          Recommender
  T-effect U(π)    Average quality of new policy.

(Context, treatment, outcome, and propensities are recorded in the log.)
Evaluation: Outline
• Evaluating Online Metrics Offline
  – A/B Testing (on-policy)
  – Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
  – Estimation via reward prediction
• Approach 2: "Model the bias"
  – Counterfactual Model
  – Inverse propensity scoring (IPS) estimator
System Evaluation via Inverse Propensity Scoring

Definition [IPS Utility Estimator]: Given S = ((x1, y1, δ1), …, (xn, yn, δn)) collected under π0,
    Û_IPS(π) = (1/n) Σi δi · π(yi|xi) / π0(yi|xi)
with logged propensity π0(yi|xi).
• Unbiased estimate of the utility of any π, if the propensity is nonzero whenever π(yi|xi) > 0.
• Note: if π = π0, this reduces to the online A/B test estimate
      Û(π0) = (1/n) Σi δi
  Off-policy vs. on-policy estimation.

[Horvitz & Thompson, 1952] [Rosenbaum & Rubin, 1983] [Zadrozny et al., 2003] [Li et al., 2011]
Illustration of IPS

IPS Estimator:
    Û_IPS(π) = (1/n) Σi π(yi|xi) / π0(yi|xi) · δi
(Figure: overlap of the logging distribution π0(Y|x) and the target distribution π(Y|x).)
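A sketch of the estimator for system evaluation; the queries, propensities, and rewards are invented for illustration:

```python
# U_hat_IPS(pi) = (1/n) * sum_i delta_i * pi(y_i|x_i) / pi_0(y_i|x_i),
# where pi_0(y_i|x_i) is the propensity recorded in the log.
def ips_utility(log, pi):
    """log: (x, y, delta, p0) tuples with p0 = pi_0(y|x)."""
    return sum(delta * pi(y, x) / p0 for (x, y, delta, p0) in log) / len(log)

# Target policy: always show result "a" (a deterministic pi(y|x)).
pi = lambda y, x: 1.0 if y == "a" else 0.0

# Hypothetical log of 4 query impressions collected under pi_0.
log = [
    ("q1", "a", 1.0, 0.5),
    ("q1", "b", 0.0, 0.5),
    ("q2", "a", 0.0, 0.25),
    ("q2", "b", 1.0, 0.75),
]
print(ips_utility(log, pi))  # (1.0/0.5 + 0 + 0 + 0) / 4 = 0.5
```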
IPS Estimator is Unbiased

E[Û_IPS(π)]
    = Σx1,y1 ⋯ Σxn,yn π0(y1|x1) P(x1) ⋯ π0(yn|xn) P(xn) · (1/n) Σi [π(yi|xi) / π0(yi|xi)] δ(xi, yi)
    = (1/n) Σi Σxi,yi π0(yi|xi) P(xi) · [π(yi|xi) / π0(yi|xi)] δ(xi, yi)
    = (1/n) Σi Σxi,yi π(yi|xi) P(xi) δ(xi, yi)
    = (1/n) Σi U(π)
    = U(π)

The cancellation of π0(yi|xi) requires probabilistic assignment: π0(yi|xi) > 0 whenever π(yi|xi) δ(xi, yi) ≠ 0.
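The derivation can also be checked empirically: log actions under π0, re-weight by π/π0, and compare against the true U(π). The one-context, two-action setting and its rewards below are invented for this check:

```python
import random

random.seed(0)

delta = {"a": 1.0, "b": 0.0}   # reward per action (single fixed context)
pi0 = {"a": 0.8, "b": 0.2}     # logging policy pi_0(y|x)
pi = {"a": 0.3, "b": 0.7}      # target policy pi(y|x)

# True utility U(pi) = sum_y delta(y) * pi(y) = 0.3
true_u = sum(delta[y] * pi[y] for y in delta)

# Log n actions under pi_0, then form the IPS estimate for pi.
n = 100_000
logged = ["a" if random.random() < pi0["a"] else "b" for _ in range(n)]
ips = sum(pi[y] / pi0[y] * delta[y] for y in logged) / n

print(abs(ips - true_u) < 0.02)  # True: estimate concentrates around U(pi)
```

Note that π0 is probabilistic here (both actions have positive logging probability), exactly the condition the proof needs.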
News Recommender: Results
IPS eventually beats RP; its variance decays as O(1/n).
Evaluation: Outline
• Evaluating Online Metrics Offline
  – A/B Testing (on-policy)
  – Counterfactual estimation from logs (off-policy)
• Approach 1: "Model the world"
  – Estimation via reward prediction
• Approach 2: "Model the bias"
  – Counterfactual Model
  – Inverse propensity scoring (IPS) estimator