TRANSCRIPT
Lessons for External Validity from Large Scale Experimentation
SUSAN ATHEY, STANFORD
TWO-DAY COURSE ON MACHINE LEARNING AND CAUSAL INFERENCE WITH VIDEOS AND SCRIPTS: https://www.aeaweb.org/conference/cont-ed/2018-webcasts
SURVEY PAPER: https://www.nber.org/chapters/c14009.pdf
LINKS TO PAPERS: https://athey.people.stanford.edu/research
Overview
Experimentation at tech firms is ubiquitous and has become a large research field
Some problems it solves easily
◦ Variations
◦ Bandits
Some problems it solves with more complex designs
◦ Interference
◦ Staggered adoption
◦ Multiple randomization designs
Some problems require modeling or offline simulation on top of or instead of experimentation
Validity
◦ Can validate w/ ongoing experimentation
◦ Lots of small experiments
Heterogeneous treatment effects – one angle
◦ Estimate them
◦ Look for systematic differences in τ(x) across settings
Analytics
• Previous observational data
• Previous experiments
Innovation
• Algorithmic development, e.g. personalization
• Pilot experiment
Experimental Design
• Develop KPIs and validate externally
• Formulate hypotheses
• Pre-analysis planning
• Advanced experimentation (e.g. adaptive)
Analyze and Improve
• Generalizable insights
• Tactical insights
• New innovation plan
Tech Firm Experimentation
Experimentation Research at Tech Firms
My Research on Design and Analysis of Experiments
Surrogates (Athey, Chetty, Imbens, Kang, 2016, update coming shortly)
Heterogeneous treatment effects: Athey & Imbens (PNAS 2016); Wager & Athey (JASA, 2018); Athey, Tibshirani, and Wager (AOS, 2019); Friedberg, Athey, Tibshirani, and Wager (2018)
Offline policy estimation (Athey and Wager, 2017; Zhou, Athey, and Wager 2018)
Improving estimation model used in contextual bandit algorithms (Dimakopoulou, Zhou, Athey and Imbens, AAAI, 2018)
Designing experiments with staggered rollouts (Xiong, Athey, Bayati, Imbens 2019)
Testing hypotheses using adaptively collected data (Hadad, Hirschberg, Zhan, Wager, Athey, 2019)
Survey Athey & Imbens, The Econometrics of Randomized Experiments (Handbook of Experimental Economics)
A/B Testing: Typical Applications
Performance of ad copy
Factorial experiments for email campaigns
Compare two ranking algorithms
Background color for website
Change the signup process
A/B Testing: Challenges
Long term effects
• Surrogates, structural models
Many arms
• Bandits, factorial experiments
Interference (networks)
• Clustered randomization and design
Marketplace experimentation
• Staggered rollout designs, clustered randomization, structural models
Equilibrium adjustment
• Structural models, offline simulators + experiments
Adjusting for many experiments
• Experiment splitting, empirical Bayes, other recent work
Validity Challenges and Tactical Solutions
Impacts are specific to the state of the system/market at time of experiment
• Continue to experiment
• Long term holdouts (e.g. no-ads group)
Impacts are specific to the platform/company
• Likely true for many/most
• Yet, many common themes
• Reputation systems
• Supplier incentives
• Consumer marketing and promotions
• Add-on fees/delivery charges/taxes
Active Learning
Bandits:
◦ Balance exploration (learning) and exploitation (getting the best outcome for each subject)
◦ Heuristics such as Thompson Sampling
◦ Assign treatment in proportion to probability it is optimal
System interacts with its environment, taking actions or assigning treatments
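The Thompson Sampling heuristic above can be sketched in a few lines. This is a minimal illustration with a two-armed Bernoulli bandit; the helper name `thompson_step` and the success rates (0.3 and 0.6) are invented for the example, not from the talk.

```python
import numpy as np

def thompson_step(successes, failures, rng):
    """Sample once from each arm's Beta posterior and pick the argmax.

    With a uniform prior, the posterior for arm a is
    Beta(1 + successes[a], 1 + failures[a]); assigning the argmax of a
    single posterior draw selects each arm in proportion to the
    probability that it is optimal.
    """
    return int(np.argmax(rng.beta(1 + successes, 1 + failures)))

# Toy Bernoulli bandit with invented success rates 0.3 and 0.6.
rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.6])
succ = np.zeros(2)
fail = np.zeros(2)
pulls = np.zeros(2)
for _ in range(2000):
    arm = thompson_step(succ, fail, rng)
    reward = float(rng.random() < true_p[arm])
    succ[arm] += reward
    fail[arm] += 1.0 - reward
    pulls[arm] += 1
# As the posteriors sharpen, assignments concentrate on the better arm.
```

The exploration/exploitation balance shows up directly: early rounds spread pulls across arms, later rounds concentrate on the arm most likely to be best.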
Testing hypotheses with adaptively collected data
[Figure: comparison of estimators under adaptive data collection: IPW estimator, simple mean, weighted IPW]
Hadad, Hirschberg, Zhan, Wager, Athey (2019)
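The IPW idea behind this comparison can be illustrated on a toy adaptive log. Everything here (the `ipw_value` helper, the hand-picked outcomes and assignment probabilities) is an illustrative assumption; the point is only that weighting by the known assignment probabilities, rather than taking a simple mean over the rounds where an arm happened to be pulled, corrects for adaptive assignment.

```python
import numpy as np

def ipw_value(y, w, prob, arm):
    """Inverse-propensity-weighted estimate of E[Y(arm)] from a bandit log.

    prob[i] is the probability that the logging algorithm assigned `arm`
    in round i; in a bandit experiment these probabilities are recorded.
    """
    y = np.asarray(y, float)
    hit = np.asarray(w) == arm
    return float(np.sum(hit * y / np.asarray(prob, float)) / len(y))

# Hand-built adaptive log: the algorithm shifted toward arm 1 over time.
y  = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]   # observed rewards
w  = [0,   1,   1,   1,   0,   1]     # assigned arms
p1 = [0.5, 0.5, 0.8, 0.8, 0.9, 0.9]   # P(assign arm 1) in each round

est = ipw_value(y, w, p1, arm=1)
```

The variance caveat from the talk applies here too: as assignment probabilities drift toward 0 or 1, the weights 1/prob blow up, which motivates the weighted-IPW variants studied in the paper.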
Active Learning
Contextual bandits:
◦ Learn a targeted treatment assignment policy mapping from individual characteristics to treatments, π: 𝕏 → 𝕎
◦ Consider batches of subjects
◦ After each batch, estimate a model mapping characteristics to (counterfactual) outcomes for each treatment
◦ Then apply bandit heuristics
System interacts with its environment, taking actions or assigning treatments
Outcomes for different arms depend on contexts
Doubly robust contextual bandit learns the optimal treatment assignment policy
Estimation along the path is plagued by adaptivity of the assignment process; weighting creates variance as assignment probabilities converge
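The doubly robust idea can be sketched with the AIPW score that such methods average; the helper name and all numbers below are invented for illustration. The estimate combines an outcome model with propensity weighting and is consistent if either ingredient is correct.

```python
import numpy as np

def aipw_scores(y, w, prob, mu_hat, arm):
    """Doubly robust (AIPW) scores for the value of `arm`:

        Gamma_i = mu_hat_i + 1{W_i = arm} * (Y_i - mu_hat_i) / prob_i

    mu_hat are outcome-model predictions for `arm`, and prob are the
    (recorded) assignment probabilities. mean(Gamma) estimates E[Y(arm)];
    the outcome model absorbs most of the signal, so the weighted
    correction term has lower variance than pure IPW.
    """
    y = np.asarray(y, float)
    prob = np.asarray(prob, float)
    mu_hat = np.asarray(mu_hat, float)
    hit = (np.asarray(w) == arm).astype(float)
    return mu_hat + hit * (y - mu_hat) / prob

# Toy log with invented outcome-model predictions and propensities.
y   = [1.0, 0.0, 1.0, 1.0]
w   = [1,   0,   1,   1]
p1  = [0.5, 0.5, 0.8, 0.8]            # P(assign arm 1)
mu1 = [0.6, 0.6, 0.7, 0.7]            # outcome-model predictions for arm 1

value1 = float(np.mean(aipw_scores(y, w, p1, mu1, arm=1)))
```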
Heterogeneous Treatment Effect Estimation
Estimate CATE
• If sufficient data to estimate this well, address problem that other environments have different populations
Estimate CATE as a function of the "state" s (e.g. time, place)
• If CATE varies with s, or if ATE differs across s after adjusting for covariates, concern about validity with unseen s
ML and Econometrics
Supervised learning:
◦ Can evaluate in test set in model-free way
◦ MSE: Σ_i (Y_i − μ̂(X_i))²
Causal inference:
◦ Objective: unbiased/consistent parameter estimation
◦ Parameters of interest not observed in test set
◦ Can estimate objective (MSE of parameter), but requires maintained assumptions, often not model-free
◦ Infeasible MSE: Σ_i (θ_i − θ̂(X_i))²
◦ Tune for counterfactuals: distinct from tuning for fit; also, different counterfactuals select different models
◦ Theoretical assumptions, domain knowledge
◦ Sampling variation matters even in large data sets
◦ Statistical theory and inference play important roles
Causal inference vs. Supervised ML
Causal Inference Approaches
Goal: estimate the causal impact of interventions or treatment assignment policies
◦ Low dimensional intervention
◦ Desire confidence intervals
Estimands
◦ Average effect
◦ Heterogeneous effects
◦ Optimal policy
Designs that enable identification and estimation of these effects
◦ Randomized experiments
◦ Unconfoundedness
◦ "Natural" experiments (IV)
◦ Regression discontinuity
◦ Difference-in-differences
◦ Longitudinal data
◦ Randomized and natural experiments in social networks/settings w/ interference
"Program evaluation", "Treatment effect estimation"
For each Estimand × Design pair: new ML-based method, theory, confidence intervals
My own work on ML/Causal Inference
Pitfalls of Pure Prediction
"Beyond Prediction: Using Big Data for Policy Problems," Science, 2017
"The Impact of Machine Learning on Economics," The Economics of Artificial Intelligence
Stable/robust prediction and estimation
"Stable Prediction across Unknown Environments" (with Kun Kuang, Ruoxuan Xiong, Peng Cui, Bo Li), Knowledge Discovery & Data Mining, 2018
"Estimating Average Treatment Effects: Supplementary Analyses and Remaining Challenges" (with Guido Imbens, Thai Pham, and Stefan Wager), American Economic Review, May 2017
"A Measure of Robustness to Misspecification" (with Guido Imbens), American Economic Review, May 2015, 105 (5), 476-480
Surrogates
"Estimating Treatment Effects using Multiple Surrogates: The Role of the Surrogate Score and the Surrogate Index" (with Raj Chetty, Guido Imbens, Hyunseung Kang), 2016
Combining ML and Structural Models of Consumer Behavior
"Estimating Heterogeneous Consumer Preferences for Restaurants and Travel Time Using Mobile Location Data" (with David Blei, Robert Donnelly, Francisco Ruiz, and Tobias Schmidt), American Economic Review Papers and Proceedings, May 2018
"SHOPPER: A Probabilistic Model of Consumer Choice with Substitutes and Complements" (with Francisco Ruiz and David Blei), 2017
"Counterfactual Inference for Consumer Choice Across Many Product Categories" (with David Blei, Rob Donnelly, Francisco Ruiz)
Generative Adversarial Networks
"Using Wasserstein Generative Adversarial Networks for the Design of Monte Carlo Simulations" (with Guido Imbens, Jonas Metzger, Evan Munro)
Causal Panel Data Models
Athey, Bayati, Doudchenko, Khosravi, Imbens, "Matrix Completion Methods for Causal Panel Data Models," 2018
Arkhangelsky, Athey, Hirschberg, Imbens, Wager, "Synthetic Difference in Differences," 2018
Johannemann, Hadad, Athey, Wager, "Sufficient Representations for Categorical Variables"
Xiong, Athey, Bayati, Imbens, "Optimal Experimental Designs for Staggered Rollouts," 2019
Treatment Effects, Assignment Policies
"Recursive Partitioning for Heterogeneous Causal Effects" (with Guido Imbens), PNAS, 2016
"Estimation and Inference of Heterogeneous Treatment Effects using Random Forests" (with Stefan Wager), Journal of the American Statistical Association, 2018
"Generalized Random Forests" (with Julie Tibshirani and Stefan Wager), Annals of Statistics, 2019
"Efficient Policy Learning" (with Stefan Wager), 2017
"Offline Multi-Action Policy Learning: Generalization and Optimization" (with Zhengyuan Zhou and Stefan Wager)
"Local Linear Forests" (with Rina Friedberg, Julie Tibshirani, and Stefan Wager), 2018
Bandits, Contextual Bandits
"Balanced Linear Contextual Bandits" (with Maria Dimakopoulou, Zhengyuan Zhou, and Guido Imbens), Association for the Advancement of Artificial Intelligence (AAAI), 2019
Hadad, Hirschberg, Zhan, Wager, Athey, "Confidence Intervals for Policy Evaluation in Adaptive Experiments"
The potential outcomes framework
For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple (X_i, Y_i, W_i), comprised of:
◦ A feature vector X_i ∈ R^p,
◦ A response Y_i ∈ R, and
◦ A treatment assignment W_i ∈ {0, 1}.
Following the potential outcomes framework (Holland, 1986; Imbens and Rubin, 2015; Rosenbaum and Rubin, 1983; Rubin, 1974), we posit the existence of quantities Y_i(0) and Y_i(1).
◦ These correspond to the response we would have measured given that the i-th subject received treatment (W_i = 1) or no treatment (W_i = 0).
The potential outcomes framework
For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple (X_i, Y_i, W_i), comprised of:
◦ A feature vector X_i ∈ R^p,
◦ A response Y_i ∈ R, and
◦ A treatment assignment W_i ∈ {0, 1}.
Goal is to estimate the conditional average treatment effect
τ(x) = E[Y(1) − Y(0) | X = x].
NB: In experiments, we only get to see Y_i = Y_i(W_i).
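Under randomization the setup above is easy to see in simulation. A minimal sketch with invented data: generate both potential outcomes, reveal only Y_i = Y_i(W_i), and check that the difference in means recovers the (here constant) treatment effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y0 = x + rng.normal(size=n)           # potential outcome under control
y1 = y0 + 2.0                         # constant treatment effect of 2
w = rng.integers(0, 2, size=n)        # randomized assignment, W_i in {0, 1}
y = np.where(w == 1, y1, y0)          # observed response: Y_i = Y_i(W_i)

# Randomization makes the simple difference in means unbiased for the ATE.
ate_hat = y[w == 1].mean() - y[w == 0].mean()
```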
The potential outcomes framework
If we make no further assumptions, estimating τ(x) is not possible.
◦ Literature often assumes unconfoundedness (Rosenbaum and Rubin, 1983):
{Y_i(0), Y_i(1)} ⊥ W_i | X_i.
◦ When this assumption holds, methods based on matching or propensity score estimation are usually consistent.
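A minimal sketch of why this matters, with invented data: treatment probability depends on X, so the naive difference in means is biased, while inverse-propensity weighting recovers the ATE. Here the true propensities are used for clarity; in practice e(x) would be estimated, e.g. by logistic regression.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.uniform(size=n)
e = 0.2 + 0.6 * x                     # true propensity P(W = 1 | X = x)
w = rng.random(n) < e                 # confounded assignment: treated skew high-x
y = x + 1.0 * w + rng.normal(scale=0.5, size=n)   # true ATE = 1

# Naive difference in means is biased upward (treated units have higher x).
naive = y[w].mean() - y[~w].mean()

# Inverse-propensity weighting under unconfoundedness recovers the ATE.
ate_ipw = np.mean(w * y / e - (~w) * y / (1 - e))
```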
Causal Trees
Divide population into subgroups to minimize MSE in treatment effects
◦ Goal: report heterogeneity without pre-analysis plan but with valid confidence intervals
◦ Moving the goalposts: method defines estimand (treatment effects for subgroups) and generates estimates
◦ Solve over-fitting problem with sample splitting: choose subgroups in half the sample and estimate on other half
Challenges
◦ Objective function is infeasible: Σ_i (τ_i − τ̂(X_i))²
◦ Need to estimate objective to optimize for it, rather than take a simple average of squared error Σ_i (Y_i − μ̂(X_i))²
◦ Estimand is unstable
Notation for Partitions and Leaf Effect Estimates
Three samples: model selection/tree construction S^tr, estimation sample for leaf effects S^est, and a (hypothetical) test sample S^te.
Given a partition Π, τ̂(X_i; S^est, Π) is the sample average treatment effect in sample S^est for the leaf ℓ(X_i; Π) associated with covariates X_i:
τ̂(X_i; S^est, Π) = [Σ_{j ∈ S^est ∩ ℓ(X_i;Π)} W_j Y_j] / [Σ_{j ∈ S^est ∩ ℓ(X_i;Π)} W_j] − [Σ_{j ∈ S^est ∩ ℓ(X_i;Π)} (1 − W_j) Y_j] / [Σ_{j ∈ S^est ∩ ℓ(X_i;Π)} (1 − W_j)]
Estimating the MSE Criterion
Criterion for evaluating a partition Π, anticipating re-estimating leaf effects using sample splitting:
MSE(S^est, S^te) = Σ_{i ∈ S^te} (τ_i − τ̂(X_i; S^est, Π))²
= Σ_{i ∈ S^te} (τ_i² − 2 · τ_i · τ̂(X_i; S^est, Π) + τ̂²(X_i; S^est, Π))
EMSE = E_{S^est, S^te}[MSE(S^est, S^te)] = V_{S^est, X_i}[τ̂(X_i; Π, S^est)] − E_{X_i}[τ²(X_i; Π)] + E[τ_i²]
The last equality makes use of the fact that estimates are unbiased in an independent test sample. Can construct empirical estimates of each of these quantities except for the last, which does not depend on Π and thus does not affect partition selection.
Causal Tree Algorithm
◦ Divide data into tree-building S^tr and estimation S^est samples
◦ Use a greedy algorithm to recursively partition covariate space 𝕏 into a deep partition Π
◦ At each node the split is selected as the one that minimizes our estimate of EMSE over all possible binary splits
◦ Preserve minimum number of treated and control units in each child leaf
◦ Use cross-validation to select the depth d* of the partition that minimizes an estimate of MSE of treatment effects, using left-out folds as proxies for the test set
◦ Select partition Π* by pruning Π to depth d*, pruning leaves that provide the smallest improvement in goodness of fit
◦ Estimate the treatment effects in each leaf of Π* using the estimation sample S^est
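The honesty idea, splitting the sample between choosing the partition and estimating leaf effects, can be sketched as follows. This is a deliberately simplified stand-in: the "tree" is a single median split rather than the EMSE-based recursive partitioning described above, and the helper name and all numbers are invented.

```python
import numpy as np

def honest_leaf_effects(x_est, y_est, w_est, split):
    """Estimate leaf treatment effects on the held-out estimation sample.

    The partition (here a single split point) must be chosen using only
    the training half; then the difference in means within each leaf of
    S^est is an unbiased estimate of that leaf's treatment effect.
    """
    effects = {}
    for name, mask in (("left", x_est <= split), ("right", x_est > split)):
        y_leaf, w_leaf = y_est[mask], w_est[mask]
        effects[name] = y_leaf[w_leaf == 1].mean() - y_leaf[w_leaf == 0].mean()
    return effects

rng = np.random.default_rng(3)
n = 20_000
x = rng.uniform(-1, 1, size=n)
w = rng.integers(0, 2, size=n)        # randomized treatment
tau = np.where(x > 0, 2.0, 0.0)       # effect only for x > 0
y = tau * w + rng.normal(size=n)

half = n // 2                         # S^tr = first half, S^est = second half
split = np.median(x[:half])           # toy "tree": one split chosen on S^tr only
eff = honest_leaf_effects(x[half:], y[half:], w[half:], split)
```

Because the split point never sees the estimation half, the leaf estimates do not inherit the adaptive over-fitting bias that motivates sample splitting.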
Causal Trees: Search Demotion Example
Causal Trees: Adaptive versus Honest Estimates
Crucial to use sample splitting!
Low-Dimensional Representations v. Fully Nonparametric Estimation
Causal Trees
◦ Move the goalpost, but get guaranteed coverage
◦ Easy to interpret, easy to mis-interpret
◦ Can be many trees
◦ Leaves differ in many ways if covariates correlated; describe leaves by means in all covariates
Causal Forests
◦ Attempt to estimate τ(x)
◦ Can estimate partial effects
◦ In high dimensions, still can have omitted variable issues
◦ Confidence intervals lose coverage in high dimensions (bias)
Baseline method: k-NN matching
Consider the k-NN matching estimator for τ(x):
τ̂(x) = (1/k) Σ_{i ∈ S_1(x)} Y_i − (1/k) Σ_{i ∈ S_0(x)} Y_i,
where S_1(x) / S_0(x) is the set of the k nearest treated cases / controls to x. This is consistent given unconfoundedness and regularity conditions.
◦ Pro: Transparent asymptotics and good, robust performance when p is small.
◦ Con: Acute curse of dimensionality, even when p = 20 and n = 20k.
NB: Kernels have similar qualitative issues as k-NN.
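A minimal one-dimensional sketch of the k-NN matching estimator τ̂(x) above, with invented data (the function name and all constants are ours):

```python
import numpy as np

def knn_match_tau(x, y, w, x0, k):
    """k-NN matching estimate of tau(x0): mean outcome of the k nearest
    treated units minus the mean outcome of the k nearest controls."""
    d = np.abs(x - x0)
    # Adding +inf to the other group pushes it to the end of the sort,
    # so the first k indices are the k nearest units in the desired group.
    nearest_treated = np.argsort(d + np.where(w == 1, 0.0, np.inf))[:k]
    nearest_control = np.argsort(d + np.where(w == 0, 0.0, np.inf))[:k]
    return y[nearest_treated].mean() - y[nearest_control].mean()

rng = np.random.default_rng(4)
n = 20_000
x = rng.uniform(size=n)
w = rng.integers(0, 2, size=n)        # randomized treatment
y = x + x * w + rng.normal(scale=0.1, size=n)   # so tau(x) = x

tau_hat = knn_match_tau(x, y, w, x0=0.8, k=100)   # should be near 0.8
```

With p = 1 the neighborhoods are tight and the estimate is accurate; in high dimensions the same neighborhoods become wide, which is the curse of dimensionality noted above.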
Adaptive nearest neighbor matching
Random forests are a popular heuristic for adaptive nearest neighbors estimation introduced by Breiman (2001).
◦ Pro: Excellent empirical track record.
◦ Con: Often used as a black box, without statistical discussion.
There has been considerable interest in using forest-like methods for treatment effect estimation, but without formal theory.
◦ Green and Kern (2012) and Hill (2011) have considered using Bayesian forest algorithms (BART, Chipman et al., 2010).
◦ Several authors have also studied related tree-based methods: Athey and Imbens (2016), Su et al. (2009), Taddy et al. (2014), Wang and Rudin (2015), Zeileis et al. (2008), ...
Wager and Athey (2018) provide the first formal results allowing random forests to be used for provably valid asymptotic inference.
Making k-NN matching adaptive
Athey and Imbens (2016) introduce the causal tree: it defines neighborhoods for matching based on recursive partitioning (Breiman, Friedman, Olshen, and Stone, 1984), and advocates sample splitting (w/ modified splitting rule) to get assumption-free confidence intervals for treatment effects in each leaf.
Euclidean neighborhood, for k-NN matching.
Tree-based neighborhood.
From trees to random forests (Breiman, 2001)
Suppose we have a training set {(X_i, Y_i, W_i)}_{i=1}^n, a test point x, and a tree predictor
τ̂(x) = T(x; {(X_i, Y_i, W_i)}_{i=1}^n).
Random forest idea: build and average many different trees T*:
τ̂(x) = (1/B) Σ_{b=1}^B T*_b(x; {(X_i, Y_i, W_i)}_{i=1}^n).
We turn T into T* by:
◦ Bagging / subsampling the training set (Breiman, 1996); this helps smooth over discontinuities (Buhlmann and Yu, 2002).
◦ Selecting the splitting variable at each step from m out of p randomly drawn features (Amit and Geman, 1997).
Statistical inference with regression forests
Honest trees do not use the same data to select the partition (splits) and make predictions. Ex: split-sample trees, propensity trees.
Theorem. (Wager and Athey, JASA, 2018) Regression forests are asymptotically Gaussian and centered,
(μ̂_n(x) − μ(x)) / σ_n(x) ⇒ N(0, 1),   σ_n²(x) →_p 0,
given the following assumptions (+ technical conditions):
1. Honesty. Individual trees are honest.
2. Subsampling. Individual trees are built on random subsamples of size s ≍ n^β, where β_min < β < 1.
3. Continuous features. The features X_i have a density that is bounded away from 0 and ∞.
4. Lipschitz response. The conditional mean function μ(x) = E[Y | X = x] is Lipschitz continuous.
The random forest kernel
Forests induce a kernel via averaging tree-based neighborhoods. This idea was used by Meinshausen (2006) for quantile regression.
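The forest kernel can be made concrete with a toy "forest" of random one-dimensional partitions (everything below is an invented stand-in for real fitted trees): each tree gives weight 1/|L_b(x)| to the training points sharing x's leaf, and the forest prediction is the kernel-weighted average of the outcomes.

```python
import numpy as np

def forest_weights(leaf_ids, x_leaf):
    """Forest kernel alpha_i(x) = (1/B) * sum_b 1{i in L_b(x)} / |L_b(x)|.

    leaf_ids : (B, n) array, leaf of each training point in each tree
    x_leaf   : (B,) array, leaf containing the test point in each tree
    """
    B, n = leaf_ids.shape
    alpha = np.zeros(n)
    for b in range(B):
        in_leaf = leaf_ids[b] == x_leaf[b]
        alpha += in_leaf / in_leaf.sum()
    return alpha / B

# Toy "forest": each tree is a random partition of [0, 1] into 4 bins.
rng = np.random.default_rng(5)
n, B = 2_000, 50
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=n)

cuts = np.sort(rng.uniform(0.1, 0.9, size=(B, 3)), axis=1)
leaf_ids = np.stack([np.digitize(x, cuts[b]) for b in range(B)])

x0 = 0.25
x_leaf = np.array([np.digitize(x0, cuts[b]) for b in range(B)])
alpha = forest_weights(leaf_ids, x_leaf)   # weights sum to 1 by construction
pred = float(alpha @ y)                    # kernel-weighted forest prediction
```

Averaging many randomized partitions concentrates the weight on training points near x0, which is exactly the adaptive-nearest-neighbor view of forests.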
Applications in Economics and Marketing
◦ Hitsch and Misra (2017): Use causal forests to target catalog mailings. Causal forest detects significant heterogeneity, performs better than alternatives including LASSO and off-the-shelf random forest
◦ Davis and Heller (2017): Analyze heterogeneous impacts of summer jobs using causal forest
◦ Athey, Campbell, Chyn, Hastings, and White (2018): Use causal forest to show that re-employment services didn't benefit in ATE, but targeted policy can have substantial benefits
Labor Market - Reemployment Services
Athey, Campbell, Chyn, Hastings, and White (2018)
◦ Goal: Increase job skills and employment for all Rhode Islanders (efficiently)
◦ Measure impact of employment service programs
◦ Take advantage of a field experiment run by US Department of Labor to measure the impact of employment services on UI and subsequent employment
◦ From 2005-2015, states were asked to randomly send letters to UI claimants requiring employment services for continued UI receipt
◦ Basic evaluation of 4 states finds mixed evidence on decrease in UI and earnings impacts
◦ In RI we find that the nudge decreased weeks on UI by 1.4/21, no impact on earnings
◦ Measure impact using new administrative data and causal forest (Wager and Athey 2018) to understand who benefits
◦ Use causal forest estimates to simulate benefits of targeted letters
HTE in Rhode Island Re-employment Services Example
ML and Structural Models: Shopping Application
Scanner data from supermarket
◦ Product hierarchy (category, class, subclass, UPC)
◦ Prices change Tuesday evening
◦ Study 123 high-frequency categories with 1263 UPCs
◦ Multiple UPCs per category
◦ Typically purchase only one UPC per trip in category
◦ Independent price changes
◦ Not too much seasonality
◦ 333,000 shopping trips for ~2000 consumers over 20 months
Economic Goals:
◦ Optimal pricing
◦ Benefits of personalization versus simpler segmentation
Methodological Goals:
◦ Contrast off-the-shelf ML, off-the-shelf econometrics with combined models
◦ Tune and test models for counterfactual performance
Joint work with Rob Donnelly, David Blei, Fran Ruiz
Combine structural model with matrix factorization techniques and computational methods from ML
Structural Model / Matrix Factorization
Mixed logit
• User u, product i, time t
• μ_uit = ν_ui + β_u X_it + α_u p_it;  U_uit = μ_uit + ε_uit
• If ε_uit i.i.d. Type I EV, then Pr(Y_ut = i) = exp(μ_uit) / Σ_j exp(μ_ujt)
• Counterfactuals: out of stock, price changes
Matrix factorization: approximate the (Users × Items) matrix by a product of low-rank factors, (U × K) · (K × I)
Structural Model + Factorization
Mixed logit + factors
• User u, product i, time t
• μ_uit = β_u·θ_i + κ_u X_it + ρ_u·α_i p_it, with latent user and item factors replacing free user-item parameters
• Add in nesting for outside good
• Implement as two-stage estimation with inclusive value (McFadden)
• Also factorization of outside good
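The logit choice probabilities and the out-of-stock counterfactual can be sketched with invented factorized utilities. All names and numbers below are hypothetical, and this is plain multinomial logit over one category, not the full nested model estimated in the paper.

```python
import numpy as np

def choice_probs(mu):
    """Multinomial logit: Pr(Y = i) = exp(mu_i) / sum_j exp(mu_j)."""
    z = np.exp(mu - mu.max())         # subtract max for numerical stability
    return z / z.sum()

# Hypothetical factorized utilities for one user and 4 items in a category.
rng = np.random.default_rng(6)
K = 3
theta_u = rng.normal(size=K)          # user latent factors
beta = rng.normal(size=(4, K))        # item latent factors
alpha_u = -2.0                        # price sensitivity (utility falls in price)
prices = np.array([1.0, 1.2, 0.8, 1.5])

mu = beta @ theta_u + alpha_u * prices
p = choice_probs(mu)                  # purchase probabilities

# Counterfactual: item 0 out of stock; choice renormalizes over the rest.
p_oos = choice_probs(mu[1:])
```

The same machinery supports the price-change counterfactual: perturb `prices`, recompute `mu`, and compare the implied choice probabilities.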
Model Comparisons
Nested Factorization
◦ All categories estimated in single model
◦ Items substitutes within category, independent across
◦ Tuned on held-out validation set
Hierarchical Poisson Factorization (HPF)
◦ All items in single model, each item independent of others
◦ A form of matrix factorization allowing for covariates
◦ Ignores prices
◦ Scales easily
Category by category logits
◦ Mixed logit (random coefficients)
◦ Nested Logit
◦ With various controls (demographic, etc.)
Logits with HPF Factors
◦ Include user-item prediction from HPF model
Performance by Scenario (Counterfactual)
Evaluate log‐likelihood only in weeks where an item falls into specified scenarios:
• Price changed for the item this week
• Price changed for another item in the same category this week
• Another item in the same category is out of stock at least one day this week
Traditional logits improve with HPF (ML-based user-item predictions)
Validation of Structural Parameter Estimates
Compare Tues-Wed change in price to Tues-Wed change in demand, in test set
Break out results by how price-sensitive (elastic) we have estimated consumers to be