TRANSCRIPT
Lessons for External Validity from Large Scale Experimentation
SUSAN ATHEY, STANFORD
TWO-DAY COURSE ON MACHINE LEARNING AND CAUSAL INFERENCE WITH VIDEOS AND SCRIPTS: https://www.aeaweb.org/conference/cont-ed/2018-webcasts
SURVEY PAPER: https://www.nber.org/chapters/c14009.pdf
LINKS TO PAPERS: https://athey.people.stanford.edu/research
Overview
Experimentation at tech firms is ubiquitous and has become a large research field
Some problems it solves easily
◦ Variations
◦ Bandits
Some problems it solves with more complex designs
◦ Interference
◦ Staggered adoption
◦ Multiple randomization designs
Some problems require modeling or offline simulation on top of or instead of experimentation
Validity
◦ Can validate w/ ongoing experimentation
◦ Lots of small experiments
Heterogeneous treatment effects – one angle
◦ Estimate them
◦ Look for systematic differences in τ(x) across settings
Analytics
• Previous observational data
• Previous experiments
Innovation
• Algorithmic development, e.g. personalization
• Pilot experiment
Experimental Design
• Develop KPIs and validate externally
• Formulate hypotheses
• Pre-analysis planning
• Advanced experimentation (e.g. adaptive)
Analyze and Improve
• Generalizable insights
• Tactical insights
• New innovation plan
Tech Firm Experimentation
Experimentation Research at Tech Firms
My Research on Design and Analysis of Experiments
Surrogates (Athey, Chetty, Imbens, Kang, 2016, update coming shortly)
Heterogeneous treatment effects: Athey & Imbens (PNAS 2016); Wager & Athey (JASA, 2018); Athey, Tibshirani, and Wager (AOS, 2019); Friedberg, Athey, Tibshirani, and Wager (2018)
Offline policy estimation (Athey and Wager, 2017; Zhou, Athey, and Wager 2018)
Improving estimation model used in contextual bandit algorithms (Dimakopoulou, Zhou, Athey and Imbens, AAAI, 2018)
Designing experiments with staggered rollouts (Xiong, Athey, Bayati, Imbens 2019)
Testing hypotheses using adaptively collected data (Hadad, Hirschberg, Zhan, Wager, Athey, 2019)
Survey Athey & Imbens, The Econometrics of Randomized Experiments (Handbook of Experimental Economics)
A/B Testing: Typical Applications
Performance of ad copy
Factorial experiments for email campaigns
Compare two ranking algorithms
Background color for website
Change the signup process
A/B Testing: Challenges
Long term effects
• Surrogates, structural models
Many arms
• Bandits, factorial experiments
Interference (networks)
• Clustered randomization and design
Marketplace experimentation
• Staggered rollout designs, clustered randomization, structural models
Equilibrium adjustment
• Structural models, offline simulators + experiments
Adjusting for many experiments
• Experiment splitting, empirical Bayes, other recent work
Validity Challenges and Tactical Solutions
Impacts are specific to the state of the system/market at time of experiment
• Continue to experiment
• Long term holdouts (e.g. no-ads group)
Impacts are specific to the platform/company
• Likely true for many/most
• Yet, many common themes
• Reputation systems
• Supplier incentives
• Consumer marketing and promotions
• Add-on fees/delivery charges/taxes
Active Learning
Bandits:
◦ Balance exploration (learning) and exploitation (getting the best outcome for each subject)
◦ Heuristics such as Thompson Sampling
◦ Assign treatment in proportion to probability it is optimal
System interacts with its environment, taking actions or assigning treatments
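The Thompson Sampling heuristic above can be sketched in a few lines. This is a minimal illustration with a two-armed Bernoulli bandit; the helper name `thompson_step` and the success rates (0.3 and 0.6) are invented for the example, not from the talk.

```python
import numpy as np

def thompson_step(successes, failures, rng):
    """Sample once from each arm's Beta posterior and pick the argmax.

    With a uniform prior, the posterior for arm a is
    Beta(1 + successes[a], 1 + failures[a]); assigning the argmax of a
    single posterior draw selects each arm in proportion to the
    probability that it is optimal.
    """
    return int(np.argmax(rng.beta(1 + successes, 1 + failures)))

# Toy Bernoulli bandit with invented success rates 0.3 and 0.6.
rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.6])
succ = np.zeros(2)
fail = np.zeros(2)
pulls = np.zeros(2)
for _ in range(2000):
    arm = thompson_step(succ, fail, rng)
    reward = float(rng.random() < true_p[arm])
    succ[arm] += reward
    fail[arm] += 1.0 - reward
    pulls[arm] += 1
# As the posteriors sharpen, assignments concentrate on the better arm.
```

The exploration/exploitation balance shows up directly: early rounds spread pulls across arms, later rounds concentrate on the arm most likely to be best.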
Testing hypotheses with adaptively collected data
[Figure: comparison of estimators under adaptive data collection: IPW estimator, simple mean, weighted IPW]
Hadad, Hirschberg, Zhan, Wager, Athey (2019)
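The IPW idea behind this comparison can be illustrated on a toy adaptive log. Everything here (the `ipw_value` helper, the hand-picked outcomes and assignment probabilities) is an illustrative assumption; the point is only that weighting by the known assignment probabilities, rather than taking a simple mean over the rounds where an arm happened to be pulled, corrects for adaptive assignment.

```python
import numpy as np

def ipw_value(y, w, prob, arm):
    """Inverse-propensity-weighted estimate of E[Y(arm)] from a bandit log.

    prob[i] is the probability that the logging algorithm assigned `arm`
    in round i; in a bandit experiment these probabilities are recorded.
    """
    y = np.asarray(y, float)
    hit = np.asarray(w) == arm
    return float(np.sum(hit * y / np.asarray(prob, float)) / len(y))

# Hand-built adaptive log: the algorithm shifted toward arm 1 over time.
y  = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]   # observed rewards
w  = [0,   1,   1,   1,   0,   1]     # assigned arms
p1 = [0.5, 0.5, 0.8, 0.8, 0.9, 0.9]   # P(assign arm 1) in each round

est = ipw_value(y, w, p1, arm=1)
```

The variance caveat from the talk applies here too: as assignment probabilities drift toward 0 or 1, the weights 1/prob blow up, which motivates the weighted-IPW variants studied in the paper.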
Active Learning
Contextual bandits:
◦ Learn a targeted treatment assignment policy mapping from individual characteristics to treatments, π: 𝕏 → 𝕎
◦ Consider batches of subjects
◦ After each batch, estimate a model mapping characteristics to (counterfactual) outcomes for each treatment
◦ Then apply bandit heuristics
System interacts with its environment, taking actions or assigning treatments
Outcomes for different arms depend on contexts
Doubly robust contextual bandit learns the optimal treatment assignment policy
Estimation along the path is plagued by adaptivity of the assignment process; weighting creates variance as assignment probabilities converge
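The doubly robust idea can be sketched with the AIPW score that such methods average; the helper name and all numbers below are invented for illustration. The estimate combines an outcome model with propensity weighting and is consistent if either ingredient is correct.

```python
import numpy as np

def aipw_scores(y, w, prob, mu_hat, arm):
    """Doubly robust (AIPW) scores for the value of `arm`:

        Gamma_i = mu_hat_i + 1{W_i = arm} * (Y_i - mu_hat_i) / prob_i

    mu_hat are outcome-model predictions for `arm`, and prob are the
    (recorded) assignment probabilities. mean(Gamma) estimates E[Y(arm)];
    the outcome model absorbs most of the signal, so the weighted
    correction term has lower variance than pure IPW.
    """
    y = np.asarray(y, float)
    prob = np.asarray(prob, float)
    mu_hat = np.asarray(mu_hat, float)
    hit = (np.asarray(w) == arm).astype(float)
    return mu_hat + hit * (y - mu_hat) / prob

# Toy log with invented outcome-model predictions and propensities.
y   = [1.0, 0.0, 1.0, 1.0]
w   = [1,   0,   1,   1]
p1  = [0.5, 0.5, 0.8, 0.8]            # P(assign arm 1)
mu1 = [0.6, 0.6, 0.7, 0.7]            # outcome-model predictions for arm 1

value1 = float(np.mean(aipw_scores(y, w, p1, mu1, arm=1)))
```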
Heterogeneous Treatment Effect Estimation
Estimate CATE
• If sufficient data to estimate this well, address problem that other environments have different populations
Estimate CATE as a function of the "state" s (e.g. time, place)
• If CATE varies with s, or if ATE differs across s after adjusting for covariates, concern about validity with unseen s
ML and Econometrics
Supervised learning:
◦ Can evaluate in test set in model-free way
◦ MSE: Σ_i (Y_i − μ̂(X_i))²
Causal inference:
◦ Objective: unbiased/consistent parameter estimation
◦ Parameters of interest not observed in test set
◦ Can estimate objective (MSE of parameter), but requires maintained assumptions, often not model-free
◦ Infeasible MSE: Σ_i (θ_i − θ̂(X_i))²
◦ Tune for counterfactuals: distinct from tuning for fit; also, different counterfactuals select different models
◦ Theoretical assumptions, domain knowledge
◦ Sampling variation matters even in large data sets
◦ Statistical theory and inference play important roles
Causal inference vs. Supervised ML
Causal Inference Approaches
Goal: estimate the causal impact of interventions or treatment assignment policies
◦ Low dimensional intervention
◦ Desire confidence intervals
Estimands
◦ Average effect
◦ Heterogeneous effects
◦ Optimal policy
Designs that enable identification and estimation of these effects
◦ Randomized experiments
◦ Unconfoundedness
◦ "Natural" experiments (IV)
◦ Regression discontinuity
◦ Difference-in-differences
◦ Longitudinal data
◦ Randomized and natural experiments in social networks/settings w/ interference
"Program evaluation", "Treatment effect estimation"
For each Estimand × Design pair: new ML-based method, theory, confidence intervals
My own work on ML/Causal Inference
Pitfalls of Pure Prediction
"Beyond Prediction: Using Big Data for Policy Problems," Science, 2017
"The Impact of Machine Learning on Economics," The Economics of Artificial Intelligence
Stable/robust prediction and estimation
"Stable Prediction across Unknown Environments" (with Kun Kuang, Ruoxuan Xiong, Peng Cui, Bo Li), Knowledge Discovery & Data Mining, 2018
"Estimating Average Treatment Effects: Supplementary Analyses and Remaining Challenges" (with Guido Imbens, Thai Pham, and Stefan Wager), American Economic Review, May 2017
"A Measure of Robustness to Misspecification" (with Guido Imbens), American Economic Review, May 2015, 105 (5), 476-480
Surrogates
"Estimating Treatment Effects using Multiple Surrogates: The Role of the Surrogate Score and the Surrogate Index" (with Raj Chetty, Guido Imbens, Hyunseung Kang), 2016
Combining ML and Structural Models of Consumer Behavior
"Estimating Heterogeneous Consumer Preferences for Restaurants and Travel Time Using Mobile Location Data" (with David Blei, Robert Donnelly, Francisco Ruiz, and Tobias Schmidt), American Economic Review Papers and Proceedings, May 2018
"SHOPPER: A Probabilistic Model of Consumer Choice with Substitutes and Complements" (with Francisco Ruiz and David Blei), 2017
"Counterfactual Inference for Consumer Choice Across Many Product Categories" (with David Blei, Rob Donnelly, Francisco Ruiz)
Generative Adversarial Networks
"Using Wasserstein Generative Adversarial Networks for the Design of Monte Carlo Simulations" (with Guido Imbens, Jonas Metzger, Evan Munro)
Causal Panel Data Models
Athey, Bayati, Doudchenko, Khosravi, Imbens, "Matrix Completion Methods for Causal Panel Data Models," 2018
Arkhangelsky, Athey, Hirschberg, Imbens, Wager, "Synthetic Difference in Differences," 2018
Johannemann, Hadad, Athey, Wager, "Sufficient Representations for Categorical Variables"
Xiong, Athey, Bayati, Imbens, "Optimal Experimental Designs for Staggered Rollouts," 2019
Treatment Effects, Assignment Policies
"Recursive Partitioning for Heterogeneous Causal Effects" (with Guido Imbens), PNAS, 2016
"Estimation and Inference of Heterogeneous Treatment Effects using Random Forests" (with Stefan Wager), Journal of the American Statistical Association, 2018
"Generalized Random Forests" (with Julie Tibshirani and Stefan Wager), Annals of Statistics, 2019
"Efficient Policy Learning" (with Stefan Wager), 2017
"Offline Multi-Action Policy Learning: Generalization and Optimization" (with Zhengyuan Zhou and Stefan Wager)
"Local Linear Forests" (with Rina Friedberg, Julie Tibshirani, and Stefan Wager), 2018
Bandits, Contextual Bandits
"Balanced Linear Contextual Bandits" (with Maria Dimakopoulou, Zhengyuan Zhou, and Guido Imbens), Association for the Advancement of Artificial Intelligence (AAAI), 2019
Hadad, Hirschberg, Zhan, Wager, Athey, "Confidence Intervals for Policy Evaluation in Adaptive Experiments"
The potential outcomes framework
For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple (X_i, Y_i, W_i), comprised of:
◦ A feature vector X_i ∈ R^p,
◦ A response Y_i ∈ R, and
◦ A treatment assignment W_i ∈ {0, 1}.
Following the potential outcomes framework (Holland, 1986; Imbens and Rubin, 2015; Rosenbaum and Rubin, 1983; Rubin, 1974), we posit the existence of quantities Y_i(0) and Y_i(1).
◦ These correspond to the response we would have measured given that the i-th subject received treatment (W_i = 1) or no treatment (W_i = 0).
The potential outcomes framework
For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple (X_i, Y_i, W_i), comprised of:
◦ A feature vector X_i ∈ R^p,
◦ A response Y_i ∈ R, and
◦ A treatment assignment W_i ∈ {0, 1}.
Goal is to estimate the conditional average treatment effect
τ(x) = E[Y(1) − Y(0) | X = x].
NB: In experiments, we only get to see Y_i = Y_i(W_i).
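Under randomization the setup above is easy to see in simulation. A minimal sketch with invented data: generate both potential outcomes, reveal only Y_i = Y_i(W_i), and check that the difference in means recovers the (here constant) treatment effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y0 = x + rng.normal(size=n)           # potential outcome under control
y1 = y0 + 2.0                         # constant treatment effect of 2
w = rng.integers(0, 2, size=n)        # randomized assignment, W_i in {0, 1}
y = np.where(w == 1, y1, y0)          # observed response: Y_i = Y_i(W_i)

# Randomization makes the simple difference in means unbiased for the ATE.
ate_hat = y[w == 1].mean() - y[w == 0].mean()
```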
The potential outcomes framework
If we make no further assumptions, estimating τ(x) is not possible.
◦ Literature often assumes unconfoundedness (Rosenbaum and Rubin, 1983):
{Y_i(0), Y_i(1)} ⊥ W_i | X_i.
◦ When this assumption holds, methods based on matching or propensity score estimation are usually consistent.
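A minimal sketch of why this matters, with invented data: treatment probability depends on X, so the naive difference in means is biased, while inverse-propensity weighting recovers the ATE. Here the true propensities are used for clarity; in practice e(x) would be estimated, e.g. by logistic regression.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.uniform(size=n)
e = 0.2 + 0.6 * x                     # true propensity P(W = 1 | X = x)
w = rng.random(n) < e                 # confounded assignment: treated skew high-x
y = x + 1.0 * w + rng.normal(scale=0.5, size=n)   # true ATE = 1

# Naive difference in means is biased upward (treated units have higher x).
naive = y[w].mean() - y[~w].mean()

# Inverse-propensity weighting under unconfoundedness recovers the ATE.
ate_ipw = np.mean(w * y / e - (~w) * y / (1 - e))
```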
Causal Trees
Divide population into subgroups to minimize MSE in treatment effects
◦ Goal: report heterogeneity without pre-analysis plan but with valid confidence intervals
◦ Moving the goalposts: method defines estimand (treatment effects for subgroups) and generates estimates
◦ Solve over-fitting problem with sample splitting: choose subgroups in half the sample and estimate on other half
Challenges
◦ Objective function is infeasible: Σ_i (τ_i − τ̂(X_i))²
◦ Need to estimate objective to optimize for it, rather than take a simple average of squared error Σ_i (Y_i − μ̂(X_i))²
◦ Estimand is unstable
Notation for Partitions and Leaf Effect Estimates
Three samples: model selection/tree construction S^tr, estimation sample for leaf effects S^est, and a (hypothetical) test sample S^te.
Given a partition Π, τ̂(X_i; S^est, Π) is the sample average treatment effect in sample S^est for the leaf ℓ(X_i; Π) associated with covariates X_i:
τ̂(X_i; S^est, Π) = [Σ_{j ∈ S^est ∩ ℓ(X_i;Π)} W_j Y_j] / [Σ_{j ∈ S^est ∩ ℓ(X_i;Π)} W_j] − [Σ_{j ∈ S^est ∩ ℓ(X_i;Π)} (1 − W_j) Y_j] / [Σ_{j ∈ S^est ∩ ℓ(X_i;Π)} (1 − W_j)]
Estimating the MSE Criterion
Criterion for evaluating a partition Π, anticipating re-estimating leaf effects using sample splitting:
MSE(S^est, S^te) = Σ_{i ∈ S^te} (τ_i − τ̂(X_i; S^est, Π))²
= Σ_{i ∈ S^te} (τ_i² − 2 · τ_i · τ̂(X_i; S^est, Π) + τ̂²(X_i; S^est, Π))
EMSE = E_{S^est, S^te}[MSE(S^est, S^te)] = V_{S^est, X_i}[τ̂(X_i; Π, S^est)] − E_{X_i}[τ²(X_i; Π)] + E[τ_i²]
The last equality makes use of the fact that estimates are unbiased in an independent test sample. Can construct empirical estimates of each of these quantities except for the last, which does not depend on Π and thus does not affect partition selection.
Causal Tree Algorithm
◦ Divide data into tree-building S^tr and estimation S^est samples
◦ Use a greedy algorithm to recursively partition covariate space 𝕏 into a deep partition Π
◦ At each node the split is selected as the one that minimizes our estimate of EMSE over all possible binary splits
◦ Preserve minimum number of treated and control units in each child leaf
◦ Use cross-validation to select the depth d* of the partition that minimizes an estimate of MSE of treatment effects, using left-out folds as proxies for the test set
◦ Select partition Π* by pruning Π to depth d*, pruning leaves that provide the smallest improvement in goodness of fit
◦ Estimate the treatment effects in each leaf of Π* using the estimation sample S^est
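The honesty idea, splitting the sample between choosing the partition and estimating leaf effects, can be sketched as follows. This is a deliberately simplified stand-in: the "tree" is a single median split rather than the EMSE-based recursive partitioning described above, and the helper name and all numbers are invented.

```python
import numpy as np

def honest_leaf_effects(x_est, y_est, w_est, split):
    """Estimate leaf treatment effects on the held-out estimation sample.

    The partition (here a single split point) must be chosen using only
    the training half; then the difference in means within each leaf of
    S^est is an unbiased estimate of that leaf's treatment effect.
    """
    effects = {}
    for name, mask in (("left", x_est <= split), ("right", x_est > split)):
        y_leaf, w_leaf = y_est[mask], w_est[mask]
        effects[name] = y_leaf[w_leaf == 1].mean() - y_leaf[w_leaf == 0].mean()
    return effects

rng = np.random.default_rng(3)
n = 20_000
x = rng.uniform(-1, 1, size=n)
w = rng.integers(0, 2, size=n)        # randomized treatment
tau = np.where(x > 0, 2.0, 0.0)       # effect only for x > 0
y = tau * w + rng.normal(size=n)

half = n // 2                         # S^tr = first half, S^est = second half
split = np.median(x[:half])           # toy "tree": one split chosen on S^tr only
eff = honest_leaf_effects(x[half:], y[half:], w[half:], split)
```

Because the split point never sees the estimation half, the leaf estimates do not inherit the adaptive over-fitting bias that motivates sample splitting.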
Causal Trees: Search Demotion Example
Causal Trees: Adaptive versus Honest Estimates
Crucial to use sample splitting!
Low-Dimensional Representations v. Fully Nonparametric Estimation
Causal Trees
◦ Move the goalpost, but get guaranteed coverage
◦ Easy to interpret, easy to mis-interpret
◦ Can be many trees
◦ Leaves differ in many ways if covariates correlated; describe leaves by means in all covariates
Causal Forests
◦ Attempt to estimate τ(x)
◦ Can estimate partial effects
◦ In high dimensions, still can have omitted variable issues
◦ Confidence intervals lose coverage in high dimensions (bias)
Baseline method: k-NN matching
Consider the k-NN matching estimator for τ(x):
τ̂(x) = (1/k) Σ_{i ∈ S_1(x)} Y_i − (1/k) Σ_{i ∈ S_0(x)} Y_i,
where S_1(x) / S_0(x) is the set of the k nearest treated cases / controls to x. This is consistent given unconfoundedness and regularity conditions.
◦ Pro: Transparent asymptotics and good, robust performance when p is small.
◦ Con: Acute curse of dimensionality, even when p = 20 and n = 20k.
NB: Kernels have similar qualitative issues as k-NN.
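A minimal one-dimensional sketch of the k-NN matching estimator τ̂(x) above, with invented data (the function name and all constants are ours):

```python
import numpy as np

def knn_match_tau(x, y, w, x0, k):
    """k-NN matching estimate of tau(x0): mean outcome of the k nearest
    treated units minus the mean outcome of the k nearest controls."""
    d = np.abs(x - x0)
    # Adding +inf to the other group pushes it to the end of the sort,
    # so the first k indices are the k nearest units in the desired group.
    nearest_treated = np.argsort(d + np.where(w == 1, 0.0, np.inf))[:k]
    nearest_control = np.argsort(d + np.where(w == 0, 0.0, np.inf))[:k]
    return y[nearest_treated].mean() - y[nearest_control].mean()

rng = np.random.default_rng(4)
n = 20_000
x = rng.uniform(size=n)
w = rng.integers(0, 2, size=n)        # randomized treatment
y = x + x * w + rng.normal(scale=0.1, size=n)   # so tau(x) = x

tau_hat = knn_match_tau(x, y, w, x0=0.8, k=100)   # should be near 0.8
```

With p = 1 the neighborhoods are tight and the estimate is accurate; in high dimensions the same neighborhoods become wide, which is the curse of dimensionality noted above.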
Adaptive nearest neighbor matching
Random forests are a popular heuristic for adaptive nearest neighbors estimation introduced by Breiman (2001).
◦ Pro: Excellent empirical track record.
◦ Con: Often used as a black box, without statistical discussion.
There has been considerable interest in using forest-like methods for treatment effect estimation, but without formal theory.
◦ Green and Kern (2012) and Hill (2011) have considered using Bayesian forest algorithms (BART, Chipman et al., 2010).
◦ Several authors have also studied related tree-based methods: Athey and Imbens (2016), Su et al. (2009), Taddy et al. (2014), Wang and Rudin (2015), Zeileis et al. (2008), ...
Wager and Athey (2018) provide the first formal results allowing random forests to be used for provably valid asymptotic inference.
Making k-NN matching adaptive
Athey and Imbens (2016) introduce the causal tree: it defines neighborhoods for matching based on recursive partitioning (Breiman, Friedman, Olshen, and Stone, 1984), and advocates sample splitting (w/ modified splitting rule) to get assumption-free confidence intervals for treatment effects in each leaf.
Euclidean neighborhood, for k-NN matching.
Tree-based neighborhood.
From trees to random forests (Breiman, 2001)
Suppose we have a training set {(X_i, Y_i, W_i)}_{i=1}^n, a test point x, and a tree predictor
τ̂(x) = T(x; {(X_i, Y_i, W_i)}_{i=1}^n).
Random forest idea: build and average many different trees T*:
τ̂(x) = (1/B) Σ_{b=1}^B T*_b(x; {(X_i, Y_i, W_i)}_{i=1}^n).
We turn T into T* by:
◦ Bagging / subsampling the training set (Breiman, 1996); this helps smooth over discontinuities (Buhlmann and Yu, 2002).
◦ Selecting the splitting variable at each step from m out of p randomly drawn features (Amit and Geman, 1997).
Statistical inference with regression forests
Honest trees do not use the same data to select the partition (splits) and make predictions. Ex: split-sample trees, propensity trees.
Theorem. (Wager and Athey, JASA, 2018) Regression forests are asymptotically Gaussian and centered,
(μ̂_n(x) − μ(x)) / σ_n(x) ⇒ N(0, 1),   σ_n²(x) →_p 0,
given the following assumptions (+ technical conditions):
1. Honesty. Individual trees are honest.
2. Subsampling. Individual trees are built on random subsamples of size s ≍ n^β, where β_min < β < 1.
3. Continuous features. The features X_i have a density that is bounded away from 0 and ∞.
4. Lipschitz response. The conditional mean function μ(x) = E[Y | X = x] is Lipschitz continuous.
The random forest kernel
Forests induce a kernel via averaging tree-based neighborhoods. This idea was used by Meinshausen (2006) for quantile regression.
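The forest kernel can be made concrete with a toy "forest" of random one-dimensional partitions (everything below is an invented stand-in for real fitted trees): each tree gives weight 1/|L_b(x)| to the training points sharing x's leaf, and the forest prediction is the kernel-weighted average of the outcomes.

```python
import numpy as np

def forest_weights(leaf_ids, x_leaf):
    """Forest kernel alpha_i(x) = (1/B) * sum_b 1{i in L_b(x)} / |L_b(x)|.

    leaf_ids : (B, n) array, leaf of each training point in each tree
    x_leaf   : (B,) array, leaf containing the test point in each tree
    """
    B, n = leaf_ids.shape
    alpha = np.zeros(n)
    for b in range(B):
        in_leaf = leaf_ids[b] == x_leaf[b]
        alpha += in_leaf / in_leaf.sum()
    return alpha / B

# Toy "forest": each tree is a random partition of [0, 1] into 4 bins.
rng = np.random.default_rng(5)
n, B = 2_000, 50
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=n)

cuts = np.sort(rng.uniform(0.1, 0.9, size=(B, 3)), axis=1)
leaf_ids = np.stack([np.digitize(x, cuts[b]) for b in range(B)])

x0 = 0.25
x_leaf = np.array([np.digitize(x0, cuts[b]) for b in range(B)])
alpha = forest_weights(leaf_ids, x_leaf)   # weights sum to 1 by construction
pred = float(alpha @ y)                    # kernel-weighted forest prediction
```

Averaging many randomized partitions concentrates the weight on training points near x0, which is exactly the adaptive-nearest-neighbor view of forests.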
Applications in Economics and Marketing
◦ Hitsch and Misra (2017): Use causal forests to target catalog mailings. Causal forest detects significant heterogeneity, performs better than alternatives including LASSO and off-the-shelf random forest
◦ Davis and Heller (2017): Analyze heterogeneous impacts of summer jobs using causal forest
◦ Athey, Campbell, Chyn, Hastings, and White (2018): Use causal forest to show that re-employment services didn't benefit in ATE, but targeted policy can have substantial benefits
Labor Market - Reemployment Services
Athey, Campbell, Chyn, Hastings, and White (2018)
◦ Goal: Increase job skills and employment for all Rhode Islanders (efficiently)
◦ Measure impact of employment service programs
◦ Take advantage of a field experiment run by US Department of Labor to measure the impact of employment services on UI and subsequent employment
◦ From 2005-2015, states were asked to randomly send letters to UI claimants requiring employment services for continued UI receipt
◦ Basic evaluation of 4 states finds mixed evidence on decrease in UI and earnings impacts
◦ In RI we find that the nudge decreased weeks on UI by 1.4/21, no impact on earnings
◦ Measure impact using new administrative data and causal forest (Wager and Athey 2018) to understand who benefits
◦ Use causal forest estimates to simulate benefits of targeted letters
HTE in Rhode Island Re-employment Services Example
ML and Structural Models: Shopping Application
Scanner data from supermarket
◦ Product hierarchy (category, class, subclass, UPC)
◦ Prices change Tuesday evening
◦ Study 123 high-frequency categories with 1263 UPCs
◦ Multiple UPCs per category
◦ Typically purchase only one UPC per trip in category
◦ Independent price changes
◦ Not too much seasonality
◦ 333,000 shopping trips for ~2000 consumers over 20 months
Economic Goals:
◦ Optimal pricing
◦ Benefits of personalization versus simpler segmentation
Methodological Goals:
◦ Contrast off-the-shelf ML, off-the-shelf econometrics with combined models
◦ Tune and test models for counterfactual performance
Joint work with Rob Donnelly, David Blei, Fran Ruiz
Combine structural model with matrix factorization techniques and computational methods from ML
Structural Model / Matrix Factorization
Mixed logit
• User u, product i, time t
• μ_uit = ν_ui + β_u X_it + α_u p_it;  U_uit = μ_uit + ε_uit
• If ε_uit i.i.d. Type I EV, then Pr(Y_ut = i) = exp(μ_uit) / Σ_j exp(μ_ujt)
• Counterfactuals: out of stock, price changes
Matrix factorization: approximate the (Users × Items) matrix by a product of low-rank factors, (U × K) · (K × I)
Structural Model + Factorization
Mixed logit + factors
• User u, product i, time t
• μ_uit = β_u·θ_i + κ_u X_it + ρ_u·α_i p_it, with latent user and item factors replacing free user-item parameters
• Add in nesting for outside good
• Implement as two-stage estimation with inclusive value (McFadden)
• Also factorization of outside good
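The logit choice probabilities and the out-of-stock counterfactual can be sketched with invented factorized utilities. All names and numbers below are hypothetical, and this is plain multinomial logit over one category, not the full nested model estimated in the paper.

```python
import numpy as np

def choice_probs(mu):
    """Multinomial logit: Pr(Y = i) = exp(mu_i) / sum_j exp(mu_j)."""
    z = np.exp(mu - mu.max())         # subtract max for numerical stability
    return z / z.sum()

# Hypothetical factorized utilities for one user and 4 items in a category.
rng = np.random.default_rng(6)
K = 3
theta_u = rng.normal(size=K)          # user latent factors
beta = rng.normal(size=(4, K))        # item latent factors
alpha_u = -2.0                        # price sensitivity (utility falls in price)
prices = np.array([1.0, 1.2, 0.8, 1.5])

mu = beta @ theta_u + alpha_u * prices
p = choice_probs(mu)                  # purchase probabilities

# Counterfactual: item 0 out of stock; choice renormalizes over the rest.
p_oos = choice_probs(mu[1:])
```

The same machinery supports the price-change counterfactual: perturb `prices`, recompute `mu`, and compare the implied choice probabilities.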
Model Comparisons
Nested Factorization
◦ All categories estimated in single model
◦ Items substitutes within category, independent across
◦ Tuned on held-out validation set
Hierarchical Poisson Factorization (HPF)
◦ All items in single model, each item independent of others
◦ A form of matrix factorization allowing for covariates
◦ Ignores prices
◦ Scales easily
Category by category logits
◦ Mixed logit (random coefficients)
◦ Nested Logit
◦ With various controls (demographic, etc.)
Logits with HPF Factors
◦ Include user-item prediction from HPF model
Performance by Scenario (Counterfactual)
Evaluate log‐likelihood only in weeks where an item falls into specified scenarios:
• Price changed for the item this week
• Price changed for another item in the same category this week
• Another item in the same category is out of stock at least one day this week
Traditional logits improve with HPF (ML-based user-item predictions)
Validation of Structural Parameter Estimates
Compare Tues-Wed change in price to Tues-Wed change in demand, in test set
Break out results by how price-sensitive (elastic) we have estimated consumers to be