Targeted Learning with Big Data
Mark van der Laan, UC Berkeley
Center for Philosophy and History of Science
Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge
February 20, 2014
Outline
1 Targeted Learning
2 Two-stage methodology: Super Learning + TMLE
3 Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions
4 Variable importance analysis examples of Targeted Learning
5 Scaling up Targeted Learning to handle Big Data
6 Concluding remarks
Foundations of the statistical estimation problem
• Observed data: realizations of random variables with a probability distribution.
• Statistical model: the set of possible distributions for the data-generating distribution, defined by actual knowledge about the data; e.g., in an RCT, we know the probability of each subject receiving treatment.
• Statistical target parameter: a function of the data-generating distribution that we wish to learn from the data.
• Estimator: an a priori-specified algorithm that takes the observed data and returns an estimate of the target parameter, benchmarked by a dissimilarity measure (e.g., MSE) w.r.t. the target parameter.
• Inference: establish the limit distribution of the estimator and the corresponding statistical inference.
Causal inference
• Non-testable assumptions in addition to the assumptions defining the statistical model (e.g., the "no unmeasured confounders" assumption).
• Defines the causal quantity and establishes its identifiability under these assumptions.
• This process generates interesting statistical target parameters.
• Allows for a causal interpretation of the statistical parameter/estimand.
• Even if we don't believe the non-testable causal assumptions, the statistical estimation problem is still the same, and estimands still have valid statistical interpretations.
Targeted learning
• Define valid (and thus LARGE) statistical semiparametric models and interesting target parameters.
• Directly addresses the statistical challenges of high-dimensional and large data sets (Big Data).
• Avoids reliance on human art and unrealistic (e.g., parametric) models.
• Plug-in estimator based on a targeted fit of the (relevant part of the) data-generating distribution for the parameter of interest.
• Semiparametric efficient and robust.
• Statistical inference.
• Has been applied to: static or dynamic treatments, direct and indirect effects, parameters of marginal structural models (MSMs), variable importance analysis in genomics, longitudinal/repeated-measures data with time-dependent confounding, censoring/missingness, case-control studies, RCTs, networks.
Targeted Learning Book
Springer Series in Statistics, van der Laan & Rose
targetedlearningbook.com
• The first chapter, by R.J.C.M. Starmans, "Models, Inference, and Truth", provides a historical and philosophical perspective on Targeted Learning.
• It discusses the erosion of the notions of model and truth throughout history and the resulting lack of a unified approach in statistics.
• It stresses the importance of a reconciliation between machine learning and statistical inference, as provided by Targeted Learning.
Two-stage methodology
• Super learning (SL): van der Laan et al. (2007), Polley et al. (2012), Polley and van der Laan (2012)
  • Uses a library of candidate estimators (e.g., multiple parametric models, machine learning algorithms like neural networks, random forests, etc.)
  • Builds a data-adaptive weighted combination of the estimators using cross-validation
• Targeted maximum likelihood estimation (TMLE): van der Laan and Rubin (2006)
  • Updates an initial estimate, often a Super Learner fit, to remove bias for the parameter of interest
  • Calculates the final parameter estimate from the updated fit of the data-generating distribution
Super learning
• No need to choose a priori a particular parametric model or machine learning algorithm for a particular problem.
• Allows one to combine many data-adaptive estimators into one improved estimator.
• Grounded by oracle results for loss-function-based cross-validation (van der Laan and Dudoit (2003), van der Vaart et al. (2006)); the loss function needs to be bounded.
• Performs asymptotically as well as the best (oracle) weighted combination of the candidates, or achieves the parametric rate of convergence when one of the candidates is a correctly specified parametric model. (A sketch of the weighting step follows this list.)
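The cross-validated weighting step can be made concrete with a short sketch. The following is a minimal illustration, not the SuperLearner R package itself: it computes V-fold cross-validated predictions for a small candidate library and derives non-negative weights minimizing cross-validated squared-error loss. The two-learner library, V = 10, and the NNLS-then-normalize convention are illustrative assumptions.

```python
# Minimal super learner sketch (illustrative; not the SuperLearner R package).
import numpy as np
from scipy.optimize import nnls
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

def super_learner_weights(X, y, library, V=10, seed=1):
    """Weights for a convex combination of learners, chosen by V-fold CV."""
    Z = np.zeros((len(y), len(library)))   # cross-validated predictions
    for train, test in KFold(n_splits=V, shuffle=True, random_state=seed).split(X):
        for j, make_learner in enumerate(library):
            Z[test, j] = make_learner().fit(X[train], y[train]).predict(X[test])
    w, _ = nnls(Z, y)                      # minimize CV squared error, w >= 0
    return w / w.sum()                     # normalize to a convex combination

# Hypothetical two-algorithm library; each entry builds a fresh learner.
library = [lambda: LinearRegression(),
           lambda: RandomForestRegressor(n_estimators=200, random_state=1)]
# w = super_learner_weights(X, y, library)
# The final predictor refits each learner on all data and averages with w.
```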
Super learning
Figure: Relative cross-validated mean squared error (compared to main-terms least squares regression)
TMLE algorithm
TMLE algorithm: Formal Template
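The two slide titles above refer to figures showing the TMLE algorithm schematically. To fix ideas, here is a minimal sketch of its best-known instance: the TMLE of the average treatment effect E(Y^1) − E(Y^0) with binary treatment A, binary outcome Y, and baseline covariates W. The initial regressions would normally be Super Learner fits; the plain logistic regressions, the truncation level 0.01, and the array-based interface are simplifying assumptions of this sketch.

```python
# Minimal TMLE sketch for the average treatment effect (illustrative).
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

def tmle_ate(Y, A, W):
    """TMLE of E[Y^1] - E[Y^0]; Y, A binary (n,) arrays, W an (n, p) array."""
    n = len(Y)
    # 1. Initial fit of Qbar(A, W) = E[Y | A, W] (stand-in for a Super Learner)
    Qfit = LogisticRegression().fit(np.column_stack([A, W]), Y)

    def Qbar(a):
        Xa = np.column_stack([np.full(n, a), W])
        return np.clip(Qfit.predict_proba(Xa)[:, 1], 0.01, 0.99)

    Q_AW = np.where(A == 1, Qbar(1), Qbar(0))
    # 2. Treatment mechanism g(W) = P(A = 1 | W), truncated for stability
    g = np.clip(LogisticRegression().fit(W, A).predict_proba(W)[:, 1], 0.01, 0.99)
    # 3. Clever covariate and logistic fluctuation (the targeting step)
    H = A / g - (1 - A) / (1 - g)
    eps = sm.GLM(Y, H.reshape(-1, 1), family=sm.families.Binomial(),
                 offset=logit(Q_AW)).fit().params[0]
    # 4. Targeted update of the initial fit and plug-in estimate
    Q1 = expit(logit(Qbar(1)) + eps / g)
    Q0 = expit(logit(Qbar(0)) - eps / (1 - g))
    psi = np.mean(Q1 - Q0)
    # 5. Influence-curve-based standard error for Wald-type inference
    IC = H * (Y - expit(logit(Q_AW) + eps * H)) + Q1 - Q0 - psi
    return psi, IC.std() / np.sqrt(n)
```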
General Longitudinal Data Structure
We observe n i.i.d. copies of a longitudinal data structure
O = (L(0), A(0), ..., L(K), A(K), Y = L(K+1)),

where A(t) denotes a discrete-valued intervention node, L(t) is an intermediate covariate realized after A(t−1) and before A(t), t = 0, ..., K, and Y is a final outcome of interest.
For example, A(t) = (A1(t), A2(t)) could be a vector of two binary indicators of censoring and treatment, respectively.
Likelihood and Statistical Model
The probability distribution P_0 of O can be factorized according to the time-ordering as

  P_0(O) = ∏_{t=0}^{K+1} P_0(L(t) | Pa(L(t))) ∏_{t=0}^{K} P_0(A(t) | Pa(A(t)))
         ≡ ∏_{t=0}^{K+1} Q_{0,L(t)}(O) ∏_{t=0}^{K} g_{0,A(t)}(O)
         ≡ Q_0 g_0,

where Pa(L(t)) ≡ (L(t−1), A(t−1)) and Pa(A(t)) ≡ (L(t), A(t−1)) denote the parents of L(t) and A(t) in the time-ordered sequence, respectively. The g_0-factor represents the intervention mechanism: e.g., the treatment and right-censoring mechanism.
Statistical Model: We make no assumptions on Q_0, but could make assumptions on g_0.
Statistical Target Parameter: G-computation Formula for Post-Intervention Distribution
• Let

  P^d(l) = ∏_{t=0}^{K+1} Q^d_{L(t)}(l(t)),   (1)

  where Q^d_{L(t)}(l(t)) = Q_{L(t)}(l(t) | l(t−1), A(t−1) = d(t−1)).
• Let L^d = (L(0), L^d(1), ..., Y^d = L^d(K+1)) denote the random variable with probability distribution P^d.
• This is the so-called G-computation formula for the post-intervention distribution corresponding with the dynamic intervention d.
Example: When to switch a failing drug regimen in HIV-infected patients
• Observed data on a unit:

  O = (L(0), A(0), L(1), A(1), ..., L(K), A(K), A2(K)Y),

  where L(0) is the baseline history, A(t) = (A1(t), A2(t)), A1(t) is the indicator of switching the drug regimen, A2(t) is the indicator of being right-censored, t = 0, ..., K, and Y is the indicator of observing death by time K+1.
• Define intervention nodes A(0), ..., A(K) and dynamic intervention rules d_θ that switch when the CD4 count drops below θ and enforce no censoring.
• Our target parameter is defined as the projection of (E(Y^{d_θ}(t)) : t, θ) onto a working model m_β with parameter β.
A Sequential Regression G-computation Formula (Bang, Robins, 2005)
• By the iterative conditional expectation rule (tower rule), we have

  E_{P^d} Y^d = E( ··· E( E(Y^d | L^d(K)) | L^d(K−1)) ··· | L(0)).

• In addition, the conditional expectation given L^d(K) is equivalent with conditioning on (L(K), A(K−1) = d(K−1)).
• In this manner, one can represent E_{P^d} Y^d as an iterated conditional expectation: first take the conditional expectation given L^d(K) (equivalent with (L(K), A(K−1))), then the conditional expectation given L^d(K−1) (equivalent with (L(K−1), A(K−2))), and so on, until the conditional expectation given L(0); finally, take the mean over L(0). (A code sketch follows this list.)
• We developed a targeted plug-in estimator/TMLE of general summary measures of "dose-response" curves (E Y^d : d ∈ D) (Petersen et al., 2013; van der Laan and Gruber, 2012).
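As referenced above, here is a minimal sketch of the plain (untargeted) sequential-regression G-computation estimator for a static rule. A dynamic rule would compute each intervened treatment column from the accumulated history, and the TMLE adds a targeting/fluctuation step after each regression. The list-of-arrays interface, the static rule abar, and the gradient-boosting regressions are illustrative assumptions.

```python
# Sequential-regression (iterated conditional expectation) G-computation sketch.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def ice_gcomp(L, A, Y, abar):
    """Iterated-regression G-computation estimate of E[Y^abar].

    L: list of K+1 covariate arrays, each (n, p_t); A: list of K+1 binary
    treatment arrays, each (n,); Y: outcome (n,); abar: static rule, length K+1.
    """
    K = len(A) - 1
    Qnext = Y.astype(float)                         # Qbar_{K+1} = Y
    for t in range(K, -1, -1):
        Lmat = np.column_stack(L[: t + 1])          # observed L(0), ..., L(t)
        Amat = np.column_stack(A[: t + 1])          # observed A(0), ..., A(t)
        fit = GradientBoostingRegressor().fit(np.hstack([Lmat, Amat]), Qnext)
        # evaluate at the intervened treatments A(0..t) = abar(0..t)
        Ad = np.tile(abar[: t + 1], (len(Y), 1))
        Qnext = fit.predict(np.hstack([Lmat, Ad]))  # Qbar_t, a function of history
    return float(Qnext.mean())                      # final mean over L(0)
```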
Variable Importance: Problem Description (Diaz, Hubbard)
• Around 800 patients who entered the emergency room with severe trauma.
• About 80 physiological and clinical variables were measured at 0, 6, 12, 24, 48, and 72 hours after admission.
• The objective is to predict the most likely medical outcome of a patient (e.g., survival) and to provide an ordered list of the covariates that drive this prediction (variable importance).
• This will help doctors decide which variables are relevant at each time point.
• Variables are subject to missingness.
• Variables are continuous; the variable importance parameter is

  Ψ(P_0) ≡ E_0{ E_0(Y | A + δ, W) − E_0(Y | A, W) }

  for a user-given value δ. (A plug-in sketch follows this list.)
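A simple plug-in version of this parameter can be sketched in a few lines. The actual analyses use Super Learner fits and a TMLE update with influence-curve-based inference, so the single random-forest regression below is only an illustrative stand-in.

```python
# Plug-in sketch of the shift variable-importance parameter (illustrative).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def shift_vim_plugin(Y, A, W, delta):
    """Plug-in estimate of E{ E(Y | A + delta, W) - E(Y | A, W) }."""
    X = np.column_stack([A, W])
    fit = RandomForestRegressor(n_estimators=500, random_state=1).fit(X, Y)
    X_shift = X.copy()
    X_shift[:, 0] += delta                 # shift the continuous exposure
    return float(np.mean(fit.predict(X_shift) - fit.predict(X)))
```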
Variable Importance: Results
[Figure: Effect sizes and significance codes for each measured variable, in panels for Hours 00, 06, 12, 24, 48, and 72; effect sizes plotted on a scale from −0.04 to 0.04.]
TMLE with Genomic Data
• 570 case-control samples on spina bifida.
• We want to identify genes associated with spina bifida from 115 SNPs.
• In the original paper (Shaw et al., 2009), a univariate analysis was performed.
• The original analysis missed rs1001761 and rs2853532 in the TYMS gene because they are closely linked, with counteracting effects on spina bifida.
• With TMLE, the signals from these two SNPs were recovered.
• In the TMLE, Q_0 was obtained from LASSO, and g(W) was obtained from a simple regression of each SNP on its two flanking markers, to account for confounding effects of neighboring markers.
TMLE p-values for SNPs
[Figure: −log10 p-values from the TMLE analysis for the 115 SNPs, grouped by gene (MTHFR, MTR, MTHFD2, MTRR, BHMT2, BHMT, DHFR, NOS3, FOLR1, FOLR2, MTHFD1, TYMS, CBS, RFC1).]
Targeted Learning of Data-Dependent Target Parameters (van der Laan, Hubbard, 2013)
• Define algorithms that map the data into a target parameter, Ψ_{P_n} : M → R^d, thereby generalizing the notion of a target parameter.
• Develop methods to obtain statistical inference for Ψ_{P_n}(P_0) or for

  (1/V) ∑_{v=1}^{V} Ψ_{P_{n,v}}(P_0),

  where P_{n,v} is the empirical distribution of the parameter-generating sample corresponding with the v-th sample split. We have developed a cross-validated TMLE for the latter data-dependent target parameter, without any additional conditions. (A toy sketch follows this slide.)
• In particular, this generalized framework allows us to generate a subset of target parameters among a massive set of candidate target parameters, while only having to deal with multiple testing for the data-adaptively selected set of target parameters.
• This is thus much more powerful than regular multiple testing for a fixed set of null hypotheses.
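The mechanics of the sample-split parameter above can be illustrated with a deliberately toy sketch: on each parameter-generating sample the target itself is chosen from the data (here, hypothetically, the mean of the covariate most correlated with the outcome), then estimated on the held-out estimation sample and averaged over the V splits. The target choice is purely an assumption to make the construction concrete; the actual proposal uses CV-TMLE for inference.

```python
# Toy sketch of a sample-split, data-dependent target parameter.
import numpy as np
from sklearn.model_selection import KFold

def split_specific_estimate(X, Y, V=5, seed=1):
    """Average over splits of a data-adaptively chosen target parameter."""
    ests = []
    for gen, est in KFold(n_splits=V, shuffle=True, random_state=seed).split(X):
        # parameter-generating sample P_{n,v}: choose the target from the data
        cors = [abs(np.corrcoef(X[gen, j], Y[gen])[0, 1])
                for j in range(X.shape[1])]
        j_star = int(np.argmax(cors))
        # estimation sample: estimate the selected target (here a toy mean)
        ests.append(X[est, j_star].mean())
    return float(np.mean(ests))   # estimates (1/V) sum_v Psi_{P_{n,v}}(P_0)
```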
Online Targeted MLE: Ongoing work
• Order the data if not ordered naturally.
• Partition into subsets numbered from 1 to K.
• Initiate the initial estimator and TMLE based on the first subset.
• Update the initial estimator based on the second subset, and update the TMLE based on the second subset.
• Iterate until the last subset.
• The final estimator is the average of all stage-specific TMLEs.
• In this manner, for each subset the number of calculations is bounded by the number of observations in the subset, and total computation time increases linearly in the number of subsets.
• One can still prove asymptotic efficiency of this online TMLE.
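A compressed sketch of this recipe, reusing the earlier tmle_ate sketch as the per-chunk estimator: the chunk-wise refitting below stands in for the warm-started updates of the initial estimator described above, which is a simplifying assumption of the sketch.

```python
# Online TMLE sketch: average of chunk-specific TMLEs over an ordered partition.
import numpy as np

def online_tmle(Y, A, W, n_chunks):
    """Average of per-chunk TMLEs of the ATE (tmle_ate defined earlier)."""
    chunks = zip(np.array_split(Y, n_chunks),
                 np.array_split(A, n_chunks),
                 np.array_split(W, n_chunks))
    psis = [tmle_ate(Yc, Ac, Wc)[0] for Yc, Ac, Wc in chunks]
    # per-chunk work is bounded by the chunk size, so total computation time
    # grows linearly in the number of chunks
    return float(np.mean(psis))
```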
Concluding remarks
• Sound foundations of statistics are in place (data as realizations of random variables, a statistical model, a target parameter, inference based on a limit distribution), but these have eroded over many decades.
• However, MLE had to be revamped into TMLE to deal with large models.
• Big Data asks for the development of fast TMLEs that do not give up on statistical properties: e.g., online TMLE.
• Big Data asks for research teams that consist of top statisticians and computer scientists, beyond subject-matter experts.
• The philosophical soundness of proposed methods is hugely important and should become the norm.
References

H. Bang and J. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962–972, 2005.

M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. van der Laan. Targeted minimum loss based estimation of marginal structural working models. Journal of Causal Inference, submitted, 2013.

E. Polley and M. van der Laan. SuperLearner: Super Learner Prediction, 2012. URL http://CRAN.R-project.org/package=SuperLearner. R package version 2.0-6.

E. Polley, S. Rose, and M. van der Laan. Super learning. Chapter 3 in Targeted Learning, van der Laan and Rose (2012), 2012.

M. Schnitzer, E. Moodie, M. van der Laan, R. Platt, and M. Klein. Marginal structural modeling of a survival outcome with targeted maximum likelihood estimation. Biometrics, to appear, 2013.

M. van der Laan and S. Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. UC Berkeley Division of Biostatistics Working Paper Series, paper 130, 2003.

M. J. van der Laan and S. Gruber. Targeted minimum loss based estimation of causal effects of multiple time point interventions. The International Journal of Biostatistics, 8(1), Jan. 2012. doi: 10.1515/1557-4679.1370.

M. J. van der Laan and D. Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1), Jan. 2006. doi: 10.2202/1557-4679.1043.

M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1), Jan. 2007. doi: 10.2202/1544-6115.1309.

A. van der Vaart, S. Dudoit, and M. van der Laan. Oracle inequalities for multi-fold cross-validation. Statistics & Decisions, 24(3):351–371, 2006.