Targeted Learning with Big Data
Mark van der Laan, UC Berkeley
Center for Philosophy and History of Science
Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge
February 20, 2014
Outline
1 Targeted Learning
2 Two-stage methodology: Super Learning + TMLE
3 Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions
4 Variable importance analysis examples of Targeted Learning
5 Scaling up Targeted Learning to handle Big Data
6 Concluding remarks
Foundations of the statistical estimation problem
• Observed data: realizations of random variables with a probability distribution.
• Statistical model: the set of possible distributions for the data-generating distribution, defined by actual knowledge about the data; e.g., in an RCT, we know the probability of each subject receiving treatment.
• Statistical target parameter: a function of the data-generating distribution that we wish to learn from the data.
• Estimator: an a priori-specified algorithm that takes the observed data and returns an estimate of the target parameter, benchmarked by a dissimilarity measure (e.g., MSE) w.r.t. the target parameter.
• Inference: establish the limit distribution of the estimator and the corresponding statistical inference.
Causal inference
• Non-testable assumptions in addition to the assumptions defining the statistical model (e.g., the "no unmeasured confounders" assumption).
• Defines the causal quantity and establishes its identifiability under these assumptions.
• This process generates interesting statistical target parameters.
• Allows for a causal interpretation of the statistical parameter/estimand.
• Even if we don't believe the non-testable causal assumptions, the statistical estimation problem is still the same, and estimands still have valid statistical interpretations.
Targeted learning
• Define valid (and thus LARGE) statistical semiparametric models and interesting target parameters.
• Directly addresses the statistical challenges of high-dimensional and large data sets (Big Data).
• Avoids reliance on human art and unrealistic (e.g., parametric) models.
• Plug-in estimator based on a targeted fit of the (relevant part of the) data-generating distribution for the parameter of interest.
• Semiparametric efficient and robust.
• Statistical inference.
• Has been applied to: static or dynamic treatments, direct and indirect effects, parameters of marginal structural models (MSMs), variable importance analysis in genomics, longitudinal/repeated-measures data with time-dependent confounding, censoring/missingness, case-control studies, RCTs, networks.
Targeted Learning Book
Springer Series in Statistics, van der Laan & Rose
targetedlearningbook.com
• The first chapter, by R.J.C.M. Starmans, "Models, Inference, and Truth", provides a historical and philosophical perspective on Targeted Learning.
• It discusses the erosion of the notions of model and truth throughout history and the resulting lack of a unified approach in statistics.
• It stresses the importance of a reconciliation between machine learning and statistical inference, as provided by Targeted Learning.
Two-stage methodology
• Super learning (SL): van der Laan et al. (2007), Polley et al. (2012), Polley and van der Laan (2012)
  • Uses a library of candidate estimators (e.g., multiple parametric models, machine learning algorithms like neural networks, random forests, etc.)
  • Builds a data-adaptive weighted combination of the estimators using cross-validation
• Targeted maximum likelihood estimation (TMLE): van der Laan and Rubin (2006)
  • Updates an initial estimate, often a Super Learner fit, to remove bias for the parameter of interest
  • Calculates the final parameter estimate from the updated fit of the data-generating distribution
Super learning
• No need to choose a priori a particular parametric model or machine learning algorithm for a particular problem.
• Allows one to combine many data-adaptive estimators into one improved estimator.
• Grounded by oracle results for loss-function-based cross-validation (van der Laan and Dudoit (2003), van der Vaart et al. (2006)); the loss function needs to be bounded.
• Performs asymptotically as well as the best (oracle) weighted combination of the candidates, or achieves the parametric rate of convergence when one of the candidates is a correctly specified parametric model. (A sketch of the weighting step follows this list.)
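The cross-validated weighting step can be made concrete with a short sketch. The following is a minimal illustration, not the SuperLearner R package itself: it computes V-fold cross-validated predictions for a small candidate library and derives non-negative weights minimizing cross-validated squared-error loss. The two-learner library, V = 10, and the NNLS-then-normalize convention are illustrative assumptions.

```python
# Minimal super learner sketch (illustrative; not the SuperLearner R package).
import numpy as np
from scipy.optimize import nnls
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

def super_learner_weights(X, y, library, V=10, seed=1):
    """Weights for a convex combination of learners, chosen by V-fold CV."""
    Z = np.zeros((len(y), len(library)))   # cross-validated predictions
    for train, test in KFold(n_splits=V, shuffle=True, random_state=seed).split(X):
        for j, make_learner in enumerate(library):
            Z[test, j] = make_learner().fit(X[train], y[train]).predict(X[test])
    w, _ = nnls(Z, y)                      # minimize CV squared error, w >= 0
    return w / w.sum()                     # normalize to a convex combination

# Hypothetical two-algorithm library; each entry builds a fresh learner.
library = [lambda: LinearRegression(),
           lambda: RandomForestRegressor(n_estimators=200, random_state=1)]
# w = super_learner_weights(X, y, library)
# The final predictor refits each learner on all data and averages with w.
```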
Super learning
Figure: Relative cross-validated mean squared error (compared to main-terms least squares regression)
TMLE algorithm
TMLE algorithm: Formal Template
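The two slide titles above refer to figures showing the TMLE algorithm schematically. To fix ideas, here is a minimal sketch of its best-known instance: the TMLE of the average treatment effect E(Y^1) − E(Y^0) with binary treatment A, binary outcome Y, and baseline covariates W. The initial regressions would normally be Super Learner fits; the plain logistic regressions, the truncation level 0.01, and the array-based interface are simplifying assumptions of this sketch.

```python
# Minimal TMLE sketch for the average treatment effect (illustrative).
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

def tmle_ate(Y, A, W):
    """TMLE of E[Y^1] - E[Y^0]; Y, A binary (n,) arrays, W an (n, p) array."""
    n = len(Y)
    # 1. Initial fit of Qbar(A, W) = E[Y | A, W] (stand-in for a Super Learner)
    Qfit = LogisticRegression().fit(np.column_stack([A, W]), Y)

    def Qbar(a):
        Xa = np.column_stack([np.full(n, a), W])
        return np.clip(Qfit.predict_proba(Xa)[:, 1], 0.01, 0.99)

    Q_AW = np.where(A == 1, Qbar(1), Qbar(0))
    # 2. Treatment mechanism g(W) = P(A = 1 | W), truncated for stability
    g = np.clip(LogisticRegression().fit(W, A).predict_proba(W)[:, 1], 0.01, 0.99)
    # 3. Clever covariate and logistic fluctuation (the targeting step)
    H = A / g - (1 - A) / (1 - g)
    eps = sm.GLM(Y, H.reshape(-1, 1), family=sm.families.Binomial(),
                 offset=logit(Q_AW)).fit().params[0]
    # 4. Targeted update of the initial fit and plug-in estimate
    Q1 = expit(logit(Qbar(1)) + eps / g)
    Q0 = expit(logit(Qbar(0)) - eps / (1 - g))
    psi = np.mean(Q1 - Q0)
    # 5. Influence-curve-based standard error for Wald-type inference
    IC = H * (Y - expit(logit(Q_AW) + eps * H)) + Q1 - Q0 - psi
    return psi, IC.std() / np.sqrt(n)
```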
General Longitudinal Data Structure
We observe n i.i.d. copies of a longitudinal data structure
O = (L(0), A(0), ..., L(K), A(K), Y = L(K+1)),

where A(t) denotes a discrete-valued intervention node, L(t) is an intermediate covariate realized after A(t−1) and before A(t), t = 0, ..., K, and Y is a final outcome of interest.
For example, A(t) = (A1(t), A2(t)) could be a vector of two binary indicators of censoring and treatment, respectively.
Likelihood and Statistical Model
The probability distribution P_0 of O can be factorized according to the time-ordering as

  P_0(O) = ∏_{t=0}^{K+1} P_0(L(t) | Pa(L(t))) ∏_{t=0}^{K} P_0(A(t) | Pa(A(t)))
         ≡ ∏_{t=0}^{K+1} Q_{0,L(t)}(O) ∏_{t=0}^{K} g_{0,A(t)}(O)
         ≡ Q_0 g_0,

where Pa(L(t)) ≡ (L(t−1), A(t−1)) and Pa(A(t)) ≡ (L(t), A(t−1)) denote the parents of L(t) and A(t) in the time-ordered sequence, respectively. The g_0-factor represents the intervention mechanism: e.g., the treatment and right-censoring mechanism.
Statistical Model: We make no assumptions on Q_0, but could make assumptions on g_0.
Statistical Target Parameter: G-computation Formula for Post-Intervention Distribution
• Let

  P^d(l) = ∏_{t=0}^{K+1} Q^d_{L(t)}(l(t)),   (1)

  where Q^d_{L(t)}(l(t)) = Q_{L(t)}(l(t) | l(t−1), A(t−1) = d(t−1)).
• Let L^d = (L(0), L^d(1), ..., Y^d = L^d(K+1)) denote the random variable with probability distribution P^d.
• This is the so-called G-computation formula for the post-intervention distribution corresponding with the dynamic intervention d.
Example: When to switch a failing drug regimen in HIV-infected patients
• Observed data on a unit:

  O = (L(0), A(0), L(1), A(1), ..., L(K), A(K), A2(K)Y),

  where L(0) is the baseline history, A(t) = (A1(t), A2(t)), A1(t) is the indicator of switching the drug regimen, A2(t) is the indicator of being right-censored, t = 0, ..., K, and Y is the indicator of observing death by time K+1.
• Define intervention nodes A(0), ..., A(K) and dynamic intervention rules d_θ that switch when the CD4 count drops below θ and enforce no censoring.
• Our target parameter is defined as the projection of (E(Y^{d_θ}(t)) : t, θ) onto a working model m_β with parameter β.
A Sequential Regression G-computation Formula (Bang, Robins, 2005)
• By the iterative conditional expectation rule (tower rule), we have

  E_{P^d} Y^d = E( ··· E( E(Y^d | L^d(K)) | L^d(K−1)) ··· | L(0)).

• In addition, the conditional expectation given L^d(K) is equivalent with conditioning on (L(K), A(K−1) = d(K−1)).
• In this manner, one can represent E_{P^d} Y^d as an iterated conditional expectation: first take the conditional expectation given L^d(K) (equivalent with (L(K), A(K−1))), then the conditional expectation given L^d(K−1) (equivalent with (L(K−1), A(K−2))), and so on, until the conditional expectation given L(0); finally, take the mean over L(0). (A code sketch follows this list.)
• We developed a targeted plug-in estimator/TMLE of general summary measures of "dose-response" curves (E Y^d : d ∈ D) (Petersen et al., 2013; van der Laan and Gruber, 2012).
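As referenced above, here is a minimal sketch of the plain (untargeted) sequential-regression G-computation estimator for a static rule. A dynamic rule would compute each intervened treatment column from the accumulated history, and the TMLE adds a targeting/fluctuation step after each regression. The list-of-arrays interface, the static rule abar, and the gradient-boosting regressions are illustrative assumptions.

```python
# Sequential-regression (iterated conditional expectation) G-computation sketch.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def ice_gcomp(L, A, Y, abar):
    """Iterated-regression G-computation estimate of E[Y^abar].

    L: list of K+1 covariate arrays, each (n, p_t); A: list of K+1 binary
    treatment arrays, each (n,); Y: outcome (n,); abar: static rule, length K+1.
    """
    K = len(A) - 1
    Qnext = Y.astype(float)                         # Qbar_{K+1} = Y
    for t in range(K, -1, -1):
        Lmat = np.column_stack(L[: t + 1])          # observed L(0), ..., L(t)
        Amat = np.column_stack(A[: t + 1])          # observed A(0), ..., A(t)
        fit = GradientBoostingRegressor().fit(np.hstack([Lmat, Amat]), Qnext)
        # evaluate at the intervened treatments A(0..t) = abar(0..t)
        Ad = np.tile(abar[: t + 1], (len(Y), 1))
        Qnext = fit.predict(np.hstack([Lmat, Ad]))  # Qbar_t, a function of history
    return float(Qnext.mean())                      # final mean over L(0)
```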
Variable Importance: Problem Description (Diaz, Hubbard)
• Around 800 patients who entered the emergency room with severe trauma.
• About 80 physiological and clinical variables were measured at 0, 6, 12, 24, 48, and 72 hours after admission.
• The objective is to predict the most likely medical outcome of a patient (e.g., survival) and to provide an ordered list of the covariates that drive this prediction (variable importance).
• This will help doctors decide which variables are relevant at each time point.
• Variables are subject to missingness.
• Variables are continuous; the variable importance parameter is

  Ψ(P_0) ≡ E_0{ E_0(Y | A + δ, W) − E_0(Y | A, W) }

  for a user-given value δ. (A plug-in sketch follows this list.)
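A simple plug-in version of this parameter can be sketched in a few lines. The actual analyses use Super Learner fits and a TMLE update with influence-curve-based inference, so the single random-forest regression below is only an illustrative stand-in.

```python
# Plug-in sketch of the shift variable-importance parameter (illustrative).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def shift_vim_plugin(Y, A, W, delta):
    """Plug-in estimate of E{ E(Y | A + delta, W) - E(Y | A, W) }."""
    X = np.column_stack([A, W])
    fit = RandomForestRegressor(n_estimators=500, random_state=1).fit(X, Y)
    X_shift = X.copy()
    X_shift[:, 0] += delta                 # shift the continuous exposure
    return float(np.mean(fit.predict(X_shift) - fit.predict(X)))
```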
Variable Importance: Results
[Figure: Effect sizes and significance codes for each measured variable, in panels for Hours 00, 06, 12, 24, 48, and 72; effect sizes plotted on a scale from −0.04 to 0.04.]
TMLE with Genomic Data
• 570 case-control samples on spina bifida.
• We want to identify genes associated with spina bifida from 115 SNPs.
• In the original paper (Shaw et al., 2009), a univariate analysis was performed.
• The original analysis missed rs1001761 and rs2853532 in the TYMS gene because they are closely linked, with counteracting effects on spina bifida.
• With TMLE, the signals from these two SNPs were recovered.
• In the TMLE, Q_0 was obtained from LASSO, and g(W) was obtained from a simple regression of each SNP on its two flanking markers, to account for confounding effects of neighboring markers.
TMLE p-values for SNPs
[Figure: −log10 p-values from the TMLE analysis for the 115 SNPs, grouped by gene (MTHFR, MTR, MTHFD2, MTRR, BHMT2, BHMT, DHFR, NOS3, FOLR1, FOLR2, MTHFD1, TYMS, CBS, RFC1).]
Targeted Learning of Data-Dependent Target Parameters (van der Laan, Hubbard, 2013)
• Define algorithms that map the data into a target parameter, Ψ_{P_n} : M → R^d, thereby generalizing the notion of a target parameter.
• Develop methods to obtain statistical inference for Ψ_{P_n}(P_0) or for

  (1/V) ∑_{v=1}^{V} Ψ_{P_{n,v}}(P_0),

  where P_{n,v} is the empirical distribution of the parameter-generating sample corresponding with the v-th sample split. We have developed a cross-validated TMLE for the latter data-dependent target parameter, without any additional conditions. (A toy sketch follows this slide.)
• In particular, this generalized framework allows us to generate a subset of target parameters among a massive set of candidate target parameters, while only having to deal with multiple testing for the data-adaptively selected set of target parameters.
• This is thus much more powerful than regular multiple testing for a fixed set of null hypotheses.
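The mechanics of the sample-split parameter above can be illustrated with a deliberately toy sketch: on each parameter-generating sample the target itself is chosen from the data (here, hypothetically, the mean of the covariate most correlated with the outcome), then estimated on the held-out estimation sample and averaged over the V splits. The target choice is purely an assumption to make the construction concrete; the actual proposal uses CV-TMLE for inference.

```python
# Toy sketch of a sample-split, data-dependent target parameter.
import numpy as np
from sklearn.model_selection import KFold

def split_specific_estimate(X, Y, V=5, seed=1):
    """Average over splits of a data-adaptively chosen target parameter."""
    ests = []
    for gen, est in KFold(n_splits=V, shuffle=True, random_state=seed).split(X):
        # parameter-generating sample P_{n,v}: choose the target from the data
        cors = [abs(np.corrcoef(X[gen, j], Y[gen])[0, 1])
                for j in range(X.shape[1])]
        j_star = int(np.argmax(cors))
        # estimation sample: estimate the selected target (here a toy mean)
        ests.append(X[est, j_star].mean())
    return float(np.mean(ests))   # estimates (1/V) sum_v Psi_{P_{n,v}}(P_0)
```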
Online Targeted MLE: Ongoing work
• Order the data if not ordered naturally.
• Partition into subsets numbered from 1 to K.
• Initiate the initial estimator and TMLE based on the first subset.
• Update the initial estimator based on the second subset, and update the TMLE based on the second subset.
• Iterate until the last subset.
• The final estimator is the average of all stage-specific TMLEs.
• In this manner, for each subset the number of calculations is bounded by the number of observations in the subset, and total computation time increases linearly in the number of subsets.
• One can still prove asymptotic efficiency of this online TMLE.
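A compressed sketch of this recipe, reusing the earlier tmle_ate sketch as the per-chunk estimator: the chunk-wise refitting below stands in for the warm-started updates of the initial estimator described above, which is a simplifying assumption of the sketch.

```python
# Online TMLE sketch: average of chunk-specific TMLEs over an ordered partition.
import numpy as np

def online_tmle(Y, A, W, n_chunks):
    """Average of per-chunk TMLEs of the ATE (tmle_ate defined earlier)."""
    chunks = zip(np.array_split(Y, n_chunks),
                 np.array_split(A, n_chunks),
                 np.array_split(W, n_chunks))
    psis = [tmle_ate(Yc, Ac, Wc)[0] for Yc, Ac, Wc in chunks]
    # per-chunk work is bounded by the chunk size, so total computation time
    # grows linearly in the number of chunks
    return float(np.mean(psis))
```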
Concluding remarks
• Sound foundations of statistics are in place (data as realizations of random variables, a statistical model, a target parameter, inference based on a limit distribution), but these have eroded over many decades.
• However, MLE had to be revamped into TMLE to deal with large models.
• Big Data asks for the development of fast TMLEs that do not give up on statistical properties: e.g., online TMLE.
• Big Data asks for research teams that consist of top statisticians and computer scientists, beyond subject-matter experts.
• The philosophical soundness of proposed methods is hugely important and should become the norm.
References

H. Bang and J. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962–972, 2005.

M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. van der Laan. Targeted minimum loss based estimation of marginal structural working models. Journal of Causal Inference, submitted, 2013.

E. Polley and M. van der Laan. SuperLearner: Super Learner Prediction, 2012. URL http://CRAN.R-project.org/package=SuperLearner. R package version 2.0-6.

E. Polley, S. Rose, and M. van der Laan. Super learning. Chapter 3 in Targeted Learning, van der Laan and Rose (2012), 2012.

M. Schnitzer, E. Moodie, M. van der Laan, R. Platt, and M. Klein. Marginal structural modeling of a survival outcome with targeted maximum likelihood estimation. Biometrics, to appear, 2013.

M. van der Laan and S. Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: finite sample oracle inequalities and examples. UC Berkeley Division of Biostatistics Working Paper Series, paper 130, 2003.

M. J. van der Laan and S. Gruber. Targeted minimum loss based estimation of causal effects of multiple time point interventions. The International Journal of Biostatistics, 8(1), Jan. 2012. doi: 10.1515/1557-4679.1370.

M. J. van der Laan and D. Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1), Jan. 2006. doi: 10.2202/1557-4679.1043.

M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1), Jan. 2007. doi: 10.2202/1544-6115.1309.

A. van der Vaart, S. Dudoit, and M. van der Laan. Oracle inequalities for multi-fold cross-validation. Statistics & Decisions, 24(3):351–371, 2006.