Regression Modeling Strategies
using the R Package rms

Frank E Harrell Jr
Department of Biostatistics
Vanderbilt University School of Medicine
Nashville TN 37232
biostat.mc.vanderbilt.edu/rms

useR! The R User Conference
University of Warwick, Coventry, UK
15 August 2011

Copyright 1995-2011 FE Harrell, All Rights Reserved
Contents

1 Introduction
  1.1 Hypothesis Testing, Estimation, and Prediction
  1.2 Examples of Uses of Predictive Multivariable Modeling
  1.3 Misunderstandings about Prediction vs. Classification
  1.4 Planning for Modeling
  1.5 Choice of the Model
  1.6 Model Uncertainty / Data-Driven Model Specification

2 General Aspects of Fitting Regression Models
  2.1 Notation for Multivariable Regression Models
  2.2 Model Formulations
  2.3 Interpreting Model Parameters
    2.3.1 Nominal Predictors
    2.3.2 Interactions
    2.3.3 Example: Inference for a Simple Model
  2.4 Review of Composite (Chunk) Tests
  2.5 Relaxing Linearity Assumption for Continuous Predictors
    2.5.1 Avoiding Categorization
    2.5.2 Simple Nonlinear Terms
    2.5.3 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations
    2.5.4 Cubic Spline Functions
    2.5.5 Restricted Cubic Splines
    2.5.6 Choosing Number and Position of Knots
    2.5.7 Nonparametric Regression
    2.5.8 Advantages of Regression Splines over Other Methods
  2.6 Recursive Partitioning: Tree-Based Models
  2.7 New Directions in Predictive Modeling
  2.8 Multiple Degree of Freedom Tests of Association
  2.9 Assessment of Model Fit
    2.9.1 Regression Assumptions
    2.9.2 Modeling and Testing Complex Interactions
    2.9.3 Fitting Ordinal Predictors
    2.9.4 Distributional Assumptions

3 Multivariable Modeling Strategies
  3.1 Prespecification of Predictor Complexity Without Later Simplification
    3.1.1 Learning From a Saturated Model
    3.1.2 Using Marginal Generalized Rank Correlations
  3.2 Checking Assumptions of Multiple Predictors Simultaneously
  3.3 Variable Selection
  3.4 Overfitting and Limits on Number of Predictors
  3.5 Shrinkage
  3.6 Collinearity
  3.7 Data Reduction
    3.7.1 Redundancy Analysis
    3.7.2 Variable Clustering
    3.7.3 Transformation and Scaling Variables Without Using Y
    3.7.4 Simultaneous Transformation and Imputation
    3.7.5 Simple Scoring of Variable Clusters
    3.7.6 Simplifying Cluster Scores
    3.7.7 How Much Data Reduction Is Necessary?
  3.8 Overly Influential Observations
  3.9 Comparing Two Models
  3.10 Summary: Possible Modeling Strategies
    3.10.1 Developing Predictive Models
    3.10.2 Developing Models for Effect Estimation
    3.10.3 Developing Models for Hypothesis Testing

4 Describing, Resampling, Validating, and Simplifying the Model
  4.1 Describing the Fitted Model
    4.1.1 Interpreting Effects
    4.1.2 Indexes of Model Performance
  4.2 The Bootstrap
  4.3 Model Validation
    4.3.1 Introduction
    4.3.2 Which Quantities Should Be Used in Validation?
    4.3.3 Data-Splitting
    4.3.4 Improvements on Data-Splitting: Resampling
    4.3.5 Validation Using the Bootstrap
  4.4 Simplifying the Final Model by Approximating It
    4.4.1 Difficulties Using Full Models
    4.4.2 Approximating the Full Model
  4.5 How Do We Break Bad Habits?

5 S Software
  5.1 The S Modeling Language
  5.2 User-Contributed Functions
  5.3 The rms Package
  5.4 Other Functions

6 Logistic Model Case Study: Survival of Titanic Passengers
  6.1 Descriptive Statistics
  6.2 Exploring Trends with Nonparametric Regression
  6.3 Binary Logistic Model with Casewise Deletion of Missing Values
  6.4 Examining Missing Data Patterns
  6.5 Single Conditional Mean Imputation
  6.6 Multiple Imputation
  6.7 Summarizing the Fitted Model

7 Case Study in Parametric Survival Modeling and Model Approximation
  7.1 Descriptive Statistics
  7.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model
  7.3 Summarizing the Fitted Model
  7.4 Internal Validation of the Fitted Model Using the Bootstrap
  7.5 Approximating the Full Model

Bibliography
Course Philosophy

• Satisfaction of model assumptions improves precision and increases statistical power
• It is more productive to make a model fit step by step (e.g., transformation estimation) than to postulate a simple model and find out what went wrong
• Graphical methods should be married to formal inference
• Overfitting occurs frequently, so data reduction and model validation are important
• Software without multiple facilities for assessing and fixing model fit may only seem to be user-friendly
• Carefully fitting an improper model is better than badly fitting (and overfitting) a well-chosen one
• Methods which work for all types of regression models are the most valuable
• In most research projects the cost of data collection far outweighs the cost of data analysis, so it is important to use the most efficient and accurate modeling techniques, to avoid categorizing continuous variables, and to not remove data from the estimation sample just to be able to validate the model
• The bootstrap is a breakthrough for statistical modeling and model validation
• Using the data to guide the data analysis is almost as dangerous as not doing so
• A good overall strategy is to decide how many degrees of freedom (i.e., number of regression parameters) can be "spent", where they should be spent, and then to spend them with no regrets

See the excellent text Clinical Prediction Models by Steyerberg [104].
Chapter 1  Introduction

1.1 Hypothesis Testing, Estimation, and Prediction

Even when only testing H0, a model-based approach has advantages:

• Permutation and rank tests are not as useful for estimation
• They cannot readily be extended to cluster sampling or repeated measurements
• Models generalize tests
  – 2-sample t-test, ANOVA → multiple linear regression
  – Wilcoxon, Kruskal-Wallis, Spearman → proportional odds ordinal logistic model
  – log-rank → Cox
• Models not only allow for multiplicity adjustment but also for shrinkage of estimates
  – Statisticians are comfortable with P-value adjustment but fail to recognize that the difference between the most different treatments is badly biased

Statistical estimation is usually model-based:

• Relative effect of increasing cholesterol from 200 to 250 mg/dl on the hazard of death, holding other risk factors constant
• Adjustment depends on how the other risk factors relate to the hazard
• Usually interested in adjusted (partial) effects, not unadjusted (marginal or crude) effects
1.2 Examples of Uses of Predictive Multivariable Modeling

• Financial performance, consumer purchasing, loan pay-back
• Ecology
• Product life
• Employment discrimination
• Medicine, epidemiology, health services research
• Probability of diagnosis, time course of a disease
• Comparing non-randomized treatments
• Getting the correct estimate of relative effects in randomized studies requires covariable adjustment if the model is nonlinear
  – Crude odds ratios are biased towards 1.0 if the sample is heterogeneous
• Estimating absolute treatment effect (e.g., risk difference)
  – Use, e.g., the difference in two predicted probabilities
• Cost-effectiveness ratios
  – incremental cost / incremental ABSOLUTE benefit
  – most studies use average cost difference / average benefit, which may apply to no one
1.3 Misunderstandings about Prediction vs. Classification

• Many analysts desire to develop "classifiers" instead of predictions
• Suppose that
  1. the response variable is binary,
  2. the two levels represent a sharp dichotomy with no gray zone (e.g., complete success vs. total failure with no possibility of a partial success),
  3. one is forced to assign (classify) future observations to only these two choices, and
  4. the cost of misclassification is the same for every future observation, and the ratio of the cost of a false positive to the cost of a false negative equals the (often hidden) ratio implied by the analyst's classification rule
• Then classification is still suboptimal for driving the development of a predictive instrument as well as for hypothesis testing and estimation
• Far better is to use the full information in the data to develop a probability model, then develop classification rules on the basis of estimated probabilities
  – ↑ power, ↑ precision
• Classification is more problematic if the response variable is ordinal or continuous or the groups are not truly distinct (e.g., disease or no disease when severity of disease is on a continuum); dichotomizing it up front for the analysis is not appropriate
  – the minimum loss of information (when dichotomization is at the median) is large
  – may require the sample size to increase many-fold to compensate for the loss of information [46]
• Two-group classification represents an artificial forced choice
  – the best option may be "no choice, get more data"
• Unlike prediction (e.g., of absolute risk), classification implicitly uses utility (loss; cost of a false positive or false negative) functions
• Hidden problems:
  – The utility function depends on variables not collected (subjects' preferences) that are available only at the decision point
  – Classification assumes every subject has the same utility function
  – It assumes this function coincides with the analyst's
• Formal decision analysis uses
  – optimum predictions using all available data
  – subject-specific utilities, which are often based on variables not predictive of the outcome
• ROC analysis is misleading except for the special case of mass one-time group decision making with unknowable utilities

See [15, 19, 43, 49, 50, 113].

The accuracy score used to drive model building should be a continuous score that utilizes all of the information in the data.

The Dichotomizing Motorist

• The speed limit is 60.
• I am going faster than the speed limit.
• Will I be caught?

An answer by a dichotomizer:

• Are you going faster than 70?

An answer from a better dichotomizer:

• If you are among other cars, are you going faster than 73?
• If you are exposed, are you going faster than 67?

Better:

• How fast are you going and are you exposed?

Analogy to most medical diagnosis research in which a +/− diagnosis is a false dichotomy of an underlying disease severity:

• The speed limit is moderately high.
• I am going fairly fast.
• Will I be caught?
1.4 Planning for Modeling

• Chance that the predictive model will be used [94]
• Response definition, follow-up
• Variable definitions
• Observer variability
• Missing data
• Preference for continuous variables
• Subjects
• Sites

What can keep a sample of data from being appropriate for modeling:

1. Most important predictor or response variables not collected
2. Subjects in the dataset are ill-defined or not representative of the population to which inferences are needed
3. Data collection sites do not represent the population of sites
4. Key variables missing in large numbers of subjects
5. Data not missing at random
6. No operational definitions for key variables and/or measurement errors severe
7. No observer variability studies done

What else can go wrong in modeling?

1. The process generating the data is not stable.
2. The model is misspecified with regard to nonlinearities or interactions, or there are predictors missing.
3. The model is misspecified in terms of the transformation of the response variable or the model's distributional assumptions.
4. The model contains discontinuities (e.g., by categorizing continuous predictors or fitting regression shapes with sudden changes) that can be gamed by users.
5. Correlations among subjects are not specified, or the correlation structure is misspecified, resulting in inefficient parameter estimates and overconfident inference.
6. The model is overfitted, resulting in predictions that are too extreme or positive associations that are false.
7. The user of the model relies on predictions obtained by extrapolating to combinations of predictor values well outside the range of the dataset used to develop the model.
8. Accurate and discriminating predictions can lead to behavior changes that make future predictions inaccurate.

Iezzoni [68] lists these dimensions to capture, for patient outcome studies:

1. age
2. sex
3. acute clinical stability
4. principal diagnosis
5. severity of principal diagnosis
6. extent and severity of comorbidities
7. physical functional status
8. psychological, cognitive, and psychosocial functioning
9. cultural, ethnic, and socioeconomic attributes and behaviors
10. health status and quality of life
11. patient attitudes and preferences for outcomes

General aspects to capture in the predictors:

1. baseline measurement of the response variable
2. current status
3. trajectory as of time zero, or past levels of a key variable
4. variables explaining much of the variation in the response
5. more subtle predictors whose distributions strongly differ between levels of the key variable of interest in an observational study
1.5 Choice of the Model

• In biostatistics and epidemiology and most other areas we usually choose the model empirically
• The model must use the data efficiently
• Should model overall structure (e.g., acute vs. chronic)
• Robust models are better
• Should have the correct mathematical structure (e.g., constraints on probabilities)
1.6 Model Uncertainty / Data-Driven Model Specification

• Standard errors, confidence limits, P-values, and R² are wrong if computed as if the model were pre-specified
• Stepwise variable selection is widely used and abused
• The bootstrap can be used to repeat all analysis steps to properly penalize variances, etc.
• Ye [125]: "generalized degrees of freedom" (GDF) for any "data mining" or model selection procedure based on least squares
  – Example: 20 candidate predictors, n = 22, forward stepwise, best 5-variable model: GDF = 14.1
  – Example: CART, 10 candidate predictors, n = 100, 19 nodes: GDF = 76
• See [79] for an approach involving adding noise to Y to improve variable selection
Chapter 2  General Aspects of Fitting Regression Models

2.1 Notation for Multivariable Regression Models

• Weighted sum of a set of independent or predictor variables
• Interpret parameters and state assumptions by linearizing the model with respect to the regression coefficients
• Analysis of variance setups, interaction effects, nonlinear effects
• Examining the 2 regression assumptions

Notation:

  Y            response (dependent) variable
  X            X1, X2, ..., Xp — list of predictors
  β            β0, β1, ..., βp — regression coefficients
  β0           intercept parameter (optional)
  β1, ..., βp  weights or regression coefficients
  Xβ           β0 + β1 X1 + ... + βp Xp, with X0 = 1

Model: the connection between X and Y.
C(Y|X): a property of the distribution of Y given X, e.g. C(Y|X) = E(Y|X) or Prob{Y = 1|X}.
2.2 Model Formulations

General regression model:

  C(Y|X) = g(X).

General linear regression model:

  C(Y|X) = g(Xβ).

Examples:

  C(Y|X) = E(Y|X) = Xβ, with Y|X ~ N(Xβ, σ^2)
  C(Y|X) = Prob{Y = 1|X} = (1 + exp(−Xβ))^−1

Linearize: h(C(Y|X)) = Xβ, where h(u) = g^−1(u).

Example:

  C(Y|X) = Prob{Y = 1|X} = (1 + exp(−Xβ))^−1
  h(u) = logit(u) = log(u / (1 − u))
  h(C(Y|X)) = C′(Y|X)   (the link)

General linear regression model: C′(Y|X) = Xβ.
2.3 Interpreting Model Parameters

Suppose that Xj is linear and does not interact with the other X's:

  C′(Y|X) = Xβ = β0 + β1 X1 + ... + βp Xp
  βj = C′(Y|X1, X2, ..., Xj + 1, ..., Xp) − C′(Y|X1, X2, ..., Xj, ..., Xp)

Drop the ′ from C′ and assume that C(Y|X) is the property of Y that is linearly related to the weighted sum of the X's.
2.3.1 Nominal Predictors

A nominal (polytomous) factor with k levels is coded with k − 1 dummy variables. E.g. T = J, K, L, M:

  C(Y|T = J) = β0
  C(Y|T = K) = β0 + β1
  C(Y|T = L) = β0 + β2
  C(Y|T = M) = β0 + β3

  C(Y|T) = Xβ = β0 + β1 X1 + β2 X2 + β3 X3,

where
  X1 = 1 if T = K, 0 otherwise
  X2 = 1 if T = L, 0 otherwise
  X3 = 1 if T = M, 0 otherwise.

The test for any differences in the property C(Y) between treatments is H0: β1 = β2 = β3 = 0.
2.3.2 Interactions

For X1 and X2, the effect of X1 on Y depends on the level of X2. One way to describe interaction is to add X3 = X1 X2 to the model:

  C(Y|X) = β0 + β1 X1 + β2 X2 + β3 X1 X2.

  C(Y|X1 + 1, X2) − C(Y|X1, X2)
    = β0 + β1 (X1 + 1) + β2 X2 + β3 (X1 + 1) X2
      − [β0 + β1 X1 + β2 X2 + β3 X1 X2]
    = β1 + β3 X2.

Effect of a one-unit increase in X2 on C(Y|X): β2 + β3 X1.

Worse interactions: if X1 is binary, the interaction may take the form of a difference in shape (and/or distribution) of X2 vs. C(Y) depending on whether X1 = 0 or X1 = 1 (e.g. logarithm vs. square root).
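The algebra above can be checked numerically. Below is a minimal sketch (simulated data with hypothetical variable names x1 and x2, not from the course notes) that fits a product interaction with rms and reports the estimated x1 effect at two settings of x2, i.e., β1 + β3 x2:

require(rms)
set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 2*x1 - x2 + 3*x1*x2 + rnorm(n)
dd <- datadist(x1, x2); options(datadist='dd')
f  <- ols(y ~ x1 * x2)            # adds the x1:x2 product term
summary(f, x1=c(0, 1), x2=0.2)    # effect of x1 going 0 -> 1, holding x2 = 0.2
summary(f, x1=c(0, 1), x2=0.8)    # same effect at x2 = 0.8
anova(f)                          # includes the interaction test H0: beta3 = 0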
2.3.3 Example: Inference for a Simple Model

We postulate the model C(Y|age, sex) = β0 + β1 age + β2 (sex = f) + β3 age (sex = f), where sex = f is a dummy indicator variable for sex = female, i.e., the reference cell is sex = male.(a)

The model assumes

1. age is linearly related to C(Y) for males,
2. age is linearly related to C(Y) for females,
3. the interaction between age and sex is simple, and
4. whatever distribution, variance, and independence assumptions are appropriate for the model being considered.

Interpretations of parameters:

  Parameter  Meaning
  β0         C(Y | age = 0, sex = m)
  β1         C(Y | age = x + 1, sex = m) − C(Y | age = x, sex = m)
  β2         C(Y | age = 0, sex = f) − C(Y | age = 0, sex = m)
  β3         C(Y | age = x + 1, sex = f) − C(Y | age = x, sex = f)
               − [C(Y | age = x + 1, sex = m) − C(Y | age = x, sex = m)]

β3 is the difference in slopes (female − male).

(a) You can also think of the last part of the model as being β3 X3, where X3 = age × I[sex = f].

When a high-order effect such as an interaction effect is in the model, be sure to interpret low-order effects by finding out what makes the interaction effect ignorable. In our example, the interaction effect is zero when age = 0 or sex is male.

Hypotheses that are usually inappropriate:

1. H0: β1 = 0: this tests whether age is associated with Y for males
2. H0: β2 = 0: this tests whether sex is associated with Y for zero-year-olds
More useful hypotheses follow. For any hypothesis one needs to

• Write what is being tested
• Translate to the parameters tested
• List the alternative hypothesis
• Not forget what the test is powered to detect
  – A test against a nonzero slope has maximum power when linearity holds
  – If the true relationship is monotonic, a test for non-flatness will have some, but not optimal, power
  – A test against a quadratic (parabolic) shape will have some power to detect a logarithmic shape, but not against a sine wave over many cycles
• It is useful to write e.g. "Ha: age is associated with C(Y), powered to detect a linear relationship"
Most Useful Tests for the Linear age × sex Model

  Null or Alternative Hypothesis                              Mathematical Statement
  Effect of age is independent of sex, or                     H0: β3 = 0
    effect of sex is independent of age, or
    age and sex are additive; age effects are parallel
  age interacts with sex;                                     Ha: β3 ≠ 0
    age modifies the effect of sex;
    sex modifies the effect of age;
    sex and age are non-additive (synergistic)
  age is not associated with Y                                H0: β1 = β3 = 0
  age is associated with Y;                                   Ha: β1 ≠ 0 or β3 ≠ 0
    age is associated with Y for either females or males
  sex is not associated with Y                                H0: β2 = β3 = 0
  sex is associated with Y;                                   Ha: β2 ≠ 0 or β3 ≠ 0
    sex is associated with Y for some value of age
  Neither age nor sex is associated with Y                    H0: β1 = β2 = β3 = 0
  Either age or sex is associated with Y                      Ha: β1 ≠ 0 or β2 ≠ 0 or β3 ≠ 0

Note: The last test is called the global test of no association. If an interaction effect is present, there is both an age and a sex effect. There can also be age or sex effects when the lines are parallel. The global test of association (test of total association) has 3 d.f. instead of 2 (age + sex) because it allows for unequal slopes.
2.4 Review of Composite (Chunk) Tests

• In the model

    y ~ age + sex + weight + waist + tricep

  we may want to jointly test the association between all body measurements and the response, holding age and sex constant.
• This 3 d.f. test may be obtained two ways:
  – Remove the 3 variables and compute the change in SSR or SSE
  – Test H0: β3 = β4 = β5 = 0 using matrix algebra (e.g., anova(fit, weight, waist, tricep) if fit is a fit object created by the R rms package)
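As a concrete illustration, here is a minimal sketch (simulated data; the variable names mirror the formula above but the data are invented) of the second approach; the printed anova output includes the combined 3 d.f. chunk test for weight, waist, and tricep:

require(rms)
set.seed(2)
n <- 300
d <- data.frame(age    = rnorm(n, 50, 10),
                sex    = factor(sample(c('m','f'), n, TRUE)),
                weight = rnorm(n, 80, 15),
                waist  = rnorm(n, 90, 12),
                tricep = rnorm(n, 20, 5))
d$y <- with(d, 0.03*age + 0.02*weight + rnorm(n))
dd  <- datadist(d); options(datadist='dd')
fit <- ols(y ~ age + sex + weight + waist + tricep, data=d)
anova(fit, weight, waist, tricep)   # joint (chunk) test of the 3 coefficients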
2.5 Relaxing Linearity Assumption for Continuous Predictors

2.5.1 Avoiding Categorization

• Relationships are seldom linear, except when predicting one variable from itself measured earlier
• Categorizing continuous predictors into intervals is a disaster [1, 2, 4, 8, 21, 44, 46, 64, 66, 73, 81, 84, 93, 96, 99, 108, 115]
• Some problems caused by this approach:
  1. Estimated values have reduced precision, and associated tests have reduced power
  2. Categorization assumes the relationship between predictor and response is flat within intervals; this is far less reasonable than a linearity assumption in most cases
  3. To make a continuous predictor be more accurately modeled when categorization is used, multiple intervals are required
  4. Because of sample size limitations in the very low and very high range of the variable, the outer intervals (e.g., outer quintiles) will be wide, resulting in significant heterogeneity of subjects within those intervals, and residual confounding
  5. Categorization assumes that there is a discontinuity in response as interval boundaries are crossed. Other than the effect of time (e.g., an instant stock price drop after bad news), there are very few examples in which such discontinuities have been shown to exist.
  6. Categorization only seems to yield interpretable estimates. E.g. the odds ratio for stroke for persons with a systolic blood pressure > 160 mmHg compared to persons with a blood pressure ≤ 160 mmHg → the interpretation of the OR depends on the distribution of blood pressures in the sample (the proportion of subjects > 170, > 180, etc.). If blood pressure is modeled as a continuous variable (e.g., using a regression spline, quadratic, or linear effect) one can estimate the ratio of odds for exact settings of the predictor, e.g., the odds ratio for 200 mmHg compared to 120 mmHg.
  7. Categorization does not condition on full information. When, for example, the risk of stroke is being assessed for a new subject with a known blood pressure (say 162 mmHg), the subject does not report to her physician "my blood pressure exceeds 160" but rather reports 162 mmHg. The risk for this subject will be much lower than that of a subject with a blood pressure of 200 mmHg.
  8. If cutpoints are determined in a way that is not blinded to the response variable, calculation of P-values and confidence intervals requires special simulation techniques; ordinary inferential methods are completely invalid. E.g.: cutpoints chosen by trial and error utilizing Y, even informally → P-values too small and confidence limits not accurate.(b)
  9. Categorization not blinded to Y → biased effect estimates [4, 99]
  10. "Optimal" cutpoints do not replicate over studies. Hollander et al. [66] state that "...the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the 'optimal' cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were included with 12 different cutpoints... Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology."
  11. Disagreements in cutpoints (which are bound to happen whenever one searches for things that do not exist) cause severe interpretation problems. One study may provide an odds ratio for comparing body mass index (BMI) > 30 with BMI ≤ 30, another for comparing BMI > 28 with BMI ≤ 28. Neither of these has a good definition and the two estimates are not comparable.
  12. Cutpoints are arbitrary and manipulatable; cutpoints can be found that can result in both positive and negative associations [115].
  13. If a confounder is adjusted for by categorization, there will be residual confounding that can be explained away by inclusion of the continuous form of the predictor in the model in addition to the categories.
• To summarize: the use of a (single) cutpoint c makes many assumptions, including:
  1. The relationship between X and Y is discontinuous at X = c and only at X = c
  2. c is correctly found as the cutpoint
  3. X vs. Y is flat to the left of c
  4. X vs. Y is flat to the right of c
  5. The choice of c does not depend on the values of other predictors

(b) If a cutpoint is chosen that minimizes the P-value and the resulting P-value is 0.05, the true type I error can easily be above 0.5 [66].
2.5.2 Simple Nonlinear Terms

  C(Y|X1) = β0 + β1 X1 + β2 X1^2.

• H0: model is linear in X1 vs. Ha: model is quadratic in X1 ≡ H0: β2 = 0.
• The test of linearity may be powerful if the true model is not extremely non-parabolic
• Predictions are not accurate in general, as many phenomena are non-quadratic
• Can get more flexible fits by adding powers higher than 2
• But polynomials do not adequately fit logarithmic functions or "threshold" effects, and have unwanted peaks and valleys.
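A minimal sketch (simulated data; the variable name x1 and the logarithmic true shape are illustrative assumptions) of the quadratic fit and the 1 d.f. test of linearity H0: β2 = 0, using pol() from rms:

require(rms)
set.seed(3)
x1 <- runif(200, 0, 10)
y  <- log(x1 + 1) + rnorm(200, sd=0.3)   # true shape is logarithmic
dd <- datadist(x1); options(datadist='dd')
f  <- ols(y ~ pol(x1, 2))   # fits beta1*x1 + beta2*x1^2
anova(f)                    # the "Nonlinear" row is the test of the quadratic term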
2.5.3 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations

Draftsman's spline: a flexible strip of metal or rubber used to trace curves.

Spline function: a piecewise polynomial.

Linear spline function: a piecewise linear function.

• Bilinear regression: the model is β0 + β1 X if X ≤ a, and β2 + β3 X if X > a.
• Problem with this notation: the two lines are not constrained to join
• To force simple continuity: β0 + β1 X + β2 (X − a) × I[X > a] = β0 + β1 X1 + β2 X2, where X2 = (X1 − a) × I[X1 > a].
• The slope is β1 for X ≤ a, and β1 + β2 for X > a.
• β2 is the slope increment as you pass a

More generally: the X-axis is divided into intervals with endpoints a, b, c (the knots), and

  f(X) = β0 + β1 X + β2 (X − a)_+ + β3 (X − b)_+ + β4 (X − c)_+,

where (u)_+ = u if u > 0, and 0 if u ≤ 0. Thus

  f(X) = β0 + β1 X                                          X ≤ a
       = β0 + β1 X + β2 (X − a)                              a < X ≤ b
       = β0 + β1 X + β2 (X − a) + β3 (X − b)                 b < X ≤ c
       = β0 + β1 X + β2 (X − a) + β3 (X − b) + β4 (X − c)    c < X.

[Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5.]

  C(Y|X) = f(X) = Xβ,

where Xβ = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4, and

  X1 = X
  X2 = (X − a)_+
  X3 = (X − b)_+
  X4 = (X − c)_+.

Overall linearity in X can be tested by testing H0: β2 = β3 = β4 = 0.
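A minimal sketch (simulated data; knot locations and variable names are illustrative assumptions) of fitting a linear spline with rms::lsp and testing overall linearity:

require(rms)
set.seed(4)
x <- runif(300, 0, 6)
y <- pmin(x, 3) + rnorm(300, sd=0.5)      # true shape: rises, then flattens
dd <- datadist(x); options(datadist='dd')
f <- ols(y ~ lsp(x, c(1, 3, 5)))          # linear spline, knots at 1, 3, 5
anova(f)    # "Nonlinear" row tests H0: all (X - knot)_+ coefficients = 0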
2.5.4 Cubic Spline Functions

Cubic splines are smooth at the knots (the function and its first and second derivatives agree) — you can't see the joins.

  f(X) = β0 + β1 X + β2 X^2 + β3 X^3
         + β4 (X − a)_+^3 + β5 (X − b)_+^3 + β6 (X − c)_+^3
       = Xβ

with

  X1 = X, X2 = X^2, X3 = X^3,
  X4 = (X − a)_+^3, X5 = (X − b)_+^3, X6 = (X − c)_+^3.

k knots → k + 3 coefficients excluding the intercept.

The X^2 and X^3 terms must be included to allow nonlinearity when X < a.
2.5.5 Restricted Cubic Splines

Stone and Koo [107]: cubic splines are poorly behaved in the tails. Constrain the function to be linear in the tails.

k + 3 → k − 1 parameters [37].

To force linearity when X < a: the X^2 and X^3 terms must be omitted.
To force linearity when X > the last knot: the last two βs are redundant, i.e., they are just combinations of the other βs.

The restricted spline function with k knots t1, ..., tk is given by [37]

  f(X) = β0 + β1 X1 + β2 X2 + ... + β(k−1) X(k−1),

where X1 = X and, for j = 1, ..., k − 2,

  X(j+1) = (X − tj)_+^3 − (X − t(k−1))_+^3 (tk − tj)/(tk − t(k−1))
           + (X − tk)_+^3 (t(k−1) − tj)/(tk − t(k−1)).

Xj is linear in X for X ≥ tk.

require(Hmisc)
x <- rcspline.eval(seq(0, 1, .01),
                   knots=seq(.05, .95, length=5), inclx=TRUE)
xm <- x
xm[xm > .0106] <- NA
matplot(x[,1], xm, type="l", ylim=c(0, .01),
        xlab=expression(X), ylab='', lty=1)
matplot(x[,1], x, type="l",
        xlab=expression(X), ylab='', lty=1)

[Figure 2.2: Restricted cubic spline component variables for k = 5 and knots at X = .05, .275, .5, .725, and .95. The left panel is a y-magnification of the right panel. Fitted functions such as those in Figure 2.3 will be linear combinations of these basis functions as long as knots are at the same locations used here.]

x <- seq(0, 1, length=300)
for(nk in 3:6) {
  set.seed(nk)
  knots <- seq(.05, .95, length=nk)
  xx <- rcspline.eval(x, knots=knots, inclx=TRUE)
  for(i in 1:(nk-1))
    xx[,i] <- (xx[,i] - min(xx[,i])) / (max(xx[,i]) - min(xx[,i]))
  for(i in 1:20) {
    beta  <- 2*runif(nk-1) - 1
    xbeta <- xx %*% beta + 2*runif(1) - 1
    xbeta <- (xbeta - min(xbeta)) / (max(xbeta) - min(xbeta))
    if(i == 1) {
      plot(x, xbeta, type="l", lty=1,
           xlab=expression(X), ylab='', bty="l")
      title(sub=paste(nk, "knots"), adj=0, cex=.75)
      for(j in 1:nk)
        arrows(knots[j], .04, knots[j], -.03,
               angle=20, length=.07, lwd=1.5)
    } else lines(x, xbeta, col=i)
  }
}

[Figure 2.3: Some typical restricted cubic spline functions for k = 3, 4, 5, 6. The y-axis is Xβ. Arrows indicate knots. These curves were derived by randomly choosing values of β subject to standard deviations of fitted functions being normalized.]

Once β0, ..., β(k−1) are estimated, the restricted cubic spline can be restated in the form

  f(X) = β0 + β1 X + β2 (X − t1)_+^3 + β3 (X − t2)_+^3 + ... + β(k+1) (X − tk)_+^3

by computing

  βk     = [β2 (t1 − tk) + β3 (t2 − tk) + ... + β(k−1) (t(k−2) − tk)] / (tk − t(k−1))
  β(k+1) = [β2 (t1 − t(k−1)) + β3 (t2 − t(k−1)) + ... + β(k−1) (t(k−2) − t(k−1))] / (t(k−1) − tk).

A test of linearity in X can be obtained by testing H0: β2 = β3 = ... = β(k−1) = 0.
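A minimal sketch (simulated data; the variable names and k = 5 are illustrative assumptions) of fitting a restricted cubic spline with rms::rcs and obtaining the pooled test of linearity:

require(rms)
set.seed(5)
x <- runif(400)
y <- sin(2*pi*x) + rnorm(400, sd=0.4)
dd <- datadist(x); options(datadist='dd')
f <- ols(y ~ rcs(x, 5))   # k = 5 knots at default quantiles
anova(f)                  # "Nonlinear" row: H0 beta2 = ... = beta(k-1) = 0
plot(Predict(f, x))       # plot the estimated transformation of x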
2.5.6 Choosing Number and Position of Knots

• Knots are specified in advance in regression splines
• Locations are not important in most situations [39, 106]
• Place knots where data exist — at fixed quantiles of the predictor's marginal distribution
• The fit depends more on the choice of k

  k   Quantiles
  3   .10  .5    .90
  4   .05  .35   .65   .95
  5   .05  .275  .5    .725  .95
  6   .05  .23   .41   .59   .77    .95
  7   .025 .1833 .3417 .5    .6583  .8167  .975

For n < 100, replace the outer quantiles with the 5th smallest and 5th largest X [107].

Choice of k:

• Flexibility of fit vs. n and variance
• Usually k = 3, 4, 5. Often k = 4
• Large n (e.g. n ≥ 100) — k = 5
• Small n (< 30, say) — k = 3
• Can use Akaike's information criterion (AIC) [5, 111] to choose k
• This chooses k to maximize the model likelihood ratio χ² − 2k.

See [51] for a comparison of restricted cubic splines, fractional polynomials, and penalized splines.
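A minimal sketch (simulated data; the stats element name 'Model L.R.' in the fit object is an assumption about rms fits) of choosing k by maximizing the model likelihood ratio χ² − 2k:

require(rms)
set.seed(6)
x <- runif(500)
y <- (x - 0.3)^2 + rnorm(500, sd=0.1)
for(k in 3:7) {
  f  <- ols(y ~ rcs(x, k))
  lr <- f$stats['Model L.R.']          # model likelihood ratio chi-square
  cat('k =', k, ' LR chi2 - 2k =', round(lr - 2*k, 1), '\n')
}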
2.5.7 Nonparametric Regression

• Estimate the tendency (mean or median) of Y as a function of X
• Few assumptions
• Especially handy when there is a single X
• The plotted trend line may be the final result of the analysis
• Simplest smoother: the moving average

    X:  1    2    3    5     8
    Y:  2.1  3.8  5.7  11.1  17.2

  Ê(Y | X = 2) = (2.1 + 3.8 + 5.7)/3
  Ê(Y | X = (2 + 3 + 5)/3) = (3.8 + 5.7 + 11.1)/3

  – overlapping windows are OK
  – problem in estimating E(Y) at the outer X-values
  – estimates are very sensitive to bin width
• Moving linear regression is far superior to the moving average (a moving flat line)
• Cleveland's [27] moving linear regression smoother loess (locally weighted least squares) is the most popular smoother. To estimate the central tendency of Y at X = x:
  – take all the data having X values within a suitable interval about x (the default is 2/3 of the data)
  – fit a weighted least squares linear regression within this neighborhood
  – points near x are given the most weight(c)
  – points near the extremes of the interval receive almost no weight
  – loess works much better at the extremes of X than the moving average
  – it provides an estimate at each observed X; other estimates are obtained by linear interpolation
  – an outlier rejection algorithm is built in
• loess works great for binary Y — just turn off outlier detection
• Other popular smoother: Friedman's "super smoother"
• For loess or supsmu the amount of smoothing can be controlled by the analyst
• Another alternative: smoothing splines(d)
• Smoothers are very useful for estimating trends in residual plots

(c) Weight here means something different than a regression coefficient. It means how much a point is emphasized in developing the regression coefficients.
(d) These place knots at all the observed data points but penalize coefficient estimates towards smoothness.
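A minimal sketch (simulated data, hypothetical variable names) of a lowess trend for a binary Y with the outlier-rejection iterations turned off:

set.seed(7)
age <- runif(500, 20, 80)
y   <- rbinom(500, 1, plogis(-5 + 0.08*age))   # binary response
plot(age, y)
lines(lowess(age, y, iter=0), lwd=2)           # iter=0: no outlier rejection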
2.5.8 Advantages of Regression Splines over Other Methods

Regression splines have several advantages [60]:

• Parametric splines can be fitted using any existing regression program
• Regression coefficients are estimated using standard techniques (ML or least squares); formal tests of no overall association, linearity, and additivity, and confidence limits for the estimated regression function, are derived by standard theory.
• The fitted function directly estimates the transformation a predictor should receive to yield linearity in C(Y|X).
• Even when a simple transformation is obvious, the spline function can be used to represent the predictor in the final model (and the d.f. will be correct). Nonparametric methods do not yield a prediction equation.
• Extension to non-additive models. Multi-dimensional nonparametric estimators often require burdensome computations.
2.6 Recursive Partitioning: Tree-Based Models

Breiman, Friedman, Olshen, and Stone [18]: CART (Classification and Regression Trees) — essentially model-free.

Method:

• Find the predictor so that the best possible binary split has the maximum value of some statistic for comparing 2 groups
• Within previously formed subsets, find the best predictor and split maximizing the criterion in the subset
• Proceed in like fashion until < k obs. remain to split
• Summarize Y for the terminal node (e.g., mean, modal category)
• Prune the tree backward until it cross-validates as well as its "apparent" accuracy, or use shrinkage

Advantages/disadvantages of recursive partitioning:

• Does not require a functional form for predictors
• Does not assume additivity — can identify complex interactions
• Can deal with missing data flexibly
• Interactions detected are frequently spurious
• Does not use continuous predictors effectively
• Penalty for overfitting in 3 directions
• Often the tree doesn't cross-validate optimally unless pruned back very conservatively
• Very useful in messy situations or those in which overfitting is not as problematic (confounder adjustment using propensity scores [28]; missing value imputation)

See [7].
2.7 New Directions in Predictive Modeling

The approaches recommended in this course are

• fitting fully pre-specified models without deletion of "insignificant" predictors
• using data reduction methods (masked to Y) to reduce the dimensionality of the predictors and then fitting the number of parameters the data's information content can support
• using shrinkage (penalized estimation) to fit a large model without worrying about the sample size.

The data reduction approach can yield very interpretable, stable models, but there are many decisions to be made when using a two-stage (reduction/model fitting) approach. Newer approaches are evolving, including the following; these handle continuous predictors well, unlike recursive partitioning.

• lasso (shrinkage using the L1 norm, favoring zero regression coefficients) [105, 110]
• elastic net (a combination of the L1 and L2 norms that handles the p > n case better than the lasso) [129]
• adaptive lasso [116, 127]
• a more flexible lasso that differentially penalizes for variable selection and for regression coefficient estimation [92]
• group lasso, to force selection of all or none of a group of related variables (e.g., dummy variables representing a polytomous predictor)
• group lasso-like procedures that also allow for variables within a group to be removed [117]
• adaptive group lasso (Wang & Leng)
• Breiman's nonnegative garrote [124]
• "preconditioning", i.e., model simplification after developing a "black box" predictive model [87]
• sparse principal components analysis to achieve parsimony in data reduction [77, 78, 121, 128]
• bagging, boosting, and random forests [62]

One problem prevents most of these methods from being ready for everyday use: they require scaling predictors before fitting the model. When a predictor is represented by nonlinear basis functions, the scaling recommendations in the literature are not sensible. There are also computational issues and difficulties obtaining hypothesis tests and confidence intervals.

When data reduction is not required, generalized additive models [63, 122] should also be considered.
2.8 Multiple Degree of Freedom Tests of Association

In the model

  C(Y|X) = β0 + β1 X1 + β2 X2 + β3 X2^2,

test H0: β2 = β3 = 0 with 2 d.f. to assess the association between X2 and the outcome.

In the 5-knot restricted cubic spline model

  C(Y|X) = β0 + β1 X + β2 X′ + β3 X′′ + β4 X′′′,

test H0: β1 = ... = β4 = 0.

• Test of association: 4 d.f.
• If insignificant → dangerous to interpret the plot
• What to do if the 4 d.f. test is insignificant, the 3 d.f. test for linearity is insignificant, and the 1 d.f. test is significant after deleting the nonlinear terms?

Grambsch and O'Brien [52] elegantly described the hazards of pretesting:

• Studied quadratic regression
• Showed that the 2 d.f. test of association is nearly optimal even when the regression is linear, if nonlinearity is entertained
• Considered the ordinary regression model E(Y|X) = β0 + β1 X + β2 X^2
• Two ways to test the association between X and Y
• Fit the quadratic model and test for linearity (H0: β2 = 0)
• If the F-test for linearity is significant at the α = 0.05 level → report as the final test of association the 2 d.f. F test of H0: β1 = β2 = 0
• If the test of linearity is insignificant, refit without the quadratic term; the final test of association is the 1 d.f. test, H0: β1 = 0 | β2 = 0
• Showed that the type I error of this two-stage procedure is > α
• A fairly accurate P-value is obtained by instead testing against F with 2 d.f. even at the second stage
• Cause: we are retaining the most significant part of F
• But testing against 2 d.f. can only lose power when compared with the original F test for both βs
• SSR from the quadratic model > SSR from the linear model
2.9 Assessment of Model Fit

2.9.1 Regression Assumptions

The general linear regression model is

  C(Y|X) = Xβ = β0 + β1 X1 + β2 X2 + ... + βk Xk.

Verify linearity and additivity. Special case:

  C(Y|X) = β0 + β1 X1 + β2 X2,

where X1 is binary and X2 is continuous.

[Figure 2.4: Regression assumptions for one binary and one continuous predictor: C(Y) plotted against X2 as two parallel lines, one for X1 = 0 and one for X1 = 1.]

Methods for checking fit:

1. Fit a simple linear additive model and examine residual plots for patterns
   • For OLS: box plots of e stratified by X1, scatterplots of e vs. X2 and Y, with trend curves (want flat central tendency, constant variability)
   • For normality, qqnorm plots of overall and stratified residuals
   Advantage: simplicity
   Disadvantages:
   • Can only compute standard residuals for an uncensored continuous response
   • Subjective judgment of non-randomness
   • Hard to handle interaction
   • Hard to see patterns with large n (trend lines help)
   • Seeing patterns does not lead to corrective action
2. Scatterplot of Y vs. X2 using different symbols according to values of X1
   Advantages: simplicity, can see interaction
   Disadvantages:
   • Scatterplots cannot be drawn for binary, categorical, or censored Y
   • Patterns are difficult to see if relationships are weak or n is large
3. Stratify the sample by X1 and quantile groups (e.g. deciles) of X2; estimate C(Y|X1, X2) for each stratum
   Advantages: simplicity, can see interactions, handles censored Y (if you are careful)
   Disadvantages:
   • Requires large n
   • Does not use the continuous variable effectively (no interpolation)
   • Subgroup estimates have low precision
   • Dependent on binning method
4. Separately for levels of X1, fit a nonparametric smoother relating X2 to Y
   Advantages: all regression aspects of the model can be summarized efficiently with minimal assumptions
   Disadvantages:
   • Does not apply to censored Y
   • Hard to deal with multiple predictors
5. Fit a flexible nonlinear parametric model
   Advantages:
   • One framework for examining the model assumptions, fitting the model, and drawing formal inference
   • d.f. defined and all aspects of statistical inference "work as advertised"
   Disadvantages:
   • Complexity
   • Generally difficult to allow for interactions when assessing patterns of effects

Confidence limits and formal inference can be problematic for methods 1-4.

The restricted cubic spline works well for method 5:

  C(Y|X) = β0 + β1 X1 + β2 X2 + β3 X2′ + β4 X2′′
         = β0 + β1 X1 + f(X2),

where f(X2) = β2 X2 + β3 X2′ + β4 X2′′ is the spline-estimated transformation of X2.

• Plot f(X2) vs. X2
• n large → can fit separate functions by X1
• Test of linearity: H0: β3 = β4 = 0
• Nonlinear → use the transformation suggested by the spline fit, or keep the spline terms
• Tentative transformation g(X2) → check adequacy by expanding g(X2) in a spline function and testing linearity
• Can find transformations by plotting g(X2) vs. f(X2) for a variety of g
• Multiple continuous predictors → expand each using splines
• Example: assess linearity of X2 and X3:

  C(Y|X) = β0 + β1 X1 + β2 X2 + β3 X2′ + β4 X2′′ + β5 X3 + β6 X3′ + β7 X3′′

  Overall test of linearity: H0: β3 = β4 = β6 = β7 = 0, with 4 d.f.
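A minimal sketch (simulated data, hypothetical variable names) of method 5: a binary X1 plus a restricted cubic spline in X2, plotting the estimated f(X2) and testing linearity:

require(rms)
set.seed(8)
x1 <- sample(0:1, 300, TRUE)
x2 <- runif(300, 0, 10)
y  <- x1 + log(x2 + 1) + rnorm(300, sd=0.4)
dd <- datadist(x1, x2); options(datadist='dd')
f <- ols(y ~ x1 + rcs(x2, 4))
plot(Predict(f, x2))   # estimated spline transformation f(X2)
anova(f)               # the X2 "Nonlinear" row tests H0: beta3 = beta4 = 0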
2.9.2 Modeling and Testing Complex Interactions

With X1 binary or linear and X2 continuous:

  C(Y|X) = β0 + β1 X1 + β2 X2 + β3 X2′ + β4 X2′′
           + β5 X1 X2 + β6 X1 X2′ + β7 X1 X2′′

Simultaneous test of linearity and additivity: H0: β3 = ... = β7 = 0.

• 2 continuous variables: could transform separately and form a simple product
• Transformations depend on whether the interaction terms are adjusted for
• Fit interactions of the form X1 f(X2) and X2 g(X1):

  C(Y|X) = β0 + β1 X1 + β2 X1′ + β3 X1′′
           + β4 X2 + β5 X2′ + β6 X2′′
           + β7 X1 X2 + β8 X1 X2′ + β9 X1 X2′′
           + β10 X2 X1′ + β11 X2 X1′′

• Test of additivity is H0: β7 = β8 = ... = β11 = 0, with 5 d.f.
• Test of lack of fit for the simple product interaction with X2 is H0: β8 = β9 = 0
• Test of lack of fit for the simple product interaction with X1 is H0: β10 = β11 = 0

General spline surface:

• Cover the X1 × X2 plane with a grid and fit a patch-wise cubic polynomial in two variables
• Restrict it to be of the form aX1 + bX2 + cX1X2 in the corners
• Uses all (k − 1)² cross-products of restricted cubic spline terms
• See Gray [53, 54, Section 3.2] for penalized splines allowing control of the effective degrees of freedom. See Berhane et al. [12] for a good discussion of tensor splines.

Other issues:

• Y non-censored (especially continuous) → multi-dimensional scatterplot smoother [22]
• Interactions of order > 2: more trouble
• 2-way interactions among p predictors: pooled tests
• p tests, each with p − 1 d.f.

Some types of interactions to pre-specify in clinical studies (see the sketch after this list):

• Treatment × severity of disease being treated
• Age × risk factors
• Age × type of disease
• Measurement × state of a subject during measurement
• Race × disease
• Calendar time × treatment
• Quality × quantity of a symptom
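A minimal sketch (simulated data, hypothetical variable names) of a binary X1 interacting with a spline in X2, giving the structure X1 × f(X2); anova() reports the interaction and nonlinearity chunk tests (rms also provides the %ia% operator for restricted interactions):

require(rms)
set.seed(9)
x1 <- sample(0:1, 400, TRUE)
x2 <- runif(400, 0, 10)
y  <- ifelse(x1 == 1, sqrt(x2), log(x2 + 1)) + rnorm(400, sd=0.3)
dd <- datadist(x1, x2); options(datadist='dd')
f <- ols(y ~ x1 * rcs(x2, 4))
anova(f)   # rows for the x1*x2 interaction, its nonlinear part, and total nonlinearity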
2.9.3 Fitting Ordinal Predictors

• Small number of categories (3-4) → treat as a polytomous factor with dummy variables
• Design matrix for an easy test of adequacy of initial codes → the k original codes plus k − 2 dummies
• More categories → score using a data-driven trend. Later tests then use k − 1 d.f. instead of 1 d.f.
• E.g., compute logit(mortality) vs. category
2.9.4 Distributional Assumptions

• Some models (e.g., logistic): all assumptions are in C(Y|X) = Xβ (implicitly assuming no omitted variables!)
• Linear regression: Y ~ Xβ + ε, ε ~ N(0, σ^2)
• Examine the distribution of residuals
• Some models (Weibull, Cox [31]):
    C(Y|X) = C(Y = y|X) = d(y) + Xβ, with C = log hazard
• Check the form of d(y)
• Show that d(y) does not interact with X
Chapter 3  Multivariable Modeling Strategies

• "Spending d.f.": examining or fitting parameters in models, or examining tables or graphs that utilize Y to tell you how to model variables
• If you wish to preserve statistical properties, you can't retrieve d.f. once they are "spent" (see Grambsch & O'Brien)
• If a scatterplot suggests linearity and you fit a linear model, how many d.f. did you actually spend (i.e., the d.f. that when put into a formula results in accurate confidence limits or P-values)?
• Decide the number of d.f. that can be spent
• Decide where to spend them
• Spend them
3.1 Prespecification of Predictor Complexity Without Later Simplification

• Rarely expect linearity
• Can't always use graphs or other devices to choose the transformation
• If you select from among many transformations, the results are biased
• Need to allow flexible nonlinearity for potentially strong predictors not known to predict linearly
• Once you decide a predictor is "in", you can choose the number of parameters to devote to it using a general association index with Y
• Need a measure of "potential predictive punch"
• The measure needs to mask the analyst to the true form of the regression, to preserve statistical properties
3.1.1 Learning From a Saturated Model

When the effective sample size available is sufficiently large so that a saturated main effects model may be fitted, a good approach to gauging predictive potential is the following.

• Let all continuous predictors be represented as restricted cubic splines with k knots, where k is the maximum number of knots the analyst entertains for the current problem.
• Let all categorical predictors retain their original categories except for pooling of very low prevalence categories (e.g., ones containing < 6 observations).
• Fit this general main effects model.
• Compute the partial χ² statistic for testing the association of each predictor with the response, adjusted for all other predictors. In the case of ordinary regression, convert partial F statistics to χ² statistics or partial R² values.
• Make corrections for chance associations to "level the playing field" for predictors having greatly varying d.f., e.g., subtract the d.f. from the partial χ² (the expected value of a χ²_p statistic is p under H0).
• Make certain that tests of nonlinearity are not revealed, as this would bias the analyst.
• Sort the partial association statistics in descending order.

Commands in the rms package can be used to plot only what is needed. Here is an example for a logistic model.

f <- lrm(y ~ sex + race + rcs(age,5) + rcs(weight,5) +
         rcs(height,5) + rcs(blood.pressure,5))
plot(anova(f))
3.1.2 Using Marginal Generalized Rank Correlations

When collinearities or confounding are not problematic, a quicker approach based on pairwise measures of association can be useful. This approach will not have numerical problems (e.g., a singular covariance matrix) and is based on:

• a 2 d.f. generalization of Spearman ρ — the R² based on rank(X) and rank(X)² vs. rank(Y)
• ρ² can detect U-shaped relationships
• For categorical X, ρ² is the R² from dummy variables regressed against rank(Y); this is tightly related to the Wilcoxon–Mann–Whitney–Kruskal–Wallis rank test for group differences(a)
• Sort variables by descending order of ρ²
• Specify the number of knots for continuous X, and combine infrequent categories of categorical X, based on ρ²

Allocating d.f. based on partial tests of association or on sorted ρ² is a fair procedure because

• We already decided to keep the variable in the model no matter what ρ² or χ² values are seen
• ρ² and χ² do not reveal the degree of nonlinearity; a high value may be due solely to a strong linear effect
• A low ρ² or χ² for a categorical variable might lead to collapsing the most disparate categories

Initial simulations show the procedure to be conservative. Note that one can move from simpler to more complex models but not the other way round.

(a) This test statistic does not inform the analyst of which groups are different from one another.
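A minimal sketch (simulated data, hypothetical variable names) of computing and plotting the 2 d.f. generalized Spearman ρ² with Hmisc::spearman2:

require(Hmisc)
set.seed(10)
n  <- 300
x1 <- runif(n); x2 <- runif(n)
x3 <- factor(sample(letters[1:4], n, TRUE))
y  <- (x1 - 0.5)^2 + 0.3*x2 + rnorm(n, sd=0.2)   # U-shaped in x1
s  <- spearman2(y ~ x1 + x2 + x3, p=2)           # p=2: rank(X) and rank(X)^2
plot(s)   # sorted adjusted rho^2; use this to allocate knots/d.f.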
3.2 Checking Assumptions of Multiple Predictors Simultaneously

• Sometimes failure to adjust for other variables gives the wrong transformation of an X, or the wrong significance of interactions
• Sometimes it is unwieldy to deal simultaneously with all predictors at each stage → assess regression assumptions separately for each predictor
3.3 Variable Selection

• Series of potential predictors with no prior knowledge
• ↑ exploration → ↑ shrinkage (overfitting)
• Summary of the problem: E(β̂ | β̂ "significant") ≠ β [24]
• Biased R², β̂, and standard errors; P-values too small
• F and χ² statistics do not have the claimed distribution [52]
• Will result in residual confounding if variable selection is used to find confounders [56]
• Derksen and Keselman [36] found that in stepwise analyses the final model represented noise 0.20–0.74 of the time, and the final model usually contained fewer than 1/2 of the actual number of authentic predictors. Also:
  1. "The degree of correlation between the predictor variables affected the frequency with which authentic predictor variables found their way into the final model.
  2. The number of candidate predictor variables affected the number of noise variables that gained entry to the model.
  3. The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model.
  4. The population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted by the total number of candidate predictor variables rather than the number of variables in the final model."
• Global test with p d.f. insignificant → stop

Variable selection methods [57]:

• Forward selection, backward elimination
• Stopping rule: "residual χ²" with d.f. = number of candidates remaining at the current step
• Test for significance, or use Akaike's information criterion (AIC [5]), here χ² − 2 × d.f.
• Better to use subject matter knowledge!
• No currently available stopping rule was developed for stepwise selection; they were developed only for comparing 2 pre-specified models [16, Section 1.3]
• Roecker [95] studied forward selection (FS), all possible subsets selection (APS), and full fits
• APS is more likely to select smaller, less accurate models than FS
• Neither is as accurate as the full model fit unless more than 1/2 of the candidate variables are redundant or unnecessary
• Step-down is usually better than forward selection [80] and can be used efficiently with maximum likelihood estimation [74]
• It is fruitless to try different stepwise methods to look for agreement [120]
• The bootstrap can help decide between the full and a reduced model
• Full model fits give meaningful confidence intervals with standard formulas; confidence intervals after stepwise selection do not [3, 16, 67]
• Data reduction (grouping variables) can help
• Using the bootstrap to select important variables for inclusion in the final model [98] is problematic [6]
• It is not logical that a population regression coefficient would be exactly zero just because its estimate was "insignificant"
3.4 Overfitting and Limits on Number of Predictors

• Concerned with avoiding overfitting
• Assume the typical problem in medicine, epidemiology, and the social sciences, in which the signal:noise ratio is small (higher ratios allow for more aggressive modeling)
• p should be < m/15 [58, 59, 88, 89, 101, 114]
• p = number of parameters in the full model, or the number of candidate parameters in a stepwise analysis

Table 3.1: Limiting Sample Sizes for Various Response Variables

  Type of Response Variable    Limiting Sample Size m
  Continuous                   n (total sample size)
  Binary                       min(n1, n2) (a)
  Ordinal (k categories)       n − (1/n²) Σ_{i=1}^{k} n_i³ (b)
  Failure (survival) time      number of failures (c)

• A narrowly distributed predictor → even higher n needed
• p includes all variables screened for association with the response, including interactions
• Univariable screening (graphs, crosstabs, etc.) in no way reduces the multiple comparison problems of model building [109]

(a) If one considers the power of a two-sample binomial test compared with a Wilcoxon test if the response could be made continuous and the proportional odds assumption holds, the effective sample size for a binary response is 3 n1 n2 / n ≈ 3 min(n1, n2) if n1/n is near 0 or 1 [119, Eq. 10, 15]. Here n1 and n2 are the marginal frequencies of the two response levels [89].
(b) Based on the power of a proportional odds model two-sample test when the marginal cell sizes for the response are n1, ..., nk, compared with all cell sizes equal to unity (response is continuous) [119, Eq. 3]. If all cell sizes are equal, the relative efficiency of having k response categories compared to a continuous response is 1 − 1/k² [119, Eq. 14]; e.g., a 5-level response is almost as efficient as a continuous one if proportional odds holds across category cutoffs.
(c) This is approximate, as the effective sample size may sometimes be boosted somewhat by censored observations, especially for non-proportional hazards methods such as Wilcoxon-type tests [11].
3.5 Shrinkage

• Slope of the calibration plot; regression to the mean
• Statistical estimation procedure — "pre-shrunk" models
• Aren't regression coefficients OK because they're unbiased?
• The problem is in how we use the coefficient estimates
• Consider 20 samples of size n = 50 from U(0, 1)
• Compute the group means and plot them in ascending order
• This is equivalent to fitting an intercept and 19 dummies using least squares
• The result generalizes to general problems in plotting Ŷ vs. Xβ̂

set.seed(123)
n <- 50
y <- runif(20*n)
group <- rep(1:20, each=n)
ybar <- tapply(y, group, mean)
ybar <- sort(ybar)
plot(1:20, ybar, type='n', axes=FALSE, ylim=c(.3,.7),
     xlab='Group', ylab='Group Mean')
lines(1:20, ybar)
points(1:20, ybar, pch=20, cex=.5)
axis(2)
axis(1, at=1:20, labels=FALSE)
for(j in 1:20) axis(1, at=j, labels=names(ybar)[j])
abline(h=.5, col=gray(.85))

[Figure 3.1: Sorted means from 20 samples of size 50 from a uniform [0,1] distribution. The reference line at 0.5 depicts the true population value of all of the means.]

• Prevent shrinkage by using pre-shrinkage
• Spiegelhalter [103]: variable selection is arbitrary; better prediction usually results from fitting all candidate variables and using shrinkage
• Shrinkage is closer to that expected from the full model fit than to that based on the number of significant variables [29]
• Ridge regression [75, 111]
• Penalized MLE [53, 61, 112]
• Heuristic shrinkage parameter of van Houwelingen and le Cessie [111, Eq. 77]:

    γ̂ = (model χ² − p) / model χ²

• For OLS: γ̂ = [(n − p − 1)/(n − 1)] × R²_adj / R², where R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1)
• p is close to the number of candidate variables
• Copas [29, Eq. 8.5] adds 2 to the numerator
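A minimal sketch (simulated data) of computing the heuristic shrinkage estimate from a logistic fit; the stats element names 'Model L.R.' and 'd.f.' are assumptions about the lrm fit object:

require(rms)
set.seed(11)
n <- 200
X <- matrix(rnorm(n*10), ncol=10)
d <- data.frame(y = rbinom(n, 1, plogis(0.5*X[,1] - 0.4*X[,2])), X)
f <- lrm(y ~ ., data=d)
lr    <- f$stats['Model L.R.']
p     <- f$stats['d.f.']
gamma <- (lr - p) / lr
gamma   # values well below 1 indicate substantial expected overfitting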
3.6 Collinearity

• Occurs when at least 1 predictor can be predicted well from the others
• Can be a blessing (data reduction, transformations)
• ↑ s.e. of β̂, ↓ power
• This is appropriate → asking too much of the data [25, p. 173]
• Variables compete in variable selection, and the chosen one is arbitrary
• Does not affect the joint influence of a set of highly correlated variables (use multiple d.f. tests)
• Does not at all affect predictions on the model construction sample
• Does not affect predictions on new data [85, pp. 379-381] if
  1. Extreme extrapolation is not attempted
  2. The new data have the same type of collinearities as the original data
• Example: LDL and total cholesterol — a problem only if more inconsistent in new data
• Example: age and age² — no problem
• One way to quantify collinearity for each predictor: variance inflation factors (VIF)
• General approach (maximum likelihood) — transform the information matrix to correlation form; VIF = diagonal of its inverse [35, 118]
• See Belsley [9, pp. 28-30] for problems with VIF
• Easy approach: SAS PROC VARCLUS [97], the S varclus function, or other clustering techniques: group highly correlated variables
• Can score each group (e.g., first principal component, PC1 [34]); summary scores are not collinear
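A minimal sketch (simulated data, hypothetical variable names) of computing variance inflation factors from a fit with rms::vif:

require(rms)
set.seed(12)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd=0.2)     # nearly collinear with x1
x3 <- rnorm(n)
y  <- x1 + x3 + rnorm(n)
f  <- ols(y ~ x1 + x2 + x3)
vif(f)    # large values for x1 and x2 flag the collinearity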
3.7 Data Reduction

• Unless n >> p, the model is unlikely to validate
• Data reduction: ↓ p
• Use the literature to eliminate unimportant variables
• Eliminate variables whose distributions are too narrow
• Eliminate candidate predictors that are missing in a large number of subjects, especially if those same predictors are likely to be missing for future applications of the model
• Use a statistical data reduction method such as incomplete principal components regression, nonlinear generalizations of principal components such as principal surfaces, sliced inverse regression, variable clustering, or ordinary cluster analysis on a measure of similarity between variables
3.7.1 Redundancy Analysis

• Remove variables that have poor distributions
  – E.g., categorical variables with fewer than 2 categories having at least 20 observations
• Use flexible parametric additive models to determine how well each variable can be predicted from the remaining variables
• Variables are dropped in a stepwise fashion, removing the most predictable variable at each step
• The remaining variables are used to predict
• The process continues until no variable still in the list of predictors can be predicted with an R² or adjusted R² greater than a specified threshold, or until dropping the variable with the highest R² (adjusted or ordinary) would cause a variable that was dropped earlier to no longer be predicted at the threshold from the now smaller list of predictors
• R/S function redun in the Hmisc package
• Related to principal variables [82] but faster
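A minimal sketch (simulated data, hypothetical variable names) of redundancy analysis with Hmisc::redun; the R² cutoff shown is an illustrative assumption:

require(Hmisc)
set.seed(13)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd=0.2)   # nearly redundant given x1 and x2
x4 <- runif(n)
redun(~ x1 + x2 + x3 + x4, r2=0.8)   # flags x3 as predictable from the others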
3.7.2 Variable Clustering

• Goal: separate variables into groups
  – variables within a group are correlated with each other
  – variables are not correlated with non-group members
• Score each dimension, and stop trying to separate the effects of factors measuring the same phenomenon
• Variable clustering [34, 97] (oblique-rotation PC analysis) → separate variables so that the first PC is representative of each group
• Can also do hierarchical cluster analysis on a similarity matrix based on squared Spearman or Pearson correlations, or more generally, Hoeffding's D [65].
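A minimal sketch (simulated data, hypothetical variable names) of variable clustering with Hmisc::varclus using squared Spearman correlations as the similarity measure:

require(Hmisc)
set.seed(14)
n  <- 300
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd=0.5)     # x1, x2 form one cluster
x3 <- rnorm(n); x4 <- x3 + rnorm(n, sd=0.5)     # x3, x4 form another
v  <- varclus(~ x1 + x2 + x3 + x4, similarity='spearman')
plot(v)   # dendrogram of variable clusters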
3.7.3 Transformation and Scaling Variables Without Using Y

• Reduce p by estimating transformations using associations with other predictors
• Purely categorical predictors — correspondence analysis [26, 33, 55, 76, 83]
• Mixture of qualitative and continuous variables: qualitative principal components
• Maximum total variance (MTV) of Young, Takane, and de Leeuw [83, 126]:
  1. Compute PC1 of the variables using the correlation matrix
  2. Use regression (with splines, dummies, etc.) to predict PC1 from each X — expand each Xj and regress it separately on PC1 to get working transformations
  3. Recompute PC1 on the transformed Xs
  4. Repeat 3-4 times until the variation explained by PC1 plateaus and the transformations stabilize
• Maximum generalized variance (MGV) method of Sarle [72, pp. 1267-1268]:
  1. Predict each variable from (current transformations of) all other variables
  2. For each variable, expand it into linear and nonlinear terms or dummies, and compute the first canonical variate
  3. For example, if there are only two variables X1 and X2 represented as quadratic polynomials, solve for a, b, c, d such that aX1 + bX1² has maximum correlation with cX2 + dX2²
  4. The goal is to transform each variable so that it is most similar to predictions from the other transformed variables
  5. Does not rely on PCs or variable clustering
• MTV (PC-based instead of canonical variates) and MGV are implemented in SAS PROC PRINQUAL [72]:
  1. Allows flexible transformations, including monotonic splines
  2. Does not allow restricted cubic splines, so it may be unstable unless monotonicity is assumed
  3. Allows simultaneous imputation but often yields wild estimates
3.7.4 Simultaneous Transformation and Imputation

S transcan function for data reduction & imputation:
• Initialize missings to medians (or the most frequent category)
• Initialize transformations to the original variables
• Take each variable in turn as Y
• Exclude observations missing on Y
• Expand Y (spline or dummy variables)
• Score (transform) Y using the first canonical variate
• Missing Y → predict the canonical variate from the Xs
• The imputed values can optionally be shrunk to avoid overfitting for small n or large p
• Constrain imputed values to be in the range of non-imputed ones
• Imputations on the original scale
  1. Continuous → back-solve with linear interpolation
  2. Categorical → classification tree (most frequent category) or match to the category whose canonical score is closest to the one predicted
• Multiple imputation: bootstrap or approximate Bayesian bootstrap
  1. Sample residuals multiple times (default M = 5)
  2. They are on the "optimally" transformed scale
  3. Back-transform
  4. fit.mult.impute works with aregImpute and transcan output to easily get imputation-corrected variances and an average β̂
• Option to insert constants as imputed values (ignored during transformation estimation); helpful when a lab value may be missing because the patient returned to normal
• Imputations and transformed values may be easily obtained for new data
• The S function Function will create a series of S functions that transform each predictor
• Example: n = 415 acutely ill patients
  1. Relate heart rate to mean arterial blood pressure
  2. Two blood pressures missing
  3. Heart rate not monotonically related to blood pressure
  4. See Figure 3.2
require(Hmisc)
getHdata(support)            # Get data frame from web site
heart.rate     <- support$hrt
blood.pressure <- support$meanbp
blood.pressure[400:401]

Mean Arterial Blood Pressure Day 3
[1] 151 136

blood.pressure[400:401] <- NA        # Create two missings
d <- data.frame(heart.rate, blood.pressure)
par(pch=46)
w <- transcan(~ heart.rate + blood.pressure, transformed=TRUE,
              imputed=TRUE, show.na=TRUE, data=d)

Convergence criterion: 2.901 0.035 0.007
Convergence in 4 iterations
R2 achieved in predicting each variable:

    heart.rate blood.pressure
         0.259          0.259

Adjusted R2:

    heart.rate blood.pressure
         0.254          0.253

w$imputed$blood.pressure

     400      401
132.4057 109.7741

plot(heart.rate, blood.pressure)
t <- w$transformed
plot(t[,'heart.rate'], t[,'blood.pressure'],
     xlab='Transformed hr', ylab='Transformed bp')
spe <- round(c(spearman(heart.rate, blood.pressure),
               spearman(t[,'heart.rate'], t[,'blood.pressure'])), 2)
ACE (Alternating Conditional Expectation) of Breiman and Friedman 17
  1. Uses the nonparametric "super smoother" 48
  2. Allows monotonicity constraints and categorical variables
  3. Does not handle missing data
• These methods find marginal transformations
• Check the adequacy of transformations using Y
  1. Graphical
  2. Nonparametric smoothers (X vs. Y)
Figure 3.2: Transformations fitted using transcan. Tick marks indicate the two imputed values for blood pressure. The lower left plot contains raw data (Spearman ρ = −0.02); the lower right is a scatterplot of the corresponding transformed values (ρ = −0.13). Data courtesy of the SUPPORT study 70.
  3. Expand the original variable using a spline; test the additional predictive information over the original transformation
3.7.5 Simple Scoring of Variable Clusters

• Try to score groups of transformed variables with PC1
• Reduces d.f. by pre-transforming variables and by combining multiple variables
• Later you may want to break a group apart, but delete all variables in groups whose summary scores do not add significant information
• Sometimes simplify a cluster score by finding a subset of its constituent variables which predict it with high R².

Series of dichotomous variables:
• Construct X1 = 0-1 according to whether any variable is positive
• Construct X2 = number of positives
• Test whether the original variables add to X1 or X2
3.7.6 Simplifying Cluster Scores

3.7.7 How Much Data Reduction Is Necessary?

Using expected shrinkage to guide data reduction:
• Fit the full model with all candidates, p d.f., LR = likelihood ratio χ²
• Compute γ̂
• If γ̂ < 0.9, consider a shrunken estimator from the whole model, or data reduction (again not using Y)
• Let q = regression d.f. for the reduced model
• Assume the best case: the discarded dimensions had no association with Y
• Expected loss in LR is p − q
• New shrinkage: [LR − (p − q) − q] / [LR − (p − q)]
• Requiring the new shrinkage to be ≥ 0.9 and solving for q → q ≤ (LR − p)/9
• Under these assumptions, there is no hope unless the original LR > p + 9
• No χ² lost by dimension reduction → q ≤ LR/10

Example:
• Binary logistic model, 45 events on 150 subjects
• 10:1 rule → analyze 4.5 d.f. total
• Analyst wishes to include age, sex, and 10 others
• Not known if age is linear or if age and sex are additive
• 4 knots → 3 + 1 + 1 d.f. for age and sex if the interaction is restricted to be linear
• Full model with 15 d.f. has LR = 50
• Expected shrinkage factor (50 − 15)/50 = 0.7
• LR > 15 + 9 = 24 → reduction may help
• Reduction to q = (50 − 15)/9 ≈ 4 d.f. necessary
• Have to assume age is linear and reduce the other 10 variables to 1 d.f.
• Separate hypothesis tests intended → use the full model, adjust for multiple comparisons
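A minimal sketch of the arithmetic in this example (values taken from the bullets above):

LR <- 50; p <- 15
(LR - p)/LR        # expected shrinkage factor: 0.7
(LR - p)/9         # largest q (about 3.9 d.f.) keeping expected shrinkage >= 0.9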
Summary of Some Data Reduction Methods

Goal: Group predictors so that each group represents a single dimension that can be summarized with a single score
  Reasons: ↓ d.f. arising from multiple predictors; make PC1 a more reasonable summary
  Methods: Variable clustering (subject matter knowledge; group predictors to maximize the proportion of variance explained by PC1 of each group; hierarchical clustering using a matrix of similarity measures between predictors)

Goal: Transform predictors
  Reasons: ↓ d.f. due to nonlinear and dummy variable components; allows predictors to be optimally combined; makes PC1 a more reasonable summary; use in a customized model for imputing missing values on each predictor
  Methods: Maximum total variance on a group of related predictors; canonical variates on the total set of predictors

Goal: Score a group of predictors
  Reasons: ↓ d.f. for the group to unity
  Methods: PC1; simple point scores

Goal: Multiple dimensional scoring of all predictors
  Reasons: ↓ d.f. for all predictors combined
  Methods: Principal components 1, 2, ..., k (k < p) computed from all transformed predictors
3.8 Overly Influential Observations

• Every observation should influence the fit
• Major results should not rest on 1 or 2 observations
• Overly influential observations → ↑ variance of predictions
• Also affects variable selection

Reasons for influence:
• Too few observations for the complexity of the model (see Sections 3.7, 3.3)
• Data transcription or entry errors
• Extreme values of a predictor
  1. Sometimes the subject is so atypical it should be removed from the dataset
  2. Sometimes truncate measurements where the data density ends
  3. Example: n = 4000, 2000 deaths, white blood count range 500-100,000; .05, .95 quantiles = 2755, 26700
  4. Linear spline function fit
  5. Sensitive to WBC > 60000 (n = 16)
  6. Predictions stable if WBC truncated to 40000 (n = 46 above 40000)
• Disagreements between predictors and response: ignore unless there are extreme values or another explanation
• Example: n = 8000, one extreme predictor value not on the straight-line relationship with the other (X, Y) → χ² = 36 for H0: linearity
Statistical measures:
• Leverage: capacity to be influential (not necessarily influential). Diagonals of the "hat matrix" H = X(X′X)⁻¹X′ measure how an observation predicts its own response 10
• hii > 2(p + 1)/n may signal a high-leverage point 10
• DFBETAS: change in β̂ upon deletion of each observation, scaled by s.e.
• DFFIT: change in Xβ̂ upon deletion of each observation
• DFFITS: DFFIT standardized by the s.e. of β̂
• Some classify an observation as overly influential when |DFFITS| > 2√((p + 1)/(n − p − 1)) 10
• Others examine the entire distribution for "outliers"
• No substitute for careful examination of the data 23,102
• Maximum likelihood estimation requires 1-step approximations
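A minimal sketch of screening for overly influential observations with the rms which.influence function (hypothetical model; the fit must store the design matrix and response via x=TRUE, y=TRUE):

require(rms)
f <- ols(y ~ x1 + x2, data=d, x=TRUE, y=TRUE)
w <- which.influence(f, cutoff=0.3)   # observations changing any coefficient
show.influence(w, d)                  # by more than 0.3 s.e. (DFBETAS units)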
3.9 Comparing Two Models

• Level playing field (independent datasets, same number of candidate d.f., careful bootstrapping)
• Criteria:
  1. calibration
  2. discrimination
  3. face validity
  4. measurement errors in required predictors
  5. use of continuous predictors (which are usually better defined than categorical ones)
  6. omission of "insignificant" variables that nonetheless make sense as risk factors
  7. simplicity (though this is less important with the availability of computers)
  8. lack of fit for specific types of subjects
• If the goal is to rank-order: ignore calibration
• Otherwise, dismiss a model having poor calibration
• Good calibration → compare discrimination (e.g., R² 86, model χ², Somers' Dxy, Spearman's ρ, area under the ROC curve)
• Worthwhile to compare models on a measure not used to optimize either model, e.g., mean absolute error or median absolute error if using OLS
• Rank measures may not give enough credit to extreme predictions → model χ², R², examine extremes of the distribution of Ŷ
• Examine differences in predicted values from the two models
• See 90, 91 for discussions and examples of low power for testing differences in ROC areas.
3.10 Summary: Possible Modeling Strategies

Greenland 56 discusses many important points:
• Stepwise variable selection on confounders leaves important confounders uncontrolled
• Shrinkage is far superior to variable selection
• Variable selection does more damage to confidence interval widths than to point estimates
• Claims about unbiasedness of ordinary MLEs are misleading because they assume the model is correct and is the only model entertained
• "models need to be complex to capture uncertainty about the relations ... an honest uncertainty assessment requires parameters for all effects that we know may be present. This advice is implicit in an antiparsimony principle often attributed to L.J. Savage 'All models should be as big as an elephant' (see Draper, 1995)"
Global strategies:
• Use a method known not to work well (e.g., stepwise variable selection without penalization; recursive partitioning), document how poorly the model performs (e.g., using the bootstrap), and use the model anyway
• Develop a black-box model that performs poorly and is difficult to interpret (e.g., does not incorporate penalization)
• Develop a black-box model that performs well and is difficult to interpret
• Develop interpretable approximations to the black box
• Develop an interpretable model (e.g., give priority to additive effects) that performs well and is likely to perform equally well on future data from the same stream
Preferred strategy in a nutshell:
• Decide how many d.f. can be spent
• Decide where to spend them
• Spend them
• Don't reconsider, especially if inference is needed
3.10.1 Developing Predictive Models

1. Assemble accurate, pertinent data and lots of it, with wide distributions for X.
2. Formulate good hypotheses: specify relevant candidate predictors and possible interactions. Don't use Y to decide which X's to include.
3. Characterize subjects with missing Y. Delete such subjects only in rare circumstances 32. For certain models it is effective to multiply impute Y.
4. Characterize and impute missing X. In most cases use multiple imputation based on X and Y.
5. For each predictor specify the complexity or degree of nonlinearity that should be allowed (more for important predictors or for large n) (Section 3.1).
6. Do data reduction if needed (pre-transformations, combinations), or use penalized estimation 61.
7. Use the entire sample in model development.
8. Can do highly structured testing to simplify the "initial" model:
   (a) Test the entire group of predictors with a single P-value
   (b) Make each continuous predictor have the same number of knots, and select the number that optimizes AIC
   (c) Test the combined effects of all nonlinear terms with a single P-value
9. Make tests of linearity of effects in the model only to demonstrate to others that such effects are often statistically significant. Don't remove individual insignificant effects from the model.
10. Check additivity assumptions by testing pre-specified interaction terms. Use a global test and either keep all or delete all interactions.
11. Check to see if there are overly influential observations.
12. Check distributional assumptions and choose a different model if needed.
13. Do limited backwards step-down variable selection if parsimony is more important than accuracy 103. But confidence limits, etc., must account for variable selection (e.g., bootstrap).
14. This is the "final" model.
15. Interpret the model graphically and by computing predicted values and appropriate test statistics. Compute pooled tests of association for collinear predictors.
16. Validate this model for calibration and discrimination ability, preferably using bootstrapping.
17. Shrink parameter estimates if there is overfitting but no further data reduction is desired (unless shrinkage is built into the estimation).
18. When missing values were imputed, adjust the final variance-covariance matrix for imputation. Do this as early as possible because it will affect other findings.
19. When all steps of the modeling strategy can be automated, consider using Faraway's method 45 to penalize for the randomness inherent in the multiple steps.
20. Develop simplifications to the final model as needed.
3.10.2 Developing Models for Effect Estimation

1. Less need for parsimony; even less need to remove insignificant variables from the model (otherwise CLs too narrow)
2. Careful consideration of interactions; their inclusion forces estimates to be conditional and raises variances
3. If the variable of interest is mostly the one that is missing, multiple imputation is less valuable
4. Complexity of the main variable is specified by prior beliefs; a compromise between variance and bias
5. Don't penalize terms for the variable of interest
6. Model validation less necessary
3.10.3 Developing Models for Hypothesis Testing

1. Virtually the same as the previous strategy
2. Interactions require tests of effect at varying values of another variable, or "main effect + interaction" joint tests (e.g., is treatment effective for either sex, allowing the effects to be different)
3. Validation may help quantify overadjustment
Chapter 4

Describing, Resampling, Validating, and Simplifying the Model

4.1 Describing the Fitted Model

4.1.1 Interpreting Effects
• Regression coefficients, if 1 d.f. per factor and no interactions
• Not standardized regression coefficients
• Many programs print meaningless estimates such as the effect of increasing age² by one unit, holding age constant
• Need to account for nonlinearity and interaction, and use meaningful ranges
• For monotonic relationships, estimate Xβ̂ at quartiles of continuous variables, separately for various levels of interacting factors
• Subtract estimates, anti-log, e.g., to get inter-quartile-range odds or hazards ratios. Base C.L. on the s.e. of the difference.
• Plot the effect of each predictor on Xβ̂ or some transformation of Xβ̂. See also 69.
• Nomogram
• Use a regression tree to approximate the full model
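For the inter-quartile-range effects bullet above, a minimal sketch using the rms summary function on a hypothetical logistic fit (a datadist must be set so the quartiles are known; see Chapter 5 for the full worked example):

require(rms)
dd <- datadist(d);  options(datadist='dd')
f <- lrm(y ~ rcs(age,4) + sex, data=d)
summary(f)          # default: inter-quartile-range effects, shown as odds ratios
plot(summary(f))    # graphical display with confidence limits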
4.1.2 Indexes of Model Performance

Error measures:
• Central tendency of prediction errors
  – Mean absolute prediction error: mean |Y − Ŷ|
  – Mean squared prediction error
    * Binary Y: Brier score (quadratic proper scoring rule)
  – Logarithmic proper scoring rule (average log-likelihood)
• Discrimination measures
  – Pure discrimination: rank correlation of (Ŷ, Y)
    * Spearman ρ, Kendall τ, Somers' Dxy
    * Y binary → Dxy = 2 × (C − ½), where C = concordance probability = area under the receiver operating characteristic curve ∝ Wilcoxon-Mann-Whitney statistic
  – Mostly discrimination: R²
    * R²adj: overfitting-corrected if the model was pre-specified
  – The Brier score can be decomposed into discrimination and calibration components
  – Discrimination measures based on variation in Ŷ
    * regression sum of squares
    * g-index
• Calibration measures
  – calibration-in-the-large: average Ŷ vs. average Y
  – high-resolution calibration curve (calibration-in-the-small)
  – calibration slope and intercept
  – maximum absolute calibration error
  – mean absolute calibration error
  – 0.9 quantile of calibration error
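A minimal sketch relating some of the indexes above, using the Hmisc somers2 function on hypothetical predicted probabilities p and a 0/1 outcome y:

require(Hmisc)
somers2(p, y)      # returns C (ROC area), Dxy = 2*(C - 0.5), n, Missing
mean((p - y)^2)    # Brier score (mean squared prediction error)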
g-Index

• Based on Gini's mean difference
  – mean over all possible i ≠ j of |Zi − Zj|
  – an interpretable, robust, highly efficient measure of variation
• g = Gini's mean difference of Xiβ̂ = Ŷ
• Example: Y = systolic blood pressure; g = 11 mmHg is the typical difference in Ŷ
• Independent of censoring, etc.
• For models in which the anti-log of a difference in Ŷ represents a meaningful ratio (odds ratios, hazard ratios, ratio of medians): gr = exp(g)
• For models in which Ŷ can be turned into a probability estimate (e.g., logistic regression): gp = Gini's mean difference of P̂
• These g-indexes represent, e.g., "typical" odds ratios, "typical" risk differences
• Can define a partial g
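A minimal sketch of the g-index computed directly with Hmisc's GiniMd, assuming lp holds the linear predictor Xβ̂ from a fitted logistic model:

require(Hmisc)
g <- GiniMd(lp)       # Gini's mean difference of the linear predictor
exp(g)                # g_r: a "typical" odds ratio
GiniMd(plogis(lp))    # g_p: on the probability scale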
4.2 The Bootstrap

• If we know the population model, we can use simulation or analytic derivations to study the behavior of a statistical estimator
• Suppose Y has cumulative distribution function F(y) = Prob{Y ≤ y}
• We have a sample of size n from F(y): Y1, Y2, ..., Yn
• Steps:
  1. Repeatedly simulate samples of size n from F
  2. Compute the statistic of interest
  3. Study its behavior over B repetitions
• Example: 1000 samples, 1000 sample medians, compute their sample variance
• F unknown → estimate by the empirical distribution function
  Fn(y) = (1/n) Σ_{i=1}^{n} I(Yi ≤ y),
  where I(w) is 1 if w is true, 0 otherwise.
• Example: a sample of size n = 30 from a normal distribution with mean 100 and SD 10
set.seed(6)
x   <- rnorm(30, 100, 20)
xs  <- seq(50, 150, length=150)
cdf <- pnorm(xs, 100, 20)
plot(xs, cdf, type='l', ylim=c(0,1),
     xlab=expression(x),
     ylab=expression(paste("Prob[", X <= x, "]")))
lines(ecdf(x), cex=.5)
• Fn corresponds to a density function placing probability 1/n at each observed data point (k/n if a point is duplicated k times)
• Pretend that F ≡ Fn
• Sampling from Fn ≡ sampling with replacement from the observed data Y1, ..., Yn
Figure 4.1: Empirical and population cumulative distribution functions
• Large n → 1 − e⁻¹ ≈ 0.632 of the original data points are selected at least once in each bootstrap sample
• Some observations are not selected; others are selected more than once
• Efron's bootstrap → a general-purpose technique for estimating properties of estimators without assuming or knowing the distribution of the data, F
• Take B samples of size n with replacement; choose B so that the summary measure of the individual statistics ≈ the summary if B = ∞
• The bootstrap is based on the distribution of observed differences between a resampled parameter estimate and the original estimate, telling us about the distribution of unobservable differences between the original estimate and the unknown parameter
Example: for the data (1, 5, 6, 7, 8, 9), obtain a 0.80 confidence interval for the population median, and an estimate of the population expected value of the sample median (the latter only to estimate the bias in the original estimate of the median).
options(digits=3)
y <- c(2,5,6,7,8,9,10,11,12,13,14,19,20,21)
y <- c(1,5,6,7,8,9)
set.seed(17)
n   <- length(y)
n2  <- n/2
n21 <- n2 + 1
B   <- 400
M   <- double(B)
plot(0, 0, xlim=c(0,B), ylim=c(3,9),
     xlab="Bootstrap Samples Used",
     ylab="Mean and 0.1, 0.9 Quantiles", type="n")
for(i in 1:B) {
  s <- sample(1:n, n, replace=TRUE)
  x <- sort(y[s])
  m <- .5*(x[n2] + x[n21])
  M[i] <- m
  if(i <= 20) {
    w <- as.character(x)
    cat(w, "& &", sprintf('%.1f', m),
        if(i < 20) "\\\\\n" else "\\\\\\hline\n",
        file='~/doc/rms/validate/tab.tex', append=i > 1)
  }
  points(i, mean(M[1:i]), pch=46)
  if(i >= 10) {
    q <- quantile(M[1:i], c(.1, .9))
    points(i, q[1], pch=46, col='blue')
    points(i, q[2], pch=46, col='blue')
  }
}
table(M)

M           1    3  3.5    4  4.5    5  5.5    6  6.5    7  7.5    8  8.5    9
Frequency   6   10    7    8   22    3   43   75   59   66   47   42   11    1

hist(M, nclass=length(unique(M)), xlab="", main="")
First 20 samples:

Bootstrap Sample   Sample Median
1 6 6 7 8 9        6.5
1 5 5 5 6 8        5.0
5 7 8 9 9 9        8.5
7 7 7 8 8 9        7.5
1 5 7 7 9 9        7.0
1 5 6 6 7 8        6.0
7 8 8 8 8 8        8.0
5 5 5 7 9 9        6.0
1 5 5 7 7 9        6.0
1 5 5 7 7 8        6.0
1 1 5 5 7 7        5.0
1 1 5 5 7 8        5.0
1 5 5 7 7 8        6.0
1 5 6 7 8 8        6.5
1 5 6 7 9 9        6.5
6 6 7 7 8 9        7.0
1 5 7 8 8 9        7.5
6 6 8 9 9 9        8.5
1 1 5 5 6 9        5.0
1 6 8 9 9 9        8.5
• The histogram tells us whether we can assume normality for the bootstrap medians or need to use quantiles of the medians to construct the C.L.
Figure 4.2: Estimating properties of the sample median using the bootstrap

• Need high B for quantiles, low B for variance (but see [14])
4.3 Model Validation

4.3.1 Introduction

• External validation (best: another country at another time); also validates sampling and measurements
• Internal
  – apparent (evaluate fit on the same data used to create the fit)
  – data splitting
  – cross-validation
  – bootstrap: get an overfitting-corrected accuracy index
• The best way to make a model fit the data well is to discard much of the data
• Predictions on another dataset will be inaccurate
• Need an unbiased assessment of predictive accuracy
4.3.2 Which Quantities Should Be Used in Validation?

• OLS: R² is one good measure for quantifying the drop-off in predictive ability
• Example: n = 10, p = 9; apparent R² = 1 but R² will be close to zero on new subjects
• Example: n = 20, p = 10; apparent R² = 0.9, R² on new data 0.7, R²adj = 0.79
• Adjusted R² solves much of the bias problem, assuming p in its formula is the largest number of parameters ever examined against Y
• Few other adjusted indexes exist
• Also need to validate models with phantom d.f.
• Cross-validation or the bootstrap can provide an unbiased estimate of any index; the bootstrap has higher precision
• Two main types of quantities to validate:
  1. Calibration or reliability: ability to make unbiased estimates of the response (Ŷ vs. Y)
  2. Discrimination: ability to separate responses.
     OLS: R², g-index; binary logistic model: ROC area, equivalent to the rank correlation between the predicted probability of the event and the 0/1 event
• Unbiased validation is nearly always necessary, to detect overfitting
Data-S
plitting
�Splitdata
into
trainingandtest
sets
�Interestingto
compare
indexof
accuracy
intrain-
ingandtest
�Freezeparametersfrom
training
CHAPTER
4.
DESCRIB
ING,RESAMPLIN
G,VALID
ATIN
G,AND
SIM
PLIF
YIN
GTHE
MODEL
111
�Makesure
youallow
R2=
1−
SSE/S
ST
for
test
sampleto
be<
0
�Don’t
compute
ordinary
R2on
Xβ
vs.Y;this
allowsforlinearrecalibration
aXβ+bvs.Y
�Testsamplemustbelargeenough
toobtain
very
accurate
assessmentof
accuracy
�Trainingsampleiswhat’sleft
�Example:
overallsam
plen=300,training
sample
n=200,
developmodel,freeze
β,predicton
test
sample(n
=100),R2=1−
∑
(Yi−
Xiβ)2
∑
(Yi−
Y)2
.
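A minimal sketch of the calculation above (hypothetical data frame d with 300 rows, response y and predictors x1, x2):

train <- d[1:200, ];  test <- d[201:300, ]
f    <- lm(y ~ x1 + x2, data=train)      # freeze beta-hat from training
pred <- predict(f, newdata=test)
1 - sum((test$y - pred)^2) / sum((test$y - mean(test$y))^2)   # test-sample R2; may be < 0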
• Disadvantages of data splitting:
  1. Costly in ↓ n 16,95
  2. Requires the decision to split at the beginning of the analysis
  3. Requires a larger held-out sample than cross-validation
  4. Results vary if split again
  5. Does not validate the final model (from the recombined data)
  6. Not helpful in getting CLs corrected for variable selection
4.3.4 Improvements on Data-Splitting: Resampling

• No sacrifice in sample size
• Works when the modeling process is automated
• Bootstrap excellent for studying the arbitrariness of variable selection 98
• Cross-validation solves many problems of data splitting 40,100,111,123
• Example of cross-validation:
  1. Split the data at random into 10 tenths
  2. Leave out 1/10 of the data at a time
  3. Develop the model on the remaining 9/10, including any variable selection, pre-testing, etc.
  4. Freeze the coefficients, evaluate on the held-out 1/10
  5. Average R² over the 10 repetitions
• Drawbacks:
  1. Choice of the number of groups and repetitions
  2. Doesn't show the full variability of variable selection
  3. Does not validate the full model
  4. Lower precision than the bootstrap
  5. Need to do 50 repeats of 10-fold cross-validation to ensure adequate precision
• Randomization method
  1. Randomly permute Y
  2. Optimism = performance of the fitted model compared to what is expected by chance
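A minimal sketch, assuming an rms fit stored with x=TRUE, y=TRUE: the rms validate function implements both the cross-validation and the randomization approaches just described:

require(rms)
f <- ols(y ~ x1 + x2, data=d, x=TRUE, y=TRUE)
validate(f, method='crossvalidation', B=10)   # one 10-fold cross-validation
validate(f, method='randomization',   B=40)   # permute Y to estimate optimism expected by chance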
4.3.5 Validation Using the Bootstrap

• Estimate the optimism of the final whole-sample fit without holding out data
• From the original X and Y, select a sample of size n with replacement
• Derive the model from the bootstrap sample
• Apply it to the original sample
• The simple bootstrap uses the average of indexes computed on the original sample
• Estimated optimism = difference in indexes
• Repeat about B = 100 times, get the average expected optimism
• Subtract the average optimism from the apparent index in the final model
• Example: n = 1000; we have developed a final model that is hopefully ready to publish. Call the estimates from this final model β̂.
  – the final model has apparent R² (R²app) = 0.4
  – how inflated is R²app?
  – get resamples of size 1000 with replacement from the original 1000
  – for each resample compute R²boot = apparent R² in the bootstrap sample
  – freeze these coefficients (call them β̂boot), apply to the original (whole) sample (Xorig, Yorig) to get R²orig = R²(Xorig β̂boot, Yorig)
  – optimism = R²boot − R²orig
  – average over B = 100 optimisms to get the estimated optimism
  – R² overfitting-corrected = R²app − optimism
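A minimal sketch of the optimism loop above, done by hand for Dxy (a rank index that is easy to compute at frozen coefficients); validate(f, B=100) automates this for R², slope, and several other indexes at once. The data frame d with 0/1 response y and predictors x1, x2 is hypothetical:

require(rms); require(Hmisc)
f <- lrm(y ~ x1 + x2, data=d)
Dxy.app <- somers2(predict(f, d), d$y)['Dxy']        # apparent Dxy
opt <- numeric(100)
for(i in 1:100) {
  j <- sample(nrow(d), replace=TRUE)
  g <- lrm(y ~ x1 + x2, data=d[j, ])                 # refit on bootstrap sample
  Dxy.boot <- somers2(predict(g, d[j, ]), d$y[j])['Dxy']
  Dxy.orig <- somers2(predict(g, d),      d$y   )['Dxy']   # frozen coefficients, original data
  opt[i]   <- Dxy.boot - Dxy.orig
}
Dxy.app - mean(opt)                                  # overfitting-corrected Dxy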
• This estimates the unconditional (not conditional on X) distribution of R², etc. [45, p. 217]
• Conditional estimates would require assuming the very model one is trying to validate
• Efron's ".632" method may perform better (reduce bias further) for small n 40, [41, p. 253], 42
The bootstrap is also useful for assessing calibration in addition to discrimination:
• Fit C(Y|X) = Xβ on the bootstrap sample
• Re-fit C(Y|X) = γ0 + γ1 Xβ̂ on the same data
• γ̂0 = 0, γ̂1 = 1
• Test data (the original dataset): re-estimate γ0, γ1
• γ̂1 < 1 if overfit; γ̂0 > 0 to compensate
• γ̂1 quantifies overfitting and is useful for improving calibration 103
• Use Efron's method to estimate the optimism in (0, 1); estimate (γ0, γ1) by subtracting the optimism from (0, 1)
• See also Copas 30 and van Houwelingen and le Cessie [111, p. 1318]

See [47] for warnings about the bootstrap, and [40] for variations on the bootstrap to reduce bias.

Use the bootstrap to choose between full and reduced models:
• Bootstrap estimate of accuracy for the full model
• Repeat, using the chosen stopping rule for each resample
• The full fit usually outperforms the reduced model 103
• Stepwise modeling often reduces optimism, but this is not offset by the loss of information from deleting marginal variables
Method           Apparent Rank Correlation   Over-Optimism   Bias-Corrected
                 of Predicted vs. Observed                   Correlation
Full Model              0.50                      0.06            0.44
Stepwise Model          0.47                      0.05            0.42

In this example, stepwise modeling lost a possible 0.50 − 0.47 = 0.03 of predictive discrimination. The full model fit will especially be an improvement when:
1. The stepwise selection deleted several variables which were almost significant.
2. These marginal variables have some real predictive value, even if it is slight.
3. There is no small set of extremely dominant variables that would be easily found by stepwise selection.
Other issues:
• See [111] for many interesting ideas
• Faraway 45 shows how the bootstrap is used to penalize for choosing transformations for Y, outlier and influence checking, variable selection, etc., simultaneously
• Brownstone [20, p. 74] feels that "theoretical statisticians have been unable to analyze the sampling properties of [usual multi-step modeling strategies] under realistic conditions" and concludes that the modeling strategy must be completely specified and then bootstrapped to get consistent estimates of variances and other sampling properties
• See Blettner and Sauerbrei 13 and Chatfield 24 for more interesting examples of problems resulting from data-driven analyses.
4.4 Simplifying the Final Model by Approximating It

4.4.1 Difficulties Using Full Models

• Predictions are conditional on all variables; standard errors ↑ when predicting for a low-frequency category
• Collinearity
• Can average predictions over categories to marginalize, ↓ s.e.
4.4.2 Approximating the Full Model

• The full model is the gold standard
• Approximate it to any desired degree of accuracy
• If approximating with a tree, the best cross-validating tree will have 1 obs./node
• Can use least squares to approximate the model by predicting Ŷ = Xβ̂
• When the original model was also fit using least squares, the coefficients of the approximate model fitted against Ŷ ≡ the coefficients of the subset of variables fitted against Y (as in stepwise)
• Model approximation still has some advantages:
  1. Uses an unbiased estimate of σ from the full fit
  2. The stopping rule is less arbitrary
  3. Inheritance of shrinkage
• If the estimates from the full model are β̂ and the approximate model is based on a subset T of the predictors X, the coefficients of the approximate model are Wβ̂, where W = (T′T)⁻¹T′X
• The variance matrix of the reduced coefficients is WVW′
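A minimal sketch of the matrix algebra above (X = full-model design matrix, Z = the columns for the retained subset T, beta = full-model coefficients, V = their covariance matrix; all objects are hypothetical):

W        <- solve(t(Z) %*% Z, t(Z) %*% X)   # W = (T'T)^{-1} T'X
beta.apx <- W %*% beta                      # coefficients of the approximate model
V.apx    <- W %*% V %*% t(W)                # their variance matrix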
4.5 How Do We Break Bad Habits?

• Insist on validation of predictive models and discoveries
• Show collaborators that split-sample validation is not appropriate unless the number of subjects is huge
  – Split more than once and see volatile results
  – Calculate a confidence interval for the predictive accuracy in the test dataset and show that it is very wide
• Run a simulation study with no real associations and show that associations are easy to find
• Analyze the collaborator's data after randomly permuting the Y vector and show some positive findings
• Show that alternative explanations are easy to posit
  – The importance of a risk factor may disappear if 5 "unimportant" risk factors are added back to the model
  – Omitted main effects can explain apparent interactions
Chapter 5

S Software

S allows interaction spline functions, a wide variety of predictor parameterizations, a wide variety of models, a unifying model formula language, and model validation by resampling.

S is comprehensive:
• Easy to write S functions for new models → a wide variety of modern regression models implemented (trees, nonparametric, ACE, AVAS, survival models for multiple events)
• Designs can be generated for any model → all handle "class" variables, interactions, nonlinear expansions
• Single S objects (e.g., the fit object) can be self-documenting → automatic hypothesis tests, predictions for new data
• Superior graphics
• Classes and generic functions
5.1 The S Modeling Language

The S statistical modeling language:

response ~ terms

y ~ age + sex                   # age + sex main effects
y ~ age + sex + age:sex         # add second-order interaction
y ~ age*sex                     # second-order interaction +
                                # all main effects
y ~ (age + sex + pressure)^2
                                # age+sex+pressure+age:sex+age:pressure...
y ~ (age + sex + pressure)^2 - sex:pressure
                                # all main effects and all 2nd order
                                # interactions except sex:pressure
y ~ (age + race)*sex            # age+race+sex+age:sex+race:sex
y ~ treatment*(age*race + age*sex)   # no interact. with race,sex
sqrt(y) ~ sex*sqrt(age) + race  # functions, with dummy variables generated if
                                # race is an S factor (classification) variable
y ~ sex + poly(age,2)           # poly generates orthogonal polynomials
race.sex <- interaction(race,sex)
y ~ age + race.sex              # for when you want dummy variables for
                                # all combinations of the factors
The formula for a regression model is given to a modeling function, e.g.

lrm(y ~ rcs(x,4))

is read "use a logistic regression model to model y as a function of x, representing x by a restricted cubic spline with 4 default knots"[a].

The update function re-fits a model with changes in terms or data:

f  <- lrm(y ~ rcs(x,4) + x2 + x3)
f2 <- update(f, subset=sex=="male")
f3 <- update(f, .~. - x2)          # remove x2 from model
f4 <- update(f, .~. + rcs(x5,5))   # add rcs(x5,5) to model
f5 <- update(f, y2 ~ .)            # same terms, new response var.
5.2 User-Contributed Functions

• S is a high-level object-oriented language.
• S-Plus (UNIX, Linux, Microsoft Windows)
• R (UNIX, Linux, Mac, Windows)
• Multitude of user-contributed functions freely available
• International community of users

[a] lrm and rcs are in the rms package.

Some S functions:
• See Venables and Ripley
• Hierarchical clustering: hclust
• Principal components: princomp, prcomp
• Canonical correlation: cancor
• Nonparametric transform-both-sides additive models: ace, avas
• Parametric transform-both-sides additive models: areg, areg.boot (Hmisc package in R, S-Plus)
• Rank correlation methods: rcorr, hoeffd, spearman2 (Hmisc)
• Variable clustering: varclus (Hmisc)
• Single imputation: transcan (Hmisc)
• Multiple imputation: aregImpute (Hmisc)
• Restricted cubic splines: rcspline.eval (Hmisc)
• Re-state a restricted spline in simpler form: rcspline.restate (Hmisc)
5.3 The rms Package

• datadist function to compute predictor distribution summaries

y ~ sex + lsp(age,c(20,30,40,50,60)) +
    sex %ia% lsp(age,c(20,30,40,50,60))

E.g. restrict the age × cholesterol interaction to be of the form AF(B) + BG(A):

y ~ lsp(age,30) + rcs(cholesterol,4) +
    lsp(age,30) %ia% rcs(cholesterol,4)

Special fitting functions by Harrell simplify the procedures described in these notes:
Table 5.1: rms Fitting Functions

Function   Purpose                                               Related S Functions
ols        Ordinary least squares linear model                   lm
lrm        Binary and ordinal logistic regression model;         glm
           has options for penalized MLE
psm        Accelerated failure time parametric survival models   survreg
cph        Cox proportional hazards regression                   coxph
bj         Buckley-James censored least squares model            survreg, lm
Glm        rms version of glm                                    glm
Gls        rms version of gls                                    gls (nlme package)
Rq         rms version of rq                                     rq (quantreg package)
Table 5.2: rms Transformation Functions

Function   Purpose                                               Related S Functions
asis       No post-transformation (seldom used explicitly)       I
rcs        Restricted cubic splines                              ns
pol        Polynomial using standard notation                    poly
lsp        Linear spline
catg       Categorical predictor (seldom used explicitly)        factor
scored     Ordinal categorical variables                         ordered
matrx      Keep variables as a group for anova and fastbw        matrix
strat      Non-modeled stratification factors (used for cph      strata
           only)
Function            Purpose                                                    Related Functions
print               Print parameters and statistics of fit
coef                Fitted regression coefficients
formula             Formula used in the fit
specs               Detailed specifications of fit
vcov                Fetch covariance matrix
logLik              Fetch maximized log-likelihood
AIC                 Fetch AIC with option to put on chi-square basis
lrtest              Likelihood ratio test for two nested models
univarLR            Compute all univariable LR χ2
robcov              Robust covariance matrix estimates
bootcov             Bootstrap covariance matrix estimates and bootstrap
                    distributions of estimates
pentrace            Find optimum penalty factors by tracing effective AIC
                    for a grid of penalties
effective.df        Print effective d.f. for each type of variable in model,
                    for penalized fit or pentrace result
summary             Summary of effects of predictors
plot.summary        Plot continuously shaded confidence bars for results of
                    summary
anova               Wald tests of most meaningful hypotheses
plot.anova          Graphical depiction of anova
contrast            General contrasts, C.L., tests
gendata             Easily generate predictor combinations
predict             Obtain predicted values or design matrix
Predict             Obtain predicted values and confidence limits easily,
                    varying a subset of predictors with others set at
                    default values
plot.Predict        Plot effects of predictors
fastbw              Fast backward step-down variable selection                step
residuals (resid)   Residuals, influence stats from fit
sensuc              Sensitivity analysis for unmeasured confounder
which.influence     Which observations are overly influential                 residuals
latex               LaTeX representation of fitted model                      Function
Function            S function analytic representation of Xβ̂ from a          latex
                    fitted regression model
Function            Purpose                                                    Related Functions
Hazard              S function analytic representation of a fitted hazard
                    function (for psm)
Survival            S function analytic representation of fitted survival
                    function (for psm, cph)
Quantile            S function analytic representation of fitted function
                    for quantiles of survival time (for psm, cph)
Mean                S function analytic representation of fitted function
                    for mean survival time or for ordinal logistic
nomogram            Draws a nomogram for the fitted model                     latex, plot
survest             Estimate survival probabilities (psm, cph)                survfit
survplot            Plot survival curves (psm, cph)                           plot.survfit
validate            Validate indexes of model fit using resampling
val.prob            External validation of a probability model                lrm
val.surv            External validation of a survival model                   calibrate
calibrate           Estimate calibration curve using resampling               val.prob
vif                 Variance inflation factors for fitted model
naresid             Bring elements corresponding to missing data back into
                    predictions and residuals
naprint             Print summary of missing values
impute              Impute missing values                                     aregImpute
Example:
• treat: categorical variable with levels "a", "b", "c"
• num.diseases: ordinal variable, 0-4
• age: continuous; restricted cubic spline
• cholesterol: continuous (3 missings; use median); log(cholesterol+10)
• Allow treat × cholesterol interaction
• Program to fit the logistic model, test all effects in the design, estimate effects (e.g. inter-quartile-range odds ratios), and plot estimated transformations
require(rms)    # make new functions available
ddist <- datadist(cholesterol, treat, num.diseases, age)
# Could have used ddist <- datadist(data.frame.name)
options(datadist="ddist")         # defines data dist. to rms
cholesterol <- impute(cholesterol)
fit <- lrm(y ~ treat + scored(num.diseases) + rcs(age) +
           log(cholesterol+10) + treat:log(cholesterol+10))
describe(y ~ treat + scored(num.diseases) + rcs(age))
# or use describe(formula(fit)) for all variables used in fit
# describe function (in Hmisc) gets simple statistics on variables
# fit <- robcov(fit)   # Would make all statistics that follow
                       # use a robust covariance matrix
                       # would need x=T, y=T in lrm()
specs(fit)                        # Describe the design characteristics
anova(fit)
anova(fit, treat, cholesterol)    # Test these 2 by themselves
plot(anova(fit))                  # Summarize anova graphically
summary(fit)                      # Estimate effects using default ranges
plot(summary(fit))                # Graphical display of effects with C.I.
summary(fit, treat="b", age=60)   # Specify reference cell and adjustment val
summary(fit, age=c(50,70))        # Estimate effect of increasing age from
                                  # 50 to 70
summary(fit, age=c(50,60,70))     # Increase age from 50 to 70, adjust to
                                  # 60 when estimating effects of other
                                  # factors
# If had not defined datadist, would have to define ranges for all var.

# Estimate and test treatment (b-a) effect averaged over 3 cholesterols
contrast(fit, list(treat='b', cholesterol=c(150,200,250)),
              list(treat='a', cholesterol=c(150,200,250)),
         type='average')
# See the help file for contrast.rms for several examples of
# how to obtain joint tests of multiple contrasts.

p <- Predict(fit, age=seq(20,80,length=100), treat, conf.int=FALSE)
plot(p)                  # Plot relationship between age and log
                         # odds, separate curve for each treat,
                         # no C.I.
plot(p, ~ age | treat)   # Same but 2 panels
bplot(Predict(fit, age, cholesterol, np=50))
# 3-dimensional perspective plot for age,
# cholesterol, and log odds using default
# ranges for both variables
plot(Predict(fit, num.diseases, fun=function(x) 1/(1+exp(-x)),
             conf.int=.9), ylab="Prob")
# Plot estimated probabilities instead of
# log odds
# Again, if no datadist were defined, would have to tell plot all limits
logit <- predict(fit, expand.grid(treat="b", num.dis=1:3, age=c(20,40,60),
                                  cholesterol=seq(100,300,length=10)))
# Could also obtain list of predictor settings interactively
logit <- predict(fit, gendata(fit, nobs=12))

# Since age doesn't interact with anything, we can quickly and
# interactively try various transformations of age, taking the spline
# function of age as the gold standard.  We are seeking a linearizing
# transformation.
ag <- 10:80
logit <- predict(fit, expand.grid(treat="a", num.dis=0, age=ag,
                 cholesterol=median(cholesterol)), type="terms")[,"age"]
# Note: if age interacted with anything, this would be the age
# "main effect" ignoring interaction terms
# Could also use
# logit <- Predict(f, age=ag, ...)$yhat,
# which allows evaluation of the shape for any level of interacting
# factors.  When age does not interact with anything, the result from
# predict(f, ..., type="terms") would equal the result from
# Predict if all other terms were ignored
# Could also specify
# logit <- predict(fit, gendata(fit, age=ag, cholesterol=...))
# Un-mentioned variables set to reference values

plot(ag^.5,  logit)    # try square root vs. spline transform.
plot(ag^1.5, logit)    # try 1.5 power
latex(fit)             # invokes latex.lrm, creates fit.tex
# Draw a nomogram for the model fit
plot(nomogram(fit))
# Compose S function to evaluate linear predictors analytically
g <- Function(fit)
g(treat='b', cholesterol=260, age=50)
# Letting num.diseases default to reference value
To examine interactions in a simpler way, you may want to group age into tertiles:

age.tertile <- cut2(age, g=3)
# For automatic ranges later, add age.tertile to datadist input
fit <- lrm(y ~ age.tertile * rcs(cholesterol))
5.4 Other Functions

• supsmu: Friedman's "super smoother"
• lowess: Cleveland's scatterplot smoother
• glm: generalized linear models (see Glm)
• gam: generalized additive models
• rpart: like original CART with surrogate splits for missings, censored data extension (Atkinson & Therneau)
• validate.rpart: in rms; validates recursive partitioning with respect to certain accuracy indexes
• loess: multi-dimensional scatterplot smoother

f <- loess(y ~ age * pressure)
plot(f)                            # cross-sectional plots
ages      <- seq(20, 70, length=40)
pressures <- seq(80, 200, length=40)
pred <- predict(f, expand.grid(age=ages, pressure=pressures))
persp(ages, pressures, pred)       # 3-d plot
Chapter 6

Logistic Model Case Study: Survival of Titanic Passengers

Data source: The Titanic Passenger List edited by Michael A. Findlay, originally published in Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, and expanded with the help of the Internet community. The original html files were obtained from Philip Hind (1999) (http://atschool.eduweb.co.uk/phind). The dataset was compiled and interpreted by Thomas Cason. It is available in R, S-Plus, and Excel formats from biostat.mc.vanderbilt.edu/DataSets under the name titanic3.
6.1 Descriptive Statistics

require(rms)
getHdata(titanic3)              # get dataset from web site
units(titanic3$age) <- 'years'
# List of names of variables to analyze
v <- c('pclass','survived','age','sex','sibsp','parch')
latex(describe(titanic3[,v]), file='')
titanic3[, v]
6 Variables   1309 Observations

pclass
   n  missing  unique
1309        0       3
1st (323, 25%), 2nd (277, 21%), 3rd (709, 54%)

survived: Survived
   n  missing  unique   Sum   Mean
1309        0       2   500  0.382

age: Age [years]
   n  missing  unique   Mean  .05  .10  .25  .50  .75  .90  .95
1046      263      98  29.88    5   14   21   28   39   50   57
lowest : 0.1667 0.3333 0.4167 0.6667 0.7500
highest: 70.5000 71.0000 74.0000 76.0000 80.0000

sex
   n  missing  unique
1309        0       2
female (466, 36%), male (843, 64%)

sibsp: Number of Siblings/Spouses Aboard
   n  missing  unique    Mean
1309        0       7  0.4989
            0    1   2   3   4   5   8
Frequency 891  319  42  20  22   6   9
%          68   24   3   2   2   0   1

parch: Number of Parents/Children Aboard
   n  missing  unique   Mean
1309        0       8  0.385
             0    1    2   3   4   5   6   9
Frequency 1002  170  113   8   6   6   2   2
%           77   13    9   1   0   0   0   0
dd <- datadist(titanic3[,v])   # describe distributions of variables to rms
options(datadist='dd')
attach(titanic3[,v])
options(digits=2)
s <- summary(survived ~ age + sex + pclass + cut2(sibsp,0:3) + cut2(parch,0:3))
latex(s, file='', label='titanic-summary.table')   # create LaTeX code for Table
Table 6.1: Survived   N=1309

                                        N   survived
Age [years]
  [0.167,22.0)                        290     0.43
  [22.000,28.5)                       246     0.39
  [28.500,40.0)                       265     0.42
  [40.000,80.0]                       245     0.39
  Missing                             263     0.28
sex
  female                              466     0.73
  male                                843     0.19
pclass
  1st                                 323     0.62
  2nd                                 277     0.43
  3rd                                 709     0.26
Number of Siblings/Spouses Aboard
  0                                   891     0.35
  1                                   319     0.51
  2                                    42     0.45
  [3,8]                                57     0.16
Number of Parents/Children Aboard
  0                                  1002     0.34
  1                                   170     0.59
  2                                   113     0.50
  [3,9]                                24     0.29
Overall                              1309     0.38
plot(s, main='', subtitles=FALSE)   # convert table to dot plot (Figure 6.1)

Show 4-way relationships after collapsing levels. Suppress estimates based on < 25 passengers.
agec        <- ifelse(age < 21, 'child', 'adult')
sibsp.parch <- paste(ifelse(sibsp==0, 'no sib/spouse', 'sib/spouse'),
                     ifelse(parch==0, 'no parent/child', 'parent/child'),
                     sep='/')
g <- function(y) if(length(y) < 25) NA else mean(y)
s <- summarize(survived, llist(agec, sex, pclass, sibsp.parch), g)
# llist, summarize, Dotplot in Hmisc package
require(lattice)   # trellis for S-Plus
## To remove color background from strip labels do the following:
## ltheme <- canonical.theme(color=FALSE)
## ltheme$strip.background$col <- "transparent"
Figure 6.1: Univariable summaries of Titanic survival
## lattice.options(default.theme = ltheme)   ## set as default
i <- s$agec != 'NA'
print(Dotplot(pclass ~ survived | sibsp.parch*agec, groups=sex[i],
              data=s, subset=i, pch=c(1,4), col=c(1,1),
              xlab='Proportion Surviving',
              par.strip.text=list(cex=.6)))   # Figure 6.2
Key(.07)
6.2 Exploring Trends with Nonparametric Regression

# Figure 6.3
plsmo(age, survived, datadensity=TRUE)
plsmo(age, survived, group=sex, datadensity=TRUE)
plsmo(age, survived, group=pclass, datadensity=TRUE)
plsmo(age, survived, group=interaction(pclass,sex), datadensity=TRUE,
      lty=c(1,1,1,2,2,2))
# Figure 6.4
plsmo(age, survived, group=cut2(sibsp,0:2), datadensity=TRUE)
plsmo(age, survived, group=cut2(parch,0:2), datadensity=TRUE)
Figure 6.2: Multi-way summary of Titanic survival
6.3 Binary Logistic Model with Casewise Deletion of Missing Values

First fit a model that is saturated with respect to age, sex, and pclass. There is insufficient variation in sibsp and parch to fit complex interactions or nonlinearities.

f1 <- lrm(survived ~ sex*pclass*rcs(age,5) + rcs(age,5)*(sibsp + parch))
latex(anova(f1), file='', label='titanic-anova3')   # Table 6.2

The three-way interactions and parch are clearly insignificant, so drop them:

f <- lrm(survived ~ (sex + pclass + rcs(age,5))^2 + rcs(age,5)*sibsp)
print(f, latex=TRUE)

Logistic Regression Model

lrm(formula = survived ~ (sex + pclass + rcs(age, 5))^2 + rcs(age, 5) * sibsp)
Figure 6.3: Nonparametric regression (loess) estimates of the relationship between age and the probability of surviving the Titanic. The top left panel shows unstratified estimates. The top right panel depicts relationships stratified by sex. The bottom left and right panels show respectively estimates stratified by class and by the cross-classification of sex and class of the passenger. Tick marks are drawn at actual age values for each stratum.
Figure 6.4: Relationship between age and survival stratified by the number of siblings or spouses on board (left panel) or by the number of parents or children of the passenger on board (right panel)
Table
6.2:
Wald
Statisticsforsurvived
χ2
d.f.
P
sex(Factor+
Higher
Order
Factors)
187.15
15
<0.0001
AllInteractions
59.74
14
<0.0001
pclass
(Factor+
Higher
Order
Factors)
100.10
20
<0.0001
AllInteractions
46.51
18
0.0003
age(Factor+
Higher
Order
Factors)
56.20
32
0.0052
AllInteractions
34.57
28
0.1826
Nonlinear(Factor+
Higher
Order
Factors)
28.66
24
0.2331
sibsp
(Factor+
Higher
Order
Factors)
19.67
50.0014
AllInteractions
12.13
40.0164
parch(Factor+
Higher
Order
Factors)
3.51
50.6217
AllInteractions
3.51
40.4761
sex×
pclass
(Factor+
Higher
Order
Factors)
42.43
10
<0.0001
sex×
age(Factor+
Higher
Order
Factors)
15.89
12
0.1962
Nonlinear(Factor+
Higher
Order
Factors)
14.47
90.1066
NonlinearInteraction:f(A,B
)vs.AB
4.17
30.2441
pclass×
age(Factor+
Higher
Order
Factors)
13.47
16
0.6385
Nonlinear(Factor+
Higher
Order
Factors)
12.92
12
0.3749
NonlinearInteraction:f(A,B
)vs.AB
6.88
60.3324
age×
sibsp
(Factor+
Higher
Order
Factors)
12.13
40.0164
Nonlinear
1.76
30.6235
NonlinearInteraction:f(A,B
)vs.AB
1.76
30.6235
age×
parch(Factor+
Higher
Order
Factors)
3.51
40.4761
Nonlinear
1.80
30.6147
NonlinearInteraction:f(A,B
)vs.AB
1.80
30.6147
sex×
pclass×
age(Factor+
Higher
Order
Factors)
8.34
80.4006
Nonlinear
7.74
60.2581
TOTAL
NONLIN
EAR
28.66
24
0.2331
TOTAL
INTERACTIO
N75.61
30
<0.0001
TOTAL
NONLIN
EAR
+IN
TERACTIO
N79.49
33
<0.0001
TOTAL
241.93
39
<0.0001
Frequencies of Missing Values Due to Each Variable
survived   sex   pclass   age   sibsp
       0     0        0   263       0
ModelLikelihood
Discrim
ination
RankDiscrim
.Ratio
Test
Indexes
Indexes
Obs
1046
LRχ2
553.87
R2
0.555
C0.878
0619
d.f.
26g
2.427
Dxy
0.756
1427
Pr(>
χ2)<
0.0001
g r11.325
γ0.758
max|deriv|6×10
−6
g p0.365
τ a0.366
Brier
0.130
Coef
S.E.
WaldZ
Pr(>|Z|)
Intercept
3.3075
1.8427
1.79
0.0727
sex=
male
-1.1478
1.0878
-1.06
0.2914
pclass=
2nd
6.7309
3.9617
1.70
0.0893
pclass=
3rd
-1.6437
1.8299
-0.90
0.3691
age
0.0886
0.1346
0.66
0.5102
age’
-0.7410
0.6513
-1.14
0.2552
age”
4.9264
4.0047
1.23
0.2186
age”’
-6.6129
5.4100
-1.22
0.2216
sibsp
-1.0446
0.3441
-3.04
0.0024
sex=
male*pclass=
2nd
-0.7682
0.7083
-1.08
0.2781
sex=
male*pclass=
3rd
2.1520
0.6214
3.46
0.0005
sex=
male*age
-0.2191
0.0722
-3.04
0.0024
sex=
male*age’
1.0842
0.3886
2.79
0.0053
sex=
male*age”
-6.5578
2.6511
-2.47
0.0134
sex=
male*age”’
8.3716
3.8532
2.17
0.0298
pclass=
2nd*age
-0.5446
0.2653
-2.05
0.0401
pclass=
3rd*age
-0.1634
0.1308
-1.25
0.2118
pclass=
2nd*age’
1.9156
1.0189
1.88
0.0601
pclass=
3rd*age’
0.8205
0.6091
1.35
0.1780
pclass=
2nd*age”
-8.9545
5.5027
-1.63
0.1037
pclass=
3rd*age”
-5.4276
3.6475
-1.49
0.1367
pclass=
2nd*age”’
9.3926
6.9559
1.35
0.1769
pclass=
3rd*age”’
7.5403
4.8519
1.55
0.1202
age*sibsp
0.0357
0.0340
1.05
0.2933
age’
*sibsp
-0.0467
0.2213
-0.21
0.8330
age”
*sibsp
0.5574
1.6680
0.33
0.7382
age”’*sibsp
-1.1937
2.5711
-0.46
0.6425
latex(anova(f), file='', label='titanic-anova2')   # Table 6.3
Table
6.3:
Wald
Statisticsforsurvived
χ2
d.f.
P
sex(Factor+
Higher
Order
Factors)
199.42
7<
0.0001
AllInteractions
56.14
6<
0.0001
pclass
(Factor+
Higher
Order
Factors)
108.73
12
<0.0001
AllInteractions
42.83
10
<0.0001
age(Factor+
Higher
Order
Factors)
47.04
20
0.0006
AllInteractions
24.51
16
0.0789
Nonlinear(Factor+
Higher
Order
Factors)
22.72
15
0.0902
sibsp
(Factor+
Higher
Order
Factors)
19.95
50.0013
AllInteractions
10.99
40.0267
sex×
pclass
(Factor+
Higher
Order
Factors)
35.40
2<
0.0001
sex×
age(Factor+
Higher
Order
Factors)
10.08
40.0391
Nonlinear
8.17
30.0426
NonlinearInteraction:f(A,B
)vs.AB
8.17
30.0426
pclass×
age(Factor+
Higher
Order
Factors)
6.86
80.5516
Nonlinear
6.11
60.4113
NonlinearInteraction:f(A,B
)vs.AB
6.11
60.4113
age×
sibsp
(Factor+
Higher
Order
Factors)
10.99
40.0267
Nonlinear
1.81
30.6134
NonlinearInteraction:f(A,B
)vs.AB
1.81
30.6134
TOTAL
NONLIN
EAR
22.72
15
0.0902
TOTAL
INTERACTIO
N67.58
18
<0.0001
TOTAL
NONLIN
EAR
+IN
TERACTIO
N70.68
21
<0.0001
TOTAL
253.18
26
<0.0001
Show the many effects of the predictors.

p <- Predict(f, age, pclass, sex, fun=plogis)
plot(p, adj.subtitle=FALSE)                      # Fig. 6.5
# To take control of panel vs groups assignment use:
# plot(p, ~ age | sex, groups='pclass', adj.subtitle=FALSE)
plot(Predict(f, sibsp, age=c(10,15,20,50), conf.int=FALSE))   # Fig. 6.6
Note that children having many siblings apparently had lower survival. Married adults had slightly higher survival than unmarried ones.

Validate the model using the bootstrap to check overfitting, ignoring two very insignificant pooled
Figure 6.5: Effects of predictors on the probability of survival of Titanic passengers, estimated for zero siblings or spouses. Lines for females are black; males are drawn using grayscale.
Figure 6.6: Effect of number of siblings and spouses on the log odds of surviving, for third class males. Numbers next to lines are ages in years.
tests.
f <- update(f, x=TRUE, y=TRUE)
# x=TRUE, y=TRUE adds raw data to fit object so can bootstrap
set.seed(131)                 # so can replicate re-samples
latex(validate(f, B=80), digits=2, size='Ssize')
Index        Original   Training   Test      Optimism   Corrected    n
             Sample     Sample     Sample               Index
Dxy            0.76       0.77      0.74       0.03       0.72       80
R2             0.55       0.58      0.53       0.05       0.50       80
Intercept      0.00       0.00     −0.09       0.09      −0.09       80
Slope          1.00       1.00      0.86       0.14       0.86       80
Emax           0.00       0.00      0.05       0.05       0.05       80
D              0.53       0.56      0.49       0.07       0.46       80
U              0.00       0.00      0.01      −0.01       0.01       80
Q              0.53       0.56      0.49       0.08       0.45       80
B              0.13       0.12      0.13      −0.01       0.14       80
g              2.43       2.79      2.38       0.40       2.02       80
gp             0.37       0.37      0.35       0.02       0.35       80
cal <- calibrate(f, B=80)   # Figure 6.7
plot(cal)

n=1046   Mean absolute error=0.012   Mean squared error=0.00018
0.9 Quantile of absolute error=0.018

But there is a moderate problem with missing data.
6.4 Examining Missing Data Patterns

na.patterns <- naclus(titanic3)
require(rpart)              # Recursive partitioning package
who.na <- rpart(is.na(age) ~ sex + pclass + survived + sibsp + parch,
                minbucket=15)
naplot(na.patterns, 'na per var')
plot(na.patterns)
options(digits=5)
plot(who.na, margin=.1);  text(who.na)   # Figure 6.8
Figure 6.7: Bootstrap overfitting-corrected loess nonparametric calibration curve for the casewise-deletion model
plot(summary(is.na(age) ~ sex + pclass + survived + sibsp + parch))   # Figure 6.9
m <- lrm(is.na(age) ~ sex * pclass + survived + sibsp + parch)
print(m, latex=TRUE)
LogisticRegressionModel
lrm(formula=
is.na(age)
~sex*
pclass+survived
+sibsp+
parch)
ModelLikelihood
Discrim
ination
RankDiscrim
.Ratio
Test
Indexes
Indexes
Obs
1309
LRχ2
114.99
R2
0.133
C0.703
FALSE
1046
d.f.
8g
1.015
Dxy
0.406
TRUE
263
Pr(>
χ2)<
0.0001
g r2.759
γ0.452
max|deriv|5×10
−6
g p0.126
τ a0.131
Brier
0.148
CHAPTER
6.
LOGISTIC
MODELCASE
STUDY:SURVIV
ALOFTIT
ANIC
PASSENGERS
144
Figure 6.8: Patterns of missing data. Upper left panel shows the fraction of observations missing on each predictor. Upper right panel depicts a hierarchical cluster analysis of missingness combinations. The similarity measure shown on the Y-axis is the fraction of observations for which both variables are missing. Lower left panel shows the result of recursive partitioning for predicting is.na(age). The rpart function found only strong patterns according to passenger class.
Figure 6.9: Univariable descriptions of proportion of passengers with missing age
                            Coef     S.E.    Wald Z   Pr(>|Z|)
Intercept                 -2.2030   0.3641   -6.05    <0.0001
sex=male                   0.6440   0.3953    1.63     0.1033
pclass=2nd                -1.0079   0.6658   -1.51     0.1300
pclass=3rd                 1.6124   0.3596    4.48    <0.0001
survived                  -0.1806   0.1828   -0.99     0.3232
sibsp                      0.0435   0.0737    0.59     0.5548
parch                     -0.3526   0.1253   -2.81     0.0049
sex=male * pclass=2nd      0.1347   0.7545    0.18     0.8583
sex=male * pclass=3rd     -0.8563   0.4214   -2.03     0.0422
latex(anova(m), file='', label='titanic-anova.na')    # Table 6.4
pclass and parch are the important predictors of missing age.
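The same message is visible directly in the raw data; a minimal sketch (not part of the original code):

# Fraction of missing ages by passenger class and by number of parents/children aboard
with(titanic3, tapply(is.na(age), pclass, mean))
with(titanic3, tapply(is.na(age), parch, mean))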
Table 6.4: Wald Statistics for is.na(age)

                                               χ2    d.f.   P
sex (Factor+Higher Order Factors)             5.61    3     0.1324
  All Interactions                            5.58    2     0.0614
pclass (Factor+Higher Order Factors)         68.43    4    <0.0001
  All Interactions                            5.58    2     0.0614
survived                                      0.98    1     0.3232
sibsp                                         0.35    1     0.5548
parch                                         7.92    1     0.0049
sex × pclass (Factor+Higher Order Factors)    5.58    2     0.0614
TOTAL                                        82.90    8    <0.0001
6.5 Single Conditional Mean Imputation
First try: conditional mean imputation. The default spline transformation for age caused the distribution of imputed values to be much different from non-imputed ones; constrain to linear.
xtrans ← transcan(~ I(age) + sex + pclass + sibsp + parch,
                  imputed=TRUE, pl=FALSE, pr=FALSE, data=titanic3)
summary(xtrans)
transcan(x = ~I(age) + sex + pclass + sibsp + parch, imputed = TRUE,
    pr = FALSE, pl = FALSE, data = titanic3)

Iterations: 5

R2 achieved in predicting each variable:

  age   sex pclass  sibsp  parch
0.258 0.078  0.244  0.241  0.288

Adjusted R2:

  age   sex pclass  sibsp  parch
0.254 0.074  0.240  0.238  0.285

Coefficients of canonical variates for predicting each (row) variable
          age    sex  pclass  sibsp  parch
age             0.89   -6.13  -1.81  -2.77
sex      0.02           0.56  -0.10  -0.71
pclass  -0.08   0.26          -0.07  -0.25
sibsp   -0.02  -0.04   -0.07          0.87
parch   -0.03  -0.29   -0.22   0.75

Summary of imputed values

age
    n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
  263       0     24 28.41 16.76 21.66 26.17 28.04 28.04 42.92 42.92

lowest :  7.563  9.425 14.617 16.479 16.687
highest: 33.219 34.749 38.588 41.058 42.920

Starting estimates for imputed values:

age sex pclass sibsp parch
 28   2      3     0     0
# Look at mean imputed values by sex, pclass and observed means
# age.i is age, filled in with conditional mean estimates
age.i ← impute(xtrans, age, data=titanic3)
i ← is.imputed(age.i)
tapply(age.i[i], list(sex[i], pclass[i]), mean)
           1st    2nd    3rd
female  39.137 31.357 22.926
male    42.920 33.219 26.715
tapply(age, list(sex, pclass), mean, na.rm=TRUE)
           1st    2nd    3rd
female  37.038 27.499 22.185
male    41.029 30.815 25.962
dd ← datadist(dd, age.i)
f.si ← lrm(survived ~ (sex + pclass + rcs(age.i,5))^2 + rcs(age.i,5)*sibsp)
print(f.si, coefs=FALSE, latex=TRUE)
Logistic Regression Model

lrm(formula = survived ~ (sex + pclass + rcs(age.i, 5))^2 +
    rcs(age.i, 5) * sibsp)
Table 6.5: Wald Statistics for survived

                                                     χ2    d.f.   P
sex (Factor+Higher Order Factors)                  245.53    7   <0.0001
  All Interactions                                  52.80    6   <0.0001
pclass (Factor+Higher Order Factors)               112.02   12   <0.0001
  All Interactions                                  36.77   10    0.0001
age.i (Factor+Higher Order Factors)                 49.25   20    0.0003
  All Interactions                                  25.53   16    0.0610
  Nonlinear (Factor+Higher Order Factors)           19.86   15    0.1772
sibsp (Factor+Higher Order Factors)                 21.74    5    0.0006
  All Interactions                                  12.25    4    0.0156
sex × pclass (Factor+Higher Order Factors)          30.25    2   <0.0001
sex × age.i (Factor+Higher Order Factors)            8.95    4    0.0622
  Nonlinear                                          5.63    3    0.1308
  Nonlinear Interaction : f(A,B) vs. AB              5.63    3    0.1308
pclass × age.i (Factor+Higher Order Factors)         6.04    8    0.6427
  Nonlinear                                          5.44    6    0.4882
  Nonlinear Interaction : f(A,B) vs. AB              5.44    6    0.4882
age.i × sibsp (Factor+Higher Order Factors)         12.25    4    0.0156
  Nonlinear                                          2.04    3    0.5639
  Nonlinear Interaction : f(A,B) vs. AB              2.04    3    0.5639
TOTAL NONLINEAR                                     19.86   15    0.1772
TOTAL INTERACTION                                   66.83   18   <0.0001
TOTAL NONLINEAR + INTERACTION                       69.48   21   <0.0001
TOTAL                                              305.58   26   <0.0001
                         Model Likelihood      Discrimination   Rank Discrim.
                         Ratio Test            Indexes          Indexes
Obs          1309        LR χ2     641.01      R2     0.526     C      0.861
 0            809        d.f.          26      g      2.227     Dxy    0.722
 1            500        Pr(> χ2) <0.0001      gr     9.272     γ      0.728
max|deriv| 4×10⁻⁴                               gp     0.346     τa     0.341
                                                Brier  0.133
p1 ← Predict(f,    age,   pclass, sex, fun=plogis)
p2 ← Predict(f.si, age.i, pclass, sex, fun=plogis)
p  ← rbind('Casewise Deletion'=p1, 'Single Imputation'=p2,
           rename=c(age.i='age'))    # creates .set. variable
plot(p, ~ age | pclass*.set., groups='sex',
     ylab='Probability of Surviving', adj.subtitle=FALSE)    # Figure 6.10
latex(anova(f.si), file='', label='titanic-anova.si')    # Table 6.5
Figure 6.10: Predicted probability of survival for males from fit using casewise deletion (left panel) and single conditional mean imputation (right panel). sibsp is set to zero for these predicted values.
6.6 Multiple Imputation
The following uses aregImpute with predictive mean matching. By default, aregImpute does not transform age when it is being predicted from the other variables. Four knots are used to transform age when used to impute other variables (not needed here as no other missings were present).
set.seed(17)    # so can reproduce random aspects
mi ← aregImpute(~ age + sex + pclass + sibsp + parch + survived,
                n.impute=5, nk=4, pr=FALSE)
mi
Multiple Imputation using Bootstrap and PMM

aregImpute(formula = ~age + sex + pclass + sibsp + parch + survived,
    n.impute = 5, nk = 4, pr = FALSE)

n: 1309    p: 6    Imputations: 5    nk: 4

Number of NAs:
     age      sex   pclass    sibsp    parch survived
     263        0        0        0        0        0

         type d.f.
age         s    1
sex         c    1
pclass      c    2
sibsp       s    2
parch       s    2
survived    l    1

Transformation of Target Variables Forced to be Linear

R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
  age
0.344
# The 5 imputations for the first 10 passengers
# having missing age
mi$imputed$age[1:10, ]
    [,1] [,2] [,3] [,4] [,5]
16  28.5 60.0 32.5   46   71
38  26.0 26.0 29.0   49   51
41  47.0 62.0 47.0   55   42
47  45.0 47.0 17.0   46   39
60  39.0 27.0 42.0   39   18
70  39.0 39.0 23.0   30   41
71  29.0 42.0 47.0   47   61
75  46.0 28.5 32.5   17   36
81  47.0 48.0 30.0   55   40
107 62.0 50.0 23.0   33   17
Show the distribution of imputed (black) and actual ages (gray).
plot(mi)
Ecdf(age, add=TRUE, col='gray', lwd=2, subtitles=FALSE)    # Figure 6.11
Figure 6.11: Distributions of imputed and actual ages for the Titanic dataset
Fit logistic models for 5 completed datasets and print the ratio of imputation-corrected variances to average ordinary variances.
Table 6.6: Wald Statistics for survived

                                                     χ2    d.f.   P
sex (Factor+Higher Order Factors)                  236.24    7   <0.0001
  All Interactions                                  52.20    6   <0.0001
pclass (Factor+Higher Order Factors)               109.82   12   <0.0001
  All Interactions                                  37.09   10    0.0001
age (Factor+Higher Order Factors)                   49.09   20    0.0003
  All Interactions                                  22.73   16    0.1211
  Nonlinear (Factor+Higher Order Factors)           21.38   15    0.1251
sibsp (Factor+Higher Order Factors)                 23.68    5    0.0003
  All Interactions                                  11.00    4    0.0266
sex × pclass (Factor+Higher Order Factors)          33.48    2   <0.0001
sex × age (Factor+Higher Order Factors)              9.22    4    0.0559
  Nonlinear                                          7.18    3    0.0663
  Nonlinear Interaction : f(A,B) vs. AB              7.18    3    0.0663
pclass × age (Factor+Higher Order Factors)           3.66    8    0.8861
  Nonlinear                                          3.27    6    0.7739
  Nonlinear Interaction : f(A,B) vs. AB              3.27    6    0.7739
age × sibsp (Factor+Higher Order Factors)           11.00    4    0.0266
  Nonlinear                                          1.90    3    0.5925
  Nonlinear Interaction : f(A,B) vs. AB              1.90    3    0.5925
TOTAL NONLINEAR                                     21.38   15    0.1251
TOTAL INTERACTION                                   65.11   18   <0.0001
TOTAL NONLINEAR + INTERACTION                       68.89   21   <0.0001
TOTAL                                              302.90   26   <0.0001
f.mi ← fit.mult.impute(survived ~ (sex + pclass + rcs(age,5))^2 +
                         rcs(age,5)*sibsp,
                       lrm, mi, data=titanic3, pr=FALSE)
latex(anova(f.mi), file='', label='titanic-anova.mi')    # Table 6.6
The Wald χ2 for age is reduced by accounting for imputation but is increased by using patterns of association with survival status to impute missing age.
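A sketch for making this comparison numerically (it assumes the fits f, f.si, and f.mi created above; the grep on row names is an illustrative convenience, not part of the original notes):

fits <- list('Casewise Deletion'=f, 'Single Imputation'=f.si,
             'Multiple Imputation'=f.mi)
# Extract the rows of each Wald ANOVA table pertaining to age (or age.i)
lapply(fits, function(fit) {
  a <- anova(fit)
  a[grep('^age', rownames(a)), , drop=FALSE]
})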
Show estimated effects of age by classes.
p1 ← Predict(f.si, age.i, pclass, sex, fun=plogis)
p2 ← Predict(f.mi, age,   pclass, sex, fun=plogis)
p  ← rbind('Single Imputation'=p1, 'Multiple Imputation'=p2,
           rename=c(age.i='age'))
plot(p, ~ age | pclass*.set., groups='sex',
     ylab='Probability of Surviving', adj.subtitle=FALSE)    # Figure 6.12
Figure 6.12: Predicted probability of survival for males from fit using single conditional mean imputation again (left panel) and multiple random draw imputation (right panel). Both sets of predictions are for sibsp=0.
6.7 Summarizing the Fitted Model
Show odds ratios for changes in predictor values.
s ← summary(f.mi, age=c(1,30), sibsp=0:1)
# override default ranges for 3 variables
plot(s, log=TRUE, main='')    # Figure 6.13
Adjusted to: sex=male  pclass=3rd  age=28  sibsp=0
Figure 6.13: Odds ratios for some predictor settings
Get predicted values for certain types of passengers.
phat ← predict(f.mi,
               combos ← expand.grid(age=c(2,21,50), sex=levels(sex),
                                    pclass=levels(pclass), sibsp=0),
               type='fitted')
# Can also use Predict(f.mi, age=c(2,21,50), sex, pclass,
#                      sibsp=0, fun=plogis)$yhat
options(digits=1)
data.frame(combos, phat)
   age    sex pclass sibsp phat
1    2 female    1st     0 0.98
2   21 female    1st     0 0.98
3   50 female    1st     0 0.97
4    2   male    1st     0 0.88
5   21   male    1st     0 0.46
6   50   male    1st     0 0.27
7    2 female    2nd     0 1.00
8   21 female    2nd     0 0.90
9   50 female    2nd     0 0.83
10   2   male    2nd     0 1.00
11  21   male    2nd     0 0.08
12  50   male    2nd     0 0.04
13   2 female    3rd     0 0.84
14  21 female    3rd     0 0.57
15  50 female    3rd     0 0.37
16   2   male    3rd     0 0.89
17  21   male    3rd     0 0.14
18  50   male    3rd     0 0.05
options(digits=5)
We can also get predicted values by creating an S function that will evaluate the model on demand.
pred.logit ← Function(f.mi)
# Note: if don't define sibsp to pred.logit, defaults to 0
# normally just type the function name to see its body
latex(pred.logit, file='', type='Sinput', size='small')
pred.logit ← function(sex = "male", pclass = "3rd", age = 28, sibsp = 0)
{
  3.5810728 - 1.2694669 * (sex == "male") + 5.227106 * (pclass == "2nd") -
    1.7471648 * (pclass == "3rd") + 0.072213655 * age -
    0.00021294639 * pmax(age - 4, 0)^3 + 0.0015984839 * pmax(age - 21, 0)^3 -
    0.0023265999 * pmax(age - 28, 0)^3 + 0.0010212127 * pmax(age - 36.15, 0)^3 -
    8.0150336e-05 * pmax(age - 56, 0)^3 - 1.1339431 * sibsp +
    (sex == "male") * (-0.46284486 * (pclass == "2nd") +
                        2.0884806 * (pclass == "3rd")) +
    (sex == "male") * (-0.22398928 * age + 0.0003578076 * pmax(age - 4, 0)^3 -
      0.002354863 * pmax(age - 21, 0)^3 + 0.0032067241 * pmax(age - 28, 0)^3 -
      0.0013085171 * pmax(age - 36.15, 0)^3 + 9.8848428e-05 * pmax(age - 56, 0)^3) +
    (pclass == "2nd") * (-0.4600114 * age + 0.00052411339 * pmax(age - 4, 0)^3 -
      0.0025239553 * pmax(age - 21, 0)^3 + 0.0026577424 * pmax(age - 28, 0)^3 -
      0.00067164981 * pmax(age - 36.15, 0)^3 + 1.3749304e-05 * pmax(age - 56, 0)^3) +
    (pclass == "3rd") * (-0.14784979 * age + 0.00021831279 * pmax(age - 4, 0)^3 -
      0.001437761 * pmax(age - 21, 0)^3 + 0.0020012161 * pmax(age - 28, 0)^3 -
      0.00085968161 * pmax(age - 36.15, 0)^3 + 7.7913743e-05 * pmax(age - 56, 0)^3) +
    sibsp * (0.045169115 * age - 2.90579e-05 * pmax(age - 4, 0)^3 +
      0.00025289589 * pmax(age - 21, 0)^3 - 0.00048983359 * pmax(age - 28, 0)^3 +
      0.00032115845 * pmax(age - 36.15, 0)^3 - 5.5162848e-05 * pmax(age - 56, 0)^3)
}
# Run the newly created function
plogis(pred.logit(age=c(2,21,50), sex='male', pclass='3rd'))

[1] 0.886318 0.135294 0.054266
A nomogram could be used to obtain predicted values manually, but this is not feasible when so many interaction terms are present.
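For completeness, this is how such a nomogram would be requested from rms (a sketch only, left unevaluated here because the many interaction terms in f.mi make the resulting diagram impractical):

# plot(nomogram(f.mi, fun=plogis, funlabel='Probability of Surviving'))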
R/S-Plus Software Used

Package   Purpose                   Functions
Hmisc     Miscellaneous functions   summary, plsmo, naclus, llist, latex,
                                    summarize, Dotplot, describe, dataRep
Hmisc     Imputation                transcan, impute, fit.mult.impute, aregImpute
rms       Modeling                  datadist, lrm, rcs
          Model presentation        plot, summary, nomogram, Function
          Model validation          validate, calibrate
rpart*    Recursive partitioning    rpart

* Written by Atkinson & Therneau
Chapter 7

Case Study in Parametric Survival Modeling and Model Approximation
Data source: Random sample of 1000 patients from Phases I & II of SUPPORT (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment, funded by the Robert Wood Johnson Foundation). See [70]. The dataset is available from http://biostat.mc.vanderbilt.edu/DataSets.
• Analyze acute disease subset of SUPPORT (acute respiratory failure, multiple organ system failure, coma); the shape of the survival curves is different between acute and chronic disease categories
• Patients had to survive until day 3 of the study to qualify
• Baseline physiologic variables measured during day 3
7.1 Descriptive Statistics
Create a variable acute to flag categories of interest; print univariable descriptive statistics.
require(rms)
getHdata(support)    # Get data frame from web site
acute ← support$dzclass %in% c('ARF/MOSF', 'Coma')
latex(describe(support[acute,]), file='')
support[acute,]

35 Variables     537 Observations

age : Age
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    537       0    529  60.7 28.49 35.22 47.93 63.67 74.49 81.54 85.56
lowest : 18.04 18.41 19.76 20.30 20.31, highest: 91.62 91.82 91.93 92.74 95.51

death : Death at any time up to NDI date: 31DEC94
      n missing unique  Sum   Mean
    537       0      2  356 0.6629

sex
      n missing unique
    537       0      2
female (251, 47%), male (286, 53%)
hospdead : Death in Hospital
      n missing unique  Sum   Mean
    537       0      2  201 0.3743

slos : Days from Study Entry to Discharge
      n missing unique  Mean .05 .10 .25  .50  .75  .90  .95
    537       0     85 23.44 4.0 5.0 9.0 15.0 27.0 47.4 68.2
lowest :   3   4   5   6   7, highest: 145 164 202 236 241

d.time : Days of Follow-Up
      n missing unique  Mean .05 .10 .25 .50 .75  .90  .95
    537       0    340 446.1   4   6  16 182 724 1421 1742
lowest :    3    4    5    6    7, highest: 1977 1979 1982 2011 2022

dzgroup
      n missing unique
    537       0      3
ARF/MOSF w/Sepsis (391, 73%), Coma (60, 11%), MOSF w/Malig (86, 16%)

dzclass
      n missing unique
    537       0      2
ARF/MOSF (477, 89%), Coma (60, 11%)

num.co : number of comorbidities
      n missing unique  Mean
    537       0      7 1.525

            0   1   2  3  4  5  6
Frequency 111 196 133 51 31 10  5
%          21  36  25  9  6  2  1

edu : Years of Education
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
    411     126     22 12.03   7   8  10  12  14  16  17
lowest :  0  1  2  3  4, highest: 17 18 19 20 22

income
      n missing unique
    335     202      4
under $11k (158, 47%), $11-$25k (79, 24%), $25-$50k (63, 19%), >$50k (35, 10%)
scoma : SUPPORT Coma Score based on Glasgow D3
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
    537       0     11 19.24   0   0   0   0  37  55 100

            0  9 26 37 41 44 55 61 89 94 100
Frequency 301 50 44 19 17 43 11  6  8  6  32
%          56  9  8  4  3  8  2  1  1  1   6

charges : Hospital Charges
      n missing unique  Mean   .05   .10   .25   .50    .75    .90    .95
    517      20    516 86652 11075 15180 27389 51079 100904 205562 283411
lowest :   3448   4432   4574   5555   5849
highest: 504660 538323 543761 706577 740010

totcst : Total RCC cost
      n missing unique  Mean  .05  .10   .25   .50   .75    .90    .95
    471      66    471 46360 6359 8449 15412 29308 57028 108927 141569
lowest :      0   2071   2522   3191   3325
highest: 269057 269131 338955 357919 390460

totmcst : Total micro-cost
      n missing unique  Mean  .05  .10   .25   .50   .75   .90    .95
    331     206    328 39022 6131 8283 14415 26323 54102 87495 111920
lowest :      0   1562   2478   2626   3421
highest: 144234 154709 198047 234876 271467

avtisst : Average TISS, Days 3-25
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    536       1    205 29.83 12.46 14.50 19.62 28.00 39.00 47.17 50.37
lowest :  4.000  5.667  8.000  9.000  9.500
highest: 58.500 59.000 60.000 61.000 64.000

race
      n missing unique
    535       2      5

          white black asian other hispanic
Frequency   417    84     4     8       22
%            78    16     1     1        4

meanbp : Mean Arterial Blood Pressure Day 3
      n missing unique  Mean  .05  .10  .25  .50   .75   .90   .95
    537       0    109 83.28 41.8 49.0 59.0 73.0 111.0 124.4 135.0
lowest :   0  20  27  30  32, highest: 155 158 161 162 180
wblc : White Blood Cell Count Day 3
      n missing unique  Mean    .05    .10    .25     .50     .75     .90     .95
    532       5    241  14.1 0.8999 4.5000 7.9749 12.3984 18.1992 25.1891 30.1873
lowest :  0.05000  0.06999  0.09999  0.14999  0.19998
highest: 51.39844 58.19531 61.19531 79.39062 100.00000

hrt : Heart Rate Day 3
      n missing unique Mean .05 .10 .25 .50 .75 .90 .95
    537       0    111  105  51  60  75 111 126 140 155
lowest :   0  11  30  36  40, highest: 189 193 199 232 300

resp : Respiration Rate Day 3
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
    537       0     45 23.72   8  10  12  24  32  39  40
lowest :  0  4  6  7  8, highest: 48 49 52 60 64

temp : Temperature (celcius) Day 3
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    537       0     61 37.52 35.50 35.80 36.40 37.80 38.50 39.09 39.50
lowest : 32.50 34.00 34.09 34.90 35.00, highest: 40.20 40.59 40.90 41.00 41.20

pafi : PaO2/(.01*FiO2) Day 3
      n missing unique  Mean   .05    .10    .25    .50    .75    .90    .95
    500      37    357 227.2 86.99 105.08 137.88 202.56 290.00 390.49 433.31
lowest :  45.00  48.00  53.33  54.00  55.00
highest: 574.00 595.12 640.00 680.00 869.38

alb : Serum Albumin Day 3
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    346     191     34 2.668 1.700 1.900 2.225 2.600 3.100 3.400 3.800
lowest : 1.100 1.200 1.300 1.400 1.500, highest: 4.100 4.199 4.500 4.699 4.800

bili : Bilirubin Day 3
      n missing unique  Mean    .05    .10    .25    .50    .75    .90     .95
    386     151     88 2.678 0.3000 0.4000 0.6000 0.8999 2.0000 6.5996 13.1743
lowest :  0.09999  0.19998  0.29999  0.39996  0.50000
highest: 22.59766 30.00000 31.50000 35.00000 39.29688
crea : Serum creatinine Day 3
      n missing unique  Mean    .05    .10    .25    .50    .75    .90    .95
    537       0     84 2.232 0.6000 0.7000 0.8999 1.3999 2.5996 5.2395 7.3197
lowest :  0.3  0.4  0.5  0.6  0.7, highest: 10.4 10.6 11.2 11.6 11.8

sod : Serum sodium Day 3
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
    537       0     38 138.1 129 131 134 137 142 147 150
lowest : 118 120 121 126 127, highest: 156 157 158 168 175

ph : Serum pH (arterial) Day 3
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    500      37     49 7.416 7.270 7.319 7.380 7.420 7.470 7.510 7.529
lowest : 6.960 6.989 7.069 7.119 7.130, highest: 7.560 7.569 7.590 7.600 7.659

glucose : Glucose Day 3
      n missing unique  Mean  .05  .10   .25   .50   .75   .90   .95
    297     240    179 167.7 76.0 89.0 106.0 141.0 200.0 292.4 347.2
lowest :  30  42  52  55  68, highest: 446 468 492 576 598

bun : BUN Day 3
      n missing unique  Mean  .05   .10   .25   .50   .75   .90    .95
    304     233    100 38.91 8.00 11.00 16.75 30.00 56.00 79.70 100.70
lowest :   1   3   4   5   6, highest: 123 124 125 128 146

urine : Urine Output Day 3
      n missing unique Mean  .05   .10    .25    .50    .75    .90    .95
    303     234    262 2095 20.3 364.0 1156.5 1870.0 2795.0 4008.6 4817.5
lowest :    0    5    8   15   20, highest: 6865 6920 7360 7560 7750

adlp : ADL Patient Day 3
      n missing unique  Mean
    104     433      8 1.577

           0  1  2  3  4  5  6  7
Frequency 51 19  7  6  4  7  8  2
%         49 18  7  6  4  7  8  2
adls : ADL Surrogate Day 3
      n missing unique Mean
    392     145      8 1.86

            0  1  2  3  4  5  6  7
Frequency 185 68 22 18 17 20 39 23
%          47 17  6  5  4  5 10  6

sfdm2
      n missing unique
    468      69      5
no(M2 and SIP pres) (134, 29%), adl>=4 (>=5 if sur) (78, 17%),
SIP>=30 (30, 6%), Coma or Intub (5, 1%), <2 mo. follow-up (221, 47%)

adlsc : Imputed ADL Calibrated to Surrogate
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    537       0    144 2.119 0.000 0.000 0.000 1.839 3.375 6.000 6.000
lowest : 0.0000 0.4948 0.4948 1.0000 1.1667
highest: 5.7832 6.0000 6.3398 6.4658 7.0000
# Show patterns of missing data
plot(naclus(support[acute,]))    # Figure 7.1
Show associations between predictors using a general non-monotonic measure of dependence (Hoeffding D).

ac ← support[acute,]
ac$dzgroup ← ac$dzgroup[drop=TRUE]    # Remove unused levels
attach(ac)
vc ← varclus(~ age + sex + dzgroup + num.co + edu + income + scoma + race +
               meanbp + wblc + hrt + resp + temp + pafi + alb + bili +
               crea + sod + ph + glucose + bun + urine + adlsc,
             sim='hoeffding')
plot(vc)    # Figure 7.2
7.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model
dd ← datadist(ac)    # describe distributions of variables to rms
options(datadist='dd')
Figure 7.1: Cluster analysis showing which predictors tend to be missing on the same patients
Figure 7.2: Hierarchical clustering of potential predictors using Hoeffding D as a similarity measure. Categorical predictors are automatically expanded into dummy variables.
# Generate right-censored survival time variable
years ← d.time/365.25
units(years) ← 'Year'
S ← Surv(years, death)
# Show normal inverse Kaplan-Meier estimates
# stratified by dzgroup
survplot(survfit(S ~ dzgroup), conf='none', fun=qnorm, logt=TRUE)    # Figure 7.3
More stringent assessment of log-normal assumptions: check distribution of residuals from an adjusted model:
f ← psm(S ~ dzgroup + rcs(age,5) + rcs(meanbp,5),
        dist='lognormal', y=TRUE)    # dist='gaussian' for S+
r ← resid(f)
survplot(r, dzgroup, label.curve=FALSE)
survplot(r, age, label.curve=FALSE)
survplot(r, meanbp, label.curve=FALSE)
random.number ← runif(length(age))
survplot(r, random.number, label.curve=FALSE)    # Figure 7.4
Figure 7.3: Φ⁻¹(S_KM(t)) stratified by dzgroup. Linearity and semi-parallelism indicate a reasonable fit to the log-normal accelerated failure time model with respect to one predictor. The fit for dzgroup is not great but overall fit is good.
Remove from consideration predictors that are missing in >0.2 of the patients. Many of these were only collected for the second phase of SUPPORT.
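A sketch for listing which candidate predictors exceed this threshold (support and acute are as defined at the start of this chapter):

frac.na <- sapply(support[acute, ], function(x) mean(is.na(x)))
names(frac.na)[frac.na > 0.2]    # predictors missing in more than 20% of patients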
Of those variables to be included in the model, find which ones have enough potential predictive power to justify allowing for nonlinear relationships or multiple categories, which spend more d.f. For each variable compute Spearman ρ2 based on multiple linear regression of rank(x), rank(x)2 and the survival time,
Figure 7.4: Kaplan-Meier estimates of distributions of normalized, right-censored residuals from the fitted log-normal survival model. Residuals are stratified by important variables in the model (by quartiles of continuous variables), plus a random variable to depict the natural variability (in the lower right plot). Theoretical standard Gaussian distributions of residuals are shown with a thick solid line. The upper left plot is with respect to disease group.
truncating survival time at the shortest follow-up for survivors (356 days). This rids the data of censoring but creates many ties at 356 days.
shortest.follow.up ← min(d.time[death==0], na.rm=TRUE)
d.timet ← pmin(d.time, shortest.follow.up)
w ← spearman2(d.timet ~ age + num.co + scoma + meanbp + hrt + resp +
              temp + crea + sod + adlsc + wblc + pafi + ph + dzgroup +
              race, p=2)
plot(w, main='')    # Figure 7.5
Figure 7.5: Generalized Spearman ρ2 rank correlation between predictors and truncated survival time
A better approach is to use the complete information in the failure and censoring times by computing Somers' Dxy rank correlation allowing for censoring.

w ← rcorrcens(S ~ age + num.co + scoma + meanbp + hrt + resp + temp +
              crea + sod + adlsc + wblc + pafi + ph + dzgroup + race)
plot(w, main='')    # Figure 7.6
Figure 7.6: Somers' Dxy rank correlation between predictors and original survival time. For dzgroup or race, the correlation coefficient is the maximum correlation from using a dummy variable to represent the most frequent or one to represent the second most frequent category.
# Compute number of missing values per variable
sapply(llist(age, num.co, scoma, meanbp, hrt, resp, temp, crea, sod, adlsc,
             wblc, pafi, ph), function(x) sum(is.na(x)))
   age num.co  scoma meanbp    hrt   resp   temp   crea    sod  adlsc
     0      0      0      0      0      0      0      0      0      0
  wblc   pafi     ph
     5     37     37
# Can also do naplot(naclus(support[acute,]))
# Can also use the Hmisc naclus and naplot functions to do this
# Impute missing values with normal or modal values
wblc.i ← impute(wblc, 9)
pafi.i ← impute(pafi, 333.3)
ph.i   ← impute(ph, 7.4)
race2  ← race
levels(race2) ← list(white='white', other=levels(race)[-1])
race2[is.na(race2)] ← 'white'
dd ← datadist(dd, wblc.i, pafi.i, ph.i, race2)
Do a formal redundancy analysis using more than pairwise associations, and allow for non-monotonic transformations in predicting each predictor from all other predictors. This analysis requires missing values to be imputed so as to not greatly reduce the sample size.
redun(~ crea + age + sex + dzgroup + num.co + scoma + adlsc + race2 +
        meanbp + hrt + resp + temp + sod + wblc.i + pafi.i + ph.i, nk=4)
Redundancy Analysis

redun(formula = ~crea + age + sex + dzgroup + num.co + scoma + adlsc +
    race2 + meanbp + hrt + resp + temp + sod + wblc.i + pafi.i + ph.i,
    nk = 4)

n: 537   p: 16   nk: 4

Number of NAs: 0

Transformation of target variables forced to be linear

R2 cutoff: 0.9   Type: ordinary

R2 with which each variable can be predicted from all other variables:

  crea    age    sex dzgroup num.co  scoma  adlsc  race2
 0.133  0.246  0.132   0.451  0.147  0.418  0.153  0.151
meanbp    hrt   resp    temp    sod wblc.i pafi.i   ph.i
 0.178  0.258  0.131   0.197  0.135  0.093  0.143  0.171

No redundant variables
Better approach to gauging predictive potential and allocating d.f.:

• Allow all continuous variables to have the maximum number of knots entertained, in a log-normal survival model
• Must use imputation to avoid losing data
• Fit a "saturated" main effects model
• Makes full use of censored data
• Had to limit to 4 knots, force scoma to be linear, and omit ph.i to avoid singularity
k ← 4
f ← psm(S ~ rcs(age,k) + sex + dzgroup + pol(num.co,2) + scoma +
          pol(adlsc,2) + race + rcs(meanbp,k) + rcs(hrt,k) + rcs(resp,k) +
          rcs(temp,k) + rcs(crea,3) + rcs(sod,k) + rcs(wblc.i,k) +
          rcs(pafi.i,k), dist='lognormal')
plot(anova(f))    # Figure 7.7
Figure 7.7: Partial χ2 statistics for association of each predictor with response from saturated main effects model, penalized for d.f.
• Figure 7.7 properly blinds the analyst to the form of effects (tests of linearity).
• Fit a log-normal survival model with number of parameters corresponding to nonlinear effects determined from Figure 7.7. For the most promising predictors, five knots can be allocated, as there are fewer singularity problems once less promising predictors are simplified.
f ← psm(S ~ rcs(age,5) + sex + dzgroup + num.co + scoma + pol(adlsc,2) +
          race2 + rcs(meanbp,5) + rcs(hrt,3) + rcs(resp,3) + temp +
          rcs(crea,4) + sod + rcs(wblc.i,3) + rcs(pafi.i,4),
        dist='lognormal')    # 'gaussian' for S+
print(f, latex=TRUE)
Parametric Survival Model: Log Normal Distribution

psm(formula = S ~ rcs(age, 5) + sex + dzgroup + num.co + scoma +
    pol(adlsc, 2) + race2 + rcs(meanbp, 5) + rcs(hrt, 3) + rcs(resp, 3) +
    temp + rcs(crea, 4) + sod + rcs(wblc.i, 3) + rcs(pafi.i, 4),
    dist = "lognormal")

                      Model Likelihood      Discrimination
                      Ratio Test            Indexes
Obs         537       LR χ2     236.83      R2     0.594
Events      356       d.f.          30      g      1.959
σ        2.2308       Pr(> χ2) <0.0001      gr     7.095
                         Coef       S.E.    Wald Z   Pr(>|Z|)
(Intercept)            -5.6883     3.7851   -1.50     0.1329
age                    -0.0148     0.0309   -0.48     0.6322
age'                   -0.0412     0.1078   -0.38     0.7024
age''                   0.1670     0.5594    0.30     0.7653
age'''                 -0.2099     1.3707   -0.15     0.8783
sex=male               -0.0737     0.2181   -0.34     0.7354
dzgroup=Coma           -2.0676     0.4062   -5.09    <0.0001
dzgroup=MOSF w/Malig   -1.4664     0.3112   -4.71    <0.0001
num.co                 -0.1917     0.0858   -2.23     0.0255
scoma                  -0.0142     0.0044   -3.25     0.0011
adlsc                  -0.3735     0.1520   -2.46     0.0140
adlsc^2                 0.0442     0.0243    1.82     0.0691
race2=other             0.2979     0.2658    1.12     0.2624
meanbp                  0.0702     0.0210    3.34     0.0008
meanbp'                -0.3080     0.2261   -1.36     0.1732
meanbp''                0.8438     0.8556    0.99     0.3241
meanbp'''              -0.5715     0.7707   -0.74     0.4584
hrt                    -0.0171     0.0069   -2.46     0.0140
hrt'                    0.0064     0.0063    1.02     0.3090
resp                    0.0454     0.0230    1.97     0.0483
resp'                  -0.0851     0.0291   -2.93     0.0034
temp                    0.0523     0.0834    0.63     0.5308
crea                   -0.4585     0.6727   -0.68     0.4955
crea'                 -11.5176    19.0027   -0.61     0.5444
crea''                 21.9840    31.0113    0.71     0.4784
sod                     0.0044     0.0157    0.28     0.7792
wblc.i                  0.0746     0.0331    2.25     0.0242
wblc.i'                -0.0880     0.0377   -2.34     0.0195
pafi.i                  0.0169     0.0055    3.07     0.0021
pafi.i'                -0.0569     0.0239   -2.38     0.0173
pafi.i''                0.1088     0.0482    2.26     0.0239
Log(scale)              0.8024     0.0401   19.99    <0.0001
7.3 Summarizing the Fitted Model
• Plot the shape of the effect of each predictor on log survival time.
• All effects centered: can be placed on common scale
• Wald χ2 statistics, penalized for d.f., plotted in descending order
Table 7.2: Wald Statistics for S

                     χ2    d.f.   P
age                 15.99    4     0.0030
  Nonlinear          0.23    3     0.9722
sex                  0.11    1     0.7354
dzgroup             45.69    2    <0.0001
num.co               4.99    1     0.0255
scoma               10.58    1     0.0011
adlsc                8.28    2     0.0159
  Nonlinear          3.31    1     0.0691
race2                1.26    1     0.2624
meanbp              27.62    4    <0.0001
  Nonlinear         10.51    3     0.0147
hrt                 11.83    2     0.0027
  Nonlinear          1.04    1     0.3090
resp                11.10    2     0.0039
  Nonlinear          8.56    1     0.0034
temp                 0.39    1     0.5308
crea                33.63    3    <0.0001
  Nonlinear         21.27    2    <0.0001
sod                  0.08    1     0.7792
wblc.i               5.47    2     0.0649
  Nonlinear          5.46    1     0.0195
pafi.i              15.37    3     0.0015
  Nonlinear          6.97    2     0.0307
TOTAL NONLINEAR     60.48   14    <0.0001
TOTAL              261.47   30    <0.0001
plot(Predict(f, ref.zero=TRUE))    # Figure 7.8
latex(anova(f), file='', label='support-anovat')    # Table 7.2
plot(anova(f))    # Figure 7.9
options(digits=3)
plot(summary(f), log=TRUE, main='')    # Figure 7.10
7.4 Internal Validation of the Fitted Model Using the Bootstrap
Validate indexes describing the fitted model.
Figure 7.8: Effect of each predictor on log survival time. Predicted values have been centered so that predictions at predictor reference values are zero. Pointwise 0.95 confidence bands are also shown. As all Y-axes have the same scale, it is easy to see which predictors are strongest.
Figure 7.9: Contribution of variables in predicting survival time in log-normal model
Figure 7.10: Estimated survival time ratios for default settings of predictors. For example, when age changes from its lower quartile to the upper quartile (47.9y to 74.5y), median survival time decreases by more than half. Different shaded areas of bars indicate different confidence levels, ranging from 0.7 to 0.99.
# First add data to model fit so bootstrap can re-sample
# from the data
g ← update(f, x=TRUE, y=TRUE)
set.seed(717)
latex(validate(g, B=120, dxy=TRUE), digits=2, size='Ssize')
Index      Original  Training  Test     Optimism  Corrected   n
           Sample    Sample    Sample             Index
Dxy          0.49      0.51      0.46     0.05      0.43      120
R2           0.59      0.66      0.54     0.12      0.47      120
Intercept    0.00      0.00     -0.06     0.06     -0.06      120
Slope        1.00      1.00      0.90     0.10      0.90      120
D            0.48      0.55      0.42     0.13      0.35      120
U            0.00      0.00     -0.01     0.01     -0.01      120
Q            0.48      0.55      0.43     0.12      0.36      120
g            1.96      2.06      1.86     0.19      1.76      120
• From Dxy and R2 there is a moderate amount of overfitting.
• Slope shrinkage factor (0.90) is not troublesome.
• Almost unbiased estimate of future predictive discrimination on similar patients is the corrected Dxy of 0.43.
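The corrected column is simply the apparent (original-sample) index minus the bootstrap estimate of optimism; for example, for Dxy, using the rounded values displayed above (so the table's 0.43 is reproduced only approximately):

0.49 - 0.05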
Validate predicted 1-year survival probabilities. Use a smooth approach that does not require binning [71] and use less precise Kaplan-Meier estimates obtained by stratifying patients by the predicted probability, with at least 60 patients per group.
set.seed(717)
cal ← calibrate(g, u=1, B=120)
plot(cal, subtitles=FALSE)
cal ← calibrate(g, cmethod='KM', u=1, m=60, B=120, pr=FALSE)
plot(cal, add=TRUE)    # Figure 7.11
Figure 7.11: Bootstrap validation of calibration curve. Dots represent apparent calibration accuracy; × are bootstrap estimates corrected for overfitting, based on binning predicted survival probabilities and computing Kaplan-Meier estimates. Black curve is the estimated observed relationship using hare and the blue curve is the overfitting-corrected hare estimate. The gray-scale line depicts the ideal relationship.
7.5 Approximating the Full Model
The fitted log-normal model is perhaps too complex for routine use and for routine data collection. Let us develop a simplified model that can predict the predicted values of the full model with high accuracy (R2 = 0.96). The simplification is done using a fast backward stepdown against the full model predicted values.
Z ← predict(f)    # X*beta hat
a ← ols(Z ~ rcs(age,5) + sex + dzgroup + num.co + scoma + pol(adlsc,2) +
          race2 + rcs(meanbp,5) + rcs(hrt,3) + rcs(resp,3) + temp +
          rcs(crea,4) + sod + rcs(wblc.i,3) + rcs(pafi.i,4),
        sigma=1)
# sigma=1 is used to prevent sigma hat from being zero when
# R2=1.0 since we start out by approximating Z with all
# component variables
fastbw(a, aics=10000)    # fast backward stepdown
 Deleted Chi-Sq d.f. P      Residual d.f. P       AIC     R2
 sod       0.43   1  0.512      0.43   1  0.5117   -1.57  1.000
 sex       0.57   1  0.451      1.00   2  0.6073   -3.00  0.999
 temp      2.20   1  0.138      3.20   3  0.3621   -2.80  0.998
 race2     6.81   1  0.009     10.01   4  0.0402    2.01  0.994
 wblc.i   29.52   2  0.000     39.53   6  0.0000   27.53  0.976
 num.co   30.84   1  0.000     70.36   7  0.0000   56.36  0.957
 resp     54.18   2  0.000    124.55   9  0.0000  106.55  0.924
 adlsc    52.46   2  0.000    177.00  11  0.0000  155.00  0.892
 pafi.i   66.78   3  0.000    243.79  14  0.0000  215.79  0.851
 scoma    78.07   1  0.000    321.86  15  0.0000  291.86  0.803
 hrt      83.17   2  0.000    405.02  17  0.0000  371.02  0.752
 age      68.08   4  0.000    473.10  21  0.0000  431.10  0.710
 crea    314.47   3  0.000    787.57  24  0.0000  739.57  0.517
 meanbp  403.04   4  0.000   1190.61  28  0.0000 1134.61  0.270
 dzgroup 441.28   2  0.000   1631.89  30  0.0000 1571.89  0.000

Approximate Estimates after Deleting Factors

        Coef    S.E.   Wald Z  P
[1,] -0.5928 0.04315  -13.74   0

Factors in Final Model

None
f.approx ← ols(Z ~ dzgroup + rcs(meanbp,5) + rcs(crea,4) + rcs(age,5) +
                 rcs(hrt,3) + scoma + rcs(pafi.i,4) + pol(adlsc,2) +
                 rcs(resp,3), x=TRUE)
f.approx$stats
       n Model L.R.   d.f.    R2     g Sigma
 537.000   1688.225 23.000 0.957 1.915 0.370
• Estimate variance–covariance matrix of the coefficients of reduced model
• This covariance matrix does not include the scale parameter
V ← vcov(f, regcoef.only=TRUE)     # var(full model)
X ← g$x                            # full model design
x ← f.approx$x                     # approx. model design
w ← solve(t(x) %*% x, t(x)) %*% X  # contrast matrix
v ← w %*% V %*% t(w)
Compare variance estimates (diagonals of v) with variance estimates from a reduced model that is fitted against the actual outcomes.
f.sub ← psm(S ~ dzgroup + rcs(meanbp,5) + rcs(crea,4) + rcs(age,5) +
              rcs(hrt,3) + scoma + rcs(pafi.i,4) + pol(adlsc,2) +
              rcs(resp,3), dist='lognormal')    # 'gaussian' for S+
diag(v)/diag(vcov(f.sub, regcoef.only=TRUE))
           Intercept         dzgroup=Coma dzgroup=MOSF w/Malig
               0.981                0.979                0.979
              meanbp              meanbp'             meanbp''
               0.977                0.979                0.979
           meanbp'''                 crea                crea'
               0.979                0.979                0.979
              crea''                  age                 age'
               0.979                0.982                0.981
               age''               age'''                  hrt
               0.981                0.980                0.978
                hrt'                scoma               pafi.i
               0.976                0.979                0.980
             pafi.i'             pafi.i''                adlsc
               0.980                0.980                0.981
             adlsc^2                 resp                resp'
               0.981                0.978                0.977
Table 7.3: Wald Statistics for Z

                     χ2    d.f.   P
dzgroup             55.94    2    <0.0001
meanbp              29.87    4    <0.0001
  Nonlinear          9.84    3     0.0200
crea                39.04    3    <0.0001
  Nonlinear         24.37    2    <0.0001
age                 18.12    4     0.0012
  Nonlinear          0.34    3     0.9517
hrt                  9.87    2     0.0072
  Nonlinear          0.40    1     0.5289
scoma                9.85    1     0.0017
pafi.i              14.01    3     0.0029
  Nonlinear          6.66    2     0.0357
adlsc                9.71    2     0.0078
  Nonlinear          2.87    1     0.0904
resp                 9.65    2     0.0080
  Nonlinear          7.13    1     0.0076
TOTAL NONLINEAR     58.08   13    <0.0001
TOTAL              252.32   23    <0.0001
The ratios ranged from 0.978 to 0.982.
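A sketch for summarizing the spread of these ratios directly (v and f.sub are as computed above):

range(diag(v) / diag(vcov(f.sub, regcoef.only=TRUE)))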
f.approx$var ← v
latex(anova(f.approx, test='Chisq', ss=FALSE), file='',
      label='suport.anovaa')
Equation for simplified model:
# Typeset mathematical form of approximate model
latex(f.approx, file='')
E(Z) = Xβ, where

Xβ = -2.51
     - 1.94 {Coma} - 1.75 {MOSF w/Malig}
     + 0.068 meanbp - 3.08×10⁻⁵ (meanbp - 41.8)³₊ + 7.9×10⁻⁵ (meanbp - 61)³₊
       - 4.91×10⁻⁵ (meanbp - 73)³₊ + 2.61×10⁻⁶ (meanbp - 109)³₊ - 1.7×10⁻⁶ (meanbp - 135)³₊
     - 0.553 crea - 0.229 (crea - 0.6)³₊ + 0.45 (crea - 1.1)³₊ - 0.233 (crea - 1.94)³₊
       + 0.0131 (crea - 7.32)³₊
     - 0.0165 age - 1.13×10⁻⁵ (age - 28.5)³₊ + 4.05×10⁻⁵ (age - 49.5)³₊ - 2.15×10⁻⁵ (age - 63.7)³₊
       - 2.68×10⁻⁵ (age - 72.7)³₊ + 1.9×10⁻⁵ (age - 85.6)³₊
     - 0.0136 hrt + 6.09×10⁻⁷ (hrt - 60)³₊ - 1.68×10⁻⁶ (hrt - 111)³₊ + 1.07×10⁻⁶ (hrt - 140)³₊
     - 0.0135 scoma
     + 0.0161 pafi.i - 4.77×10⁻⁷ (pafi.i - 88)³₊ + 9.11×10⁻⁷ (pafi.i - 167)³₊
       - 5.02×10⁻⁷ (pafi.i - 276)³₊ + 6.76×10⁻⁸ (pafi.i - 426)³₊
     - 0.3693 adlsc + 0.0409 adlsc²
     + 0.0394 resp - 9.11×10⁻⁵ (resp - 10)³₊ + 0.000176 (resp - 24)³₊ - 8.5×10⁻⁵ (resp - 39)³₊

and {c} = 1 if subject is in group c, 0 otherwise; (x)₊ = x if x > 0, 0 otherwise.
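A quick sanity check of the approximation (a sketch; Z and f.approx are as computed above): the squared correlation between the full model's linear predictor and the approximate model's predictions should be near the R2 of roughly 0.96 quoted earlier.

cor(Z, predict(f.approx))^2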
Nomogram for predicting median and mean survival time, based on approximate model:
# Derive S functions that express mean and quantiles
# of survival time for specific linear predictors
# analytically
expected.surv ← Mean(f)
quantile.surv ← Quantile(f)
latex(expected.surv, file='', type='Sinput')

expected.surv ← function(lp = NULL, parms = 0.802352037606488)
{
    names(parms) ← NULL
    exp(lp + exp(2 * parms)/2)
}
latex(quantile.surv, file='', type='Sinput')

quantile.surv ← function(q = 0.5, lp = NULL, parms = 0.802352037606488)
{
    names(parms) ← NULL
    f ← function(lp, q, parms) lp + exp(parms) * qnorm(q)
    names(q) ← format(q)
    drop(exp(outer(lp, q, FUN = f, parms = parms)))
}
median.surv ← function(x) quantile.surv(lp=x)
# Improve variable labels for the nomogram
f.approx ← Newlabels(f.approx,
                     c('Disease Group', 'Mean Arterial BP', 'Creatinine',
                       'Age', 'Heart Rate', 'SUPPORT Coma Score',
                       'PaO2/(.01*FiO2)', 'ADL', 'Resp. Rate'))
nom ← nomogram(f.approx,
               pafi.i=c(0, 50, 100, 200, 300, 500, 600, 700, 800, 900),
               fun=list('Median Survival Time'=median.surv,
                        'Mean Survival Time'=expected.surv),
               fun.at=c(.1, .25, .5, 1, 2, 5, 10, 20, 40))
plot(nom, cex.var=1, cex.axis=.75, lmgp=.25)    # Figure 7.12
Figure 7.12: Nomogram for predicting median and mean survival time, based on approximation of full model
S Packages and Functions Used

Packages  Purpose                   Functions
Hmisc     Miscellaneous functions   describe, ecdf, naclus, varclus, llist,
                                    spearman2, describe, impute, latex
rms       Modeling                  datadist, psm, rcs, ols, fastbw
          Model presentation        survplot, Newlabels, Function,
                                    Mean, Quantile, nomogram
          Model validation          validate, calibrate

Note: All packages are available from CRAN
Bibliography
[1]D.G.Altman.Categorisingcontinuouscovariates
(letterto
theeditor).
BritJCancer,64:975,1991.
[26]
[2]D.G.Altman.Suboptimal
analysisusing‘optimal’cutpoints.BritJCancer,78:556–557,1998.
[26]
[3]D.G.Altman
andP.K.Andersen.Bootstrap
investigationofthestability
ofaCox
regressionmodel.StatMed,
8:771–783,1989.
[68]
[4]D.G.Altman,B.Lausen,W.Sauerbrei,andM.Schumacher.Dangersofusing‘optimal’cutpoints
intheevaluation
ofprognostic
factors.
JNat
CancerInst,86:829–835,1994.
[26,28]
[5]A.C.Atkinson.A
note
onthegeneralized
inform
ationcriterionforchoiceofamodel.Biometrika,67:413–418,
1980.
[39,67]
[6]P.C.Austin.Bootstrap
modelselectionhad
similarperform
ance
forselectingauthenticandnoisevariablescompared
tobackw
ardvariable
elim
ination:asimulationstudy.
JClin
Epi,61:1009–1017,2008.
[68]
[7]P.C.Austin,J.
V.Tu,andD.S.Lee.Logisticregressionhad
superiorperform
ance
compared
withregressiontrees
forpredictingin-hospital
mortalityin
patients
hospitalized
withheart
failure.JClin
Epi,63:1145–1155,2010.
[45]
[8]H.Belcher.Theconceptofresidual
confoundingin
regressionmodelsandsomeapplications.
StatMed,11:1747–
1758,1992.
[26]
[9]D.A.Belsley.ConditioningDiagnostics:
Collinearity
andWeakDatain
Regression.Wiley,New
York,
1991.
[74]
[10]D.A.Belsley,E.Kuh,andR.E.Welsch.
RegressionDiagnostics:
IdentifyingInfluential
DataandSources
of
Collinearity.Wiley,New
York,
1980.
[89,90]
[11]J.
K.Benedetti,P.Liu,H.N.Sather,J.
Seinfeld,andM.A.Epton.Effective
sample
size
fortestsofcensored
survival
data.
Biometrika,69:343–349,1982.
[69]
[12]K.Berhane,
M.Hauptm
ann,andB.Langholz.Usingtensorproduct
splines
inmodelingexposure–time–response
relationships:
Applicationto
theColoradoPlateau
Uranium
Minerscohort.
StatMed,27:5484–5496,2008.
[57]
[13]M.Blettner
andW.Sauerbrei.Influence
ofmodel-buildingstrategiesontheresultsofacase-controlstudy.
Stat
Med,12:1325–1338,1993.
[118]
[14]J.
G.Booth
andS.Sarkar.
Monte
Carlo
approxim
ationofbootstrap
variances.
Am
Statistician,52:354–357,1998.
[108]
[15]R.Bordley.
Statistical
decisionmakingwithoutmath.Chance,20(3):39–44,2007.
[8]
[16]L.Breim
an.Thelittlebootstrap
andother
methodsfordim
ensionalityselectionin
regression:X-fixedprediction
error.
JAm
StatAssoc,
87:738–754,1992.
[67,68,111]
[17]L.Breim
anandJ.
H.Friedman.Estim
atingoptimal
transformationsformultiple
regressionandcorrelation(w
ith
discussion).
JAm
StatAssoc,
80:580–619,1985.
[82]
[18]L.Breim
an,J.
H.Friedman,R.A.Olshen,andC.J.
Stone.
ClassificationandRegressionTrees.Wadsw
orth
and
Brooks/Cole,PacificGrove,CA,1984.
[43]
[19]W.M.BriggsandR.Zaretzki.Theskill
plot:
Agraphical
techniqueforevaluatingcontinuousdiagnostictests(w
ith
discussion).
Biometrics,64:250–261,2008.
[8]
[20]D.Brownstone.
Regressionstrategies.
InProceedingsofthe20th
Sym
posium
ontheInterfacebetweenComputer
Science
andStatistics,pages
74–79,Washington,DC,1988.American
Statistical
Association.
[118]
[21]P.Buettner,C.Garbe,
andI.Guggenmoos-Holzmann.Problemsin
definingcutoffpoints
ofcontinuousprognostic
factors:
Example
oftumor
thickn
essin
prim
arycutaneousmelanoma.
JClin
Epi,50:1201–1210,1997.
[26]
[22]J.
M.Cham
bersandT.J.
Hastie,
editors.
Statistical
Modelsin
S.Wadsw
orth
andBrooks/Cole,PacificGrove,CA,
1992.
[57]
[23]C.Chatfield.Avoidingstatisticalpitfalls
(withdiscussion).
Statistical
Sci,6:240–268,1991.
[90]
[24]C.Chatfield.Modeluncertainty,dataminingandstatisticalinference
(withdiscussion).
JRoy
StatSocA,158:419–
466,1995.
[65,118]
[25]S.Chatterjee
andB.Price.RegressionAnalysisby
Example.Wiley,New
York,
secondedition,1991.
[73]
[26]A.Ciampi,J.
Thiffault,J.-P.Nakache,andB.Asselain.Stratificationby
stepwiseregression,correspondence
analysis
andrecursivepartition.CompStatDataAnalysis,1986:185–204,1986.
[77]
[27]W.S.Cleveland.Robust
locally
weightedregressionandsm
oothingscatterplots.JAm
StatAssoc,
74:829–836,
1979.
[41]
[28]E.F.CookandL.Goldman.Asymmetricstratification:Anoutlineforan
efficientmethodforcontrollingconfounding
incohortstudies.
Am
JEpi,127:626–639,1988.
[45]
[29]J.
B.Copas.Regression,predictionandshrinkage(w
ithdiscussion).
JRoy
StatSocB,45:311–354,1983.
[71,72]
[30]J.
B.Copas.Cross-validationshrinkageofregressionpredictors.JRoy
StatSocB,49:175–183,1987.
[116]
[31]D.R.Cox.Regressionmodelsandlife-tables(w
ithdiscussion).
JRoy
StatSocB,34:187–220,1972.
[59]
[32]S.L.Crawford,S.L.Tennstedt,andJ.
B.McK
inlay.
Acomparisonofanalyticmethodsfornon-random
missingness
ofoutcomedata.
JClin
Epi,48:209–219,1995.
[94]
[33]N.J.
CrichtonandJ.
P.Hinde.
Correspondence
analysisas
ascreeningmethodforindicants
forclinical
diagnosis.
StatMed,8:1351–1362,1989.
[77]
[34]R.B.D’Agostino,A.J.
Belanger,E.W.Markson,M.Kelly-H
ayes,andP.A.Wolf.Developmentofhealthrisk
appraisalfunctionsin
thepresence
ofmultiple
indicators:
TheFramingham
Studynursinghomeinstitutionalization
model.StatMed,14:1757–1770,1995.
[74,76]
[35]C.E.Davis,J.
E.Hyde,
S.I.Bangdiwala,
andJ.
J.Nelson.Anexam
ple
ofdependencies
amongvariablesin
aconditional
logisticregression.In
S.MoolgavkarandR.Prentice,editors,
ModernStatistical
Methodsin
Chronic
Disease
Epidem
iology,
pages
140–147.Wiley,New
York,
1986.
[74]
[36]S.Derksen
andH.J.
Keselman.Backw
ard,forwardandstepwiseautomated
subsetselectionalgorithms:
Frequency
ofobtainingauthenticandnoisevariables.
British
JMathStatPsych,45:265–282,1992.
[66]
[37]T.F.Devlin
andB.J.
Weeks.Splinefunctionsforlogisticregressionmodeling.In
ProceedingsoftheEleventh
Annual
SASUsers
GroupInternational
Conference,pages
646–651,Cary,NC,1986.SASInstitute,Inc.
[35]
[38]W.D.Dupont.
Statistical
ModelingforBiomedical
Researchers.
Cam
bridgeUniversity
Press,Cam
bridge,
UK,
secondedition,2008.
[192]
[39]S.Durrleman
andR.Sim
on.Flexible
regressionmodelswithcubic
splines.StatMed,8:551–561,1989.
[38]
[40]B.Efron.
Estim
atingtheerrorrate
ofapredictionrule:Im
provementoncross-validation.
JAm
StatAssoc,
78:316–331,1983.
[112,115,116]
[41]B.EfronandR.Tibshirani.AnIntroductionto
theBootstrap.Chapman
andHall,New
York,
1993.
[115]
[42]B.EfronandR.Tibshirani.
Improvements
oncross-validation:The.632+
bootstrap
method.JAm
StatAssoc,
92:548–560,1997.
[115]
[43]J.
Fan
andR.A.Levine.
Toam
nio
ornotto
amnio:That
isthedecisionforBayes.Chance,20(3):26–32,2007.[8]
[44]D.FaraggiandR.Sim
on.
Asimulationstudyofcross-validationforselectingan
optimal
cutpointin
univariate
survival
analysis.StatMed,15:2203–2213,1996.
[26]
[45]J.
J.Faraw
ay.Thecost
ofdataanalysis.JCompGraphStat,1:213–229,1992.
[97,115,117]
[46]V.Fedorov,
F.Mannino,andR.Zhang.Consequencesofdichotomization.Pharm
Stat,8:50–61,2009.
[7,26]
[47]D.Freedman,W.Navidi,andS.Peters.
OntheIm
pactofVariableSelectionin
FittingRegressionEquations,pages
1–16.Lecture
Notesin
EconomicsandMathem
atical
Systems.Springer-Verlag,New
York,
1988.
[116]
[48]J.
H.Friedman.Avariablespan
smoother.TechnicalReport5,Lab
oratoryforComputationalStatistics,Departm
ent
ofStatistics,Stanford
University,1984.
[82]
[49]M.H.GailandR.M.Pfeiffer.Oncriteria
forevaluatingmodelsofabsolute
risk.Biostatistics,6(2):227–239,2005.
[8]
[50]T.GneitingandA.E.Raftery.Strictlyproper
scoringrules,prediction,andestimation.JAm
StatAssoc,
102:359–
378,2007.
[8]
[51]U.S.Govindarajulu,D.Spiegelman,S.W.Thurston,B.Ganguli,
andE.A.Eisen.Comparingsm
oothingtechniques
inCox
modelsforexposure-response
relationships.
StatMed,26:3735–3752,2007.
[39]
[52]P.M.GrambschandP.C.O’Brien.Theeff
ectsoftransformationsandprelim
inarytestsfornon-linearity
inregression.
StatMed,10:697–709,1991.
[48,66]
[53]R.J.
Gray.
Flexible
methodsforanalyzingsurvival
datausingsplines,withapplicationsto
breast
cancerprognosis.
JAm
StatAssoc,
87:942–951,1992.
[56,72]
[54]R.J.
Gray.
Spline-based
testsin
survival
analysis.Biometrics,50:640–652,1994.
[56]
[55]M.J.
Greenacre.Correspondence
analysis
ofmultivariate
categorical
databy
weightedleast-squares.Biometrika,
75:457–467,1988.
[77]
[56]S.Greenland.When
should
epidem
iologicregressionsuse
random
coeffi
cients?Biometrics,56:915–921,2000.[66,
92]
[57]F.E.Harrell.
TheLOGIST
Procedure.In
SUGISupplementalLibrary
Users
Guide,
pages
269–293.SASInstitute,
Inc.,Cary,NC,Version5edition,1986.
[67]
[58]F.E.Harrell,
K.L.Lee,R.M.Califf,D.B.Pryor,andR.A.Rosati.Regressionmodelingstrategiesforim
proved
prognostic
prediction.StatMed,3:143–152,1984.
[69]
[59]F.E.Harrell,
K.L.Lee,D.B.Matchar,andT.A.Reichert.Regressionmodelsforprognosticprediction:Advantages,
problems,andsuggestedsolutions.
CancerTreatmentReports,69:1071–1077,1985.
[69]
[60]F.E.Harrell,
K.L.Lee,andB.G.Pollo
ck.Regressionmodelsin
clinical
studies:
Determiningrelationshipsbetween
predictors
andresponse.JNat
CancerInst,80:1198–1202,1988.
[42]
[61]F.E.Harrell,
P.A.Margolis,S.Gove,K.E.Mason,E.K.Mulholland,D.Lehmann,L.Muhe,
S.Gatchalian,and
H.F.Eichenwald.Developmentofaclinicalpredictionmodelforan
ordinaloutcome:
TheWorldHealthOrganization
ARIMulticentreStudyofclinical
signsandetiologic
agents
ofpneumonia,sepsis,
andmeningitisin
younginfants.
StatMed,17:909–944,1998.
[72,95]
[62]T.Hastie,
R.Tibshirani,andJ.
H.Friedman.TheElements
ofStatistical
Learning.
Springer,New
York,
second
edition,2008.ISBN-10:0387848576;ISBN-13:978-0387848570.
[47]
[63]T.J.
HastieandR.J.
Tibshirani.
Generalized
AdditiveModels.
Chapman
&Hall/CRC,Boca
Raton,FL,1990.
ISBN
9780412343902.
[47]
[64]S.G.HilsenbeckandG.M.Clark.Practical
p-valueadjustmentforoptimallyselected
cutpoints.StatMed,15:103–
112,1996.
[26]
[65]W.Hoeff
ding.Anon-param
etrictest
ofindependence.AnnMathStat,19:546–557,1948.
[77]
[66]N.Hollander,W.Sauerbrei,andM.Schumacher.Confidence
intervalsfortheeff
ectofaprognostic
factor
after
selectionofan
‘optimal’cutpoint.
StatMed,23:1701–1713,2004.
[26,28]
[67]C.M.HurvichandC.L.Tsai.
Theim
pactofmodel
selectiononinference
inlinearregression.Am
Statistician,
44:214–217,1990.
[68]
[68]L.I.Iezzoni.Dim
ensionsofrisk.In
L.I.Iezzoni,editor,RiskAdjustmentforMeasuringHealthOutcomes,chapter2,
pages
29–118.FoundationoftheAmerican
CollegeofHealthcare
Executives,AnnArbor,MI,1994.
[13]
[69]J.
Karvanen
andF.E.Harrell.
Visualizingcovariates
inproportional
hazardsmodel.StatMed,28:1957–1966,2009.
PMID
19378282.
[100]
[70]W.A.Knaus,
F.E.Harrell,
J.Lynn,L.Goldman,R.S.Phillips,
A.F.Connors,
N.V.Daw
son,W.J.
Fulkerson,
R.M.Califf,N.Desbiens,
P.Layde,
R.K.Oye,P.E.Bellamy,
R.B.Hakim
,andD.P.Wagner.TheSUPPORT
prognostic
model:Objectiveestimates
ofsurvival
forseriouslyill
hospitalized
adults.
AnnIntMed,122:191–203,
1995.
[83,157]
[71]C.Kooperberg,C.J.
Stone,
andY.K.Truong.Hazardregression.JAm
StatAssoc,
90:78–94,1995.
[177]
[72]W.F.Kuhfeld.ThePRINQUALprocedure.In
SAS/STAT
9.2
User’sGuide.
SASPublishing,CaryNC,second
edition,2009.
[78]
[73]B.LausenandM.Schumacher.Evaluatingtheeff
ectofoptimized
cutoffvalues
intheassessmentofprognostic
factors.
CompStatDataAnalysis,1996.
[26]
[74]J.
F.Law
less
andK.Singhal.Efficientscreeningofnonnormal
regressionmodels.
Biometrics,34:318–327,1978.
[68]
[75]S.le
CessieandJ.
C.vanHouw
elingen.Ridgeestimatorsin
logisticregression.ApplStat,41:191–201,1992.
[72]
[76]A.Leclerc,D.Luce,F.Lert,
J.F.Chastang,andP.Logeay.
Correspondance
analysis
andlogisticmodelling:
Complementary
use
intheanalysisofahealthsurvey
amongnurses.StatMed,7:983–995,1988.
[77]
[77]S.Lee,J.
Z.Huang,andJ.
Hu.
Sparse
logisticprincipal
components
analysis
forbinarydata.
AnnApplStat,
4(3):1579–1601,2010.
[47]
[78]C.LengandH.Wang.Ongeneraladaptive
sparse
principalcomponentanalysis.JCompGraphStat,18(1):201–215,
2009.
[47]
[79]X.Luo,L.A.Stfanski,andD.D.Boos.
Tuningvariable
selectionproceduresby
addingnoise.
Technometrics,
48:165–175,2006.
[15]
[80]N.Mantel.Whystepdow
nproceduresin
variable
selection.Technometrics,12:621–625,1970.
[68]
[81]S.E.MaxwellandH.D.Delaney.Bivariate
mediansplitsandspuriousstatisticalsignificance.PsychologicalBulletin,
113:181–190,1993.
[26]
[82]G.P.McC
abe.
Principal
variables.
Technometrics,26:137–144,1984.
[76]
[83]G.Michailidis
andJ.
deLeeuw
.TheGifi
system
ofdescriptive
multivariate
analysis.Statistical
Sci,13:307–336,
1998.
[77,77]
[84] B. K. Moser and L. P. Coombs. Odds ratios for a continuous outcome variable without dichotomizing. Stat Med, 23:1843–1860, 2004. [26]
[85] R. H. Myers. Classical and Modern Regression with Applications. PWS-Kent, Boston, 1990. [73]
[86] N. J. D. Nagelkerke. A note on a general definition of the coefficient of determination. Biometrika, 78:691–692, 1991. [91]
[87] D. Paul, E. Bair, T. Hastie, and R. Tibshirani. "Preconditioning" for feature selection and regression in high-dimensional problems. Ann Stat, 36(4):1595–1619, 2008. [47]
[88] P. Peduzzi, J. Concato, A. R. Feinstein, and T. R. Holford. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epi, 48:1503–1510, 1995. [69]
[89] P. Peduzzi, J. Concato, E. Kemper, T. R. Holford, and A. R. Feinstein. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epi, 49:1373–1379, 1996. [69]
[90] N. Peek, D. G. T. Arts, R. J. Bosman, P. H. J. van der Voort, and N. F. de Keizer. External validation of prognostic models for critically ill patients required substantial sample sizes. J Clin Epi, 60:491–501, 2007. [92]
[91] M. J. Pencina, R. B. D'Agostino Sr, R. B. D'Agostino Jr, and R. S. Vasan. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med, 27:157–172, 2008. [92]
[92] P. Radchenko and G. M. James. Variable inclusion and shrinkage algorithms. J Am Stat Assoc, 103(483):1304–1315, 2008. [46]
[93] D. R. Ragland. Dichotomizing continuous outcome variables: Dependence of the magnitude of association and statistical power on the cutpoint. Epidemiology, 3:434–440, 1992. [26]
[94] B. M. Reilly and A. T. Evans. Translating clinical research into clinical practice: Impact of using prediction rules to make decisions. Ann Int Med, 144:201–209, 2006. [10]
[95] E. B. Roecker. Prediction error and its estimation for subset-selected models. Technometrics, 33:459–468, 1991. [67, 111]
[96] P. Royston, D. G. Altman, and W. Sauerbrei. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med, 25:127–141, 2006. [26]
[97] W. S. Sarle. The VARCLUS procedure. In SAS/STAT User's Guide, volume 2, chapter 43, pages 1641–1659. SAS Institute, Inc., Cary NC, fourth edition, 1990. [74, 76]
[98] W. Sauerbrei and M. Schumacher. A bootstrap resampling procedure for model building: Application to the Cox regression model. Stat Med, 11:2093–2109, 1992. [68, 112]
[99] G. Schulgen, B. Lausen, J. Olsen, and M. Schumacher. Outcome-oriented cutpoints in quantitative exposure. Am J Epi, 120:172–184, 1994. [26, 28]
[100] J. Shao. Linear model selection by cross-validation. J Am Stat Assoc, 88:486–494, 1993. [112]
[101] L. R. Smith, F. E. Harrell, and L. H. Muhlbaier. Problems and potentials in modeling survival. In M. L. Grady and H. A. Schwartz, editors, Medical Effectiveness Research Data Methods (Summary Report), AHCPR Pub. No. 92-0056, pages 151–159. US Dept. of Health and Human Services, Agency for Health Care Policy and Research, Rockville, MD, 1992. Available from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/FrankHarrell/smi92pro.pdf. [69]
[102] I. Spence and R. F. Garrison. A remarkable scatterplot. Am Statistician, 47:12–19, 1993. [90]
[103] D. J. Spiegelhalter. Probabilistic prediction in patient management and clinical trials. Stat Med, 5:421–433, 1986. [71, 96, 115, 116]
[104] E. W. Steyerberg. Clinical Prediction Models. Springer, New York, 2009. [2, 192]
[105] E. W. Steyerberg, M. J. C. Eijkemans, F. E. Harrell, and J. D. F. Habbema. Prognostic modelling with logistic regression analysis: A comparison of selection and estimation methods in small data sets. Stat Med, 19:1059–1079, 2000. [46]
[106] C. J. Stone. Comment: Generalized additive models. Statistical Sci, 1:312–314, 1986. [38]
[107] C. J. Stone and C. Y. Koo. Additive splines in statistics. In Proceedings of the Statistical Computing Section ASA, pages 45–48, Washington, DC, 1985. [34, 39]
[108] S. Suissa and L. Blais. Binary regression with continuous outcomes. Stat Med, 14:247–255, 1995. [26]
[109] G. Sun, T. L. Shook, and G. L. Kay. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epi, 49:907–916, 1996. [70]
[110] R. Tibshirani. Regression shrinkage and selection via the lasso. J Roy Stat Soc B, 58:267–288, 1996. [46]
[111] J. C. van Houwelingen and S. le Cessie. Predictive value of statistical models. Stat Med, 9:1303–1325, 1990. [39, 72, 112, 116, 117]
[112] P. Verweij and H. C. van Houwelingen. Penalized likelihood in Cox regression. Stat Med, 13:2427–2436, 1994. [72]
[113] A. J. Vickers. Decision analysis for the evaluation of diagnostic tests, prediction models, and molecular markers. Am Statistician, 62(4):314–320, 2008. [8]
[114] E. Vittinghoff and C. E. McCulloch. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epi, 165:710–718, 2006. [69]
[115] H. Wainer. Finding what is not there through the unfortunate binning of results: The Mendel effect. Chance, 19(1):49–56, 2006. [26, 29]
[116] H. Wang and C. Leng. Unified LASSO estimation by least squares approximation. J Am Stat Assoc, 102:1039–1048, 2007. [46]
[117] S. Wang, B. Nan, N. Zhou, and J. Zhu. Hierarchically penalized Cox regression with grouped variables. Biometrika, 96(2):307–322, 2009. [46]
[118] Y. Wax. Collinearity diagnosis for a relative risk regression analysis: An application to assessment of diet-cancer relationship in epidemiological studies. Stat Med, 11:1273–1287, 1992. [74]
[119] J. Whitehead. Sample size calculations for ordered categorical data. Stat Med, 12:2257–2271, 1993. [69]
[120] R. E. Wiegand. Performance of using multiple stepwise algorithms for variable selection. Stat Med, 29:1647–1659, 2010. [68]
[121] D. M. Witten and R. Tibshirani. Testing significance of features by lassoed principal components. Ann Appl Stat, 2(3):986–1012, 2008. [47]
[122] S. N. Wood. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC, Boca Raton, FL, 2006. ISBN 9781584884743. [47]
[123] C. F. J. Wu. Jackknife, bootstrap and other resampling methods in regression analysis. Ann Stat, 14(4):1261–1350, 1986. [112]
[124] S. Xiong. Some notes on the nonnegative garrote. Technometrics, 2010. [47]
[125] J. Ye. On measuring and correcting the effects of data mining and model selection. J Am Stat Assoc, 93:120–131, 1998. [15]
[126] F. W. Young, Y. Takane, and J. de Leeuw. The principal components of mixed measurement level multivariate data: An alternating least squares method with optimal scaling features. Psychometrika, 43:279–281, 1978. [77]
[127] H. H. Zhang and W. Lu. Adaptive lasso for Cox's proportional hazards model. Biometrika, 94:691–703, 2007. [46]
[128] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. J Comp Graph Stat, 15:265–286, 2006. [47]
[129] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J Roy Stat Soc B, 67(2):301–320, 2005. [46]
R packages written by FE Harrell are freely available from CRAN.
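As a minimal sketch of obtaining the software (assuming a current R installation with internet access), the rms package can be installed and loaded in the usual way:

install.packages("rms")   # installs rms and its dependencies, including Hmisc
library(rms)              # loading rms also attaches Hmisc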
To obtain a 588-page book with detailed examples, case studies, and notes on the theory and applications of survival analysis, logistic regression, and linear models, order Regression Modeling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis by FE Harrell from Springer NY (2001). Steyerberg [104] and Dupont [38] are excellent accompanying texts for the book. To obtain a glossary of statistical terms and other handouts related to diagnostic and prognostic modeling, point your Web browser to biostat.mc.vanderbilt.edu/ClinStat.