Regression Modeling Strategies
using the R Package rms

Frank E Harrell Jr
Department of Biostatistics
Vanderbilt University School of Medicine
Nashville TN 37232
biostat.mc.vanderbilt.edu/rms

useR! The R User Conference
University of Warwick, Coventry, UK
15 August 2011

Copyright 1995-2011 FE Harrell, All Rights Reserved
Contents

1 Introduction
  1.1 Hypothesis Testing, Estimation, and Prediction
  1.2 Examples of Uses of Predictive Multivariable Modeling
  1.3 Misunderstandings about Prediction vs. Classification
  1.4 Planning for Modeling
  1.5 Choice of the Model
  1.6 Model Uncertainty / Data-Driven Model Specification

2 General Aspects of Fitting Regression Models
  2.1 Notation for Multivariable Regression Models
  2.2 Model Formulations
  2.3 Interpreting Model Parameters
    2.3.1 Nominal Predictors
    2.3.2 Interactions
    2.3.3 Example: Inference for a Simple Model
  2.4 Review of Composite (Chunk) Tests
  2.5 Relaxing Linearity Assumption for Continuous Predictors
    2.5.1 Avoiding Categorization
    2.5.2 Simple Nonlinear Terms
    2.5.3 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations
    2.5.4 Cubic Spline Functions
    2.5.5 Restricted Cubic Splines
    2.5.6 Choosing Number and Position of Knots
    2.5.7 Nonparametric Regression
    2.5.8 Advantages of Regression Splines over Other Methods
  2.6 Recursive Partitioning: Tree-Based Models
  2.7 New Directions in Predictive Modeling
  2.8 Multiple Degree of Freedom Tests of Association
  2.9 Assessment of Model Fit
    2.9.1 Regression Assumptions
    2.9.2 Modeling and Testing Complex Interactions
    2.9.3 Fitting Ordinal Predictors
    2.9.4 Distributional Assumptions

3 Multivariable Modeling Strategies
  3.1 Prespecification of Predictor Complexity Without Later Simplification
    3.1.1 Learning From a Saturated Model
    3.1.2 Using Marginal Generalized Rank Correlations
  3.2 Checking Assumptions of Multiple Predictors Simultaneously
  3.3 Variable Selection
  3.4 Overfitting and Limits on Number of Predictors
  3.5 Shrinkage
  3.6 Collinearity
  3.7 Data Reduction
    3.7.1 Redundancy Analysis
    3.7.2 Variable Clustering
    3.7.3 Transformation and Scaling Variables Without Using Y
    3.7.4 Simultaneous Transformation and Imputation
    3.7.5 Simple Scoring of Variable Clusters
    3.7.6 Simplifying Cluster Scores
    3.7.7 How Much Data Reduction Is Necessary?
  3.8 Overly Influential Observations
  3.9 Comparing Two Models
  3.10 Summary: Possible Modeling Strategies
    3.10.1 Developing Predictive Models
    3.10.2 Developing Models for Effect Estimation
    3.10.3 Developing Models for Hypothesis Testing

4 Describing, Resampling, Validating, and Simplifying the Model
  4.1 Describing the Fitted Model
    4.1.1 Interpreting Effects
    4.1.2 Indexes of Model Performance
  4.2 The Bootstrap
  4.3 Model Validation
    4.3.1 Introduction
    4.3.2 Which Quantities Should Be Used in Validation?
    4.3.3 Data-Splitting
    4.3.4 Improvements on Data-Splitting: Resampling
    4.3.5 Validation Using the Bootstrap
  4.4 Simplifying the Final Model by Approximating It
    4.4.1 Difficulties Using Full Models
    4.4.2 Approximating the Full Model
  4.5 How Do We Break Bad Habits?

5 S Software
  5.1 The S Modeling Language
  5.2 User-Contributed Functions
  5.3 The rms Package
  5.4 Other Functions

6 Logistic Model Case Study: Survival of Titanic Passengers
  6.1 Descriptive Statistics
  6.2 Exploring Trends with Nonparametric Regression
  6.3 Binary Logistic Model with Casewise Deletion of Missing Values
  6.4 Examining Missing Data Patterns
  6.5 Single Conditional Mean Imputation
  6.6 Multiple Imputation
  6.7 Summarizing the Fitted Model

7 Case Study in Parametric Survival Modeling and Model Approximation
  7.1 Descriptive Statistics
  7.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model
  7.3 Summarizing the Fitted Model
  7.4 Internal Validation of the Fitted Model Using the Bootstrap
  7.5 Approximating the Full Model

Bibliography
Course Philosophy

• Satisfaction of model assumptions improves precision and increases statistical power
• It is more productive to make a model fit step by step (e.g., transformation estimation) than to postulate a simple model and find out what went wrong
• Graphical methods should be married to formal inference
• Overfitting occurs frequently, so data reduction and model validation are important
• Software without multiple facilities for assessing and fixing model fit may only seem to be user-friendly
• Carefully fitting an improper model is better than badly fitting (and overfitting) a well-chosen one
• Methods which work for all types of regression models are the most valuable
• In most research projects the cost of data collection far outweighs the cost of data analysis, so it is important to use the most efficient and accurate modeling techniques, to avoid categorizing continuous variables, and to not remove data from the estimation sample just to be able to validate the model
• The bootstrap is a breakthrough for statistical modeling and model validation
• Using the data to guide the data analysis is almost as dangerous as not doing so
• A good overall strategy is to decide how many degrees of freedom (i.e., number of regression parameters) can be "spent", where they should be spent, and then to spend them with no regrets

See the excellent text Clinical Prediction Models by Steyerberg [104].
Chapter 1  Introduction

1.1 Hypothesis Testing, Estimation, and Prediction

Even when only testing H0, a model-based approach has advantages:

• Permutation and rank tests are not as useful for estimation
• They cannot readily be extended to cluster sampling or repeated measurements
• Models generalize tests
  – 2-sample t-test, ANOVA → multiple linear regression
  – Wilcoxon, Kruskal-Wallis, Spearman → proportional odds ordinal logistic model
  – log-rank → Cox
• Models not only allow for multiplicity adjustment but also for shrinkage of estimates
  – Statisticians are comfortable with P-value adjustment but fail to recognize that the difference between the most different treatments is badly biased

Statistical estimation is usually model-based:

• Relative effect of increasing cholesterol from 200 to 250 mg/dl on the hazard of death, holding other risk factors constant
• Adjustment depends on how the other risk factors relate to the hazard
• Usually interested in adjusted (partial) effects, not unadjusted (marginal or crude) effects
1.2 Examples of Uses of Predictive Multivariable Modeling

• Financial performance, consumer purchasing, loan pay-back
• Ecology
• Product life
• Employment discrimination
• Medicine, epidemiology, health services research
• Probability of diagnosis, time course of a disease
• Comparing non-randomized treatments
• Getting the correct estimate of relative effects in randomized studies requires covariable adjustment if the model is nonlinear
  – Crude odds ratios are biased towards 1.0 if the sample is heterogeneous
• Estimating absolute treatment effect (e.g., risk difference)
  – Use, e.g., the difference in two predicted probabilities
• Cost-effectiveness ratios
  – incremental cost / incremental ABSOLUTE benefit
  – most studies use average cost difference / average benefit, which may apply to no one
1.3 Misunderstandings about Prediction vs. Classification

• Many analysts desire to develop "classifiers" instead of predictions
• Suppose that
  1. the response variable is binary,
  2. the two levels represent a sharp dichotomy with no gray zone (e.g., complete success vs. total failure with no possibility of a partial success),
  3. one is forced to assign (classify) future observations to only these two choices, and
  4. the cost of misclassification is the same for every future observation, and the ratio of the cost of a false positive to the cost of a false negative equals the (often hidden) ratio implied by the analyst's classification rule
• Then classification is still suboptimal for driving the development of a predictive instrument as well as for hypothesis testing and estimation
• Far better is to use the full information in the data to develop a probability model, then develop classification rules on the basis of estimated probabilities
  – ↑ power, ↑ precision
• Classification is more problematic if the response variable is ordinal or continuous or the groups are not truly distinct (e.g., disease or no disease when severity of disease is on a continuum); dichotomizing it up front for the analysis is not appropriate
  – the minimum loss of information (when dichotomization is at the median) is large
  – may require the sample size to increase many-fold to compensate for the loss of information [46]
• Two-group classification represents an artificial forced choice
  – the best option may be "no choice, get more data"
• Unlike prediction (e.g., of absolute risk), classification implicitly uses utility (loss; cost of a false positive or false negative) functions
• Hidden problems:
  – The utility function depends on variables not collected (subjects' preferences) that are available only at the decision point
  – Classification assumes every subject has the same utility function
  – It assumes this function coincides with the analyst's
• Formal decision analysis uses
  – optimum predictions using all available data
  – subject-specific utilities, which are often based on variables not predictive of the outcome
• ROC analysis is misleading except for the special case of mass one-time group decision making with unknowable utilities

See [15, 19, 43, 49, 50, 113].

The accuracy score used to drive model building should be a continuous score that utilizes all of the information in the data.

The Dichotomizing Motorist

• The speed limit is 60.
• I am going faster than the speed limit.
• Will I be caught?

An answer by a dichotomizer:

• Are you going faster than 70?

An answer from a better dichotomizer:

• If you are among other cars, are you going faster than 73?
• If you are exposed, are you going faster than 67?

Better:

• How fast are you going and are you exposed?

Analogy to most medical diagnosis research in which a +/− diagnosis is a false dichotomy of an underlying disease severity:

• The speed limit is moderately high.
• I am going fairly fast.
• Will I be caught?
1.4 Planning for Modeling

• Chance that the predictive model will be used [94]
• Response definition, follow-up
• Variable definitions
• Observer variability
• Missing data
• Preference for continuous variables
• Subjects
• Sites

What can keep a sample of data from being appropriate for modeling:

1. Most important predictor or response variables not collected
2. Subjects in the dataset are ill-defined or not representative of the population to which inferences are needed
3. Data collection sites do not represent the population of sites
4. Key variables missing in large numbers of subjects
5. Data not missing at random
6. No operational definitions for key variables and/or measurement errors severe
7. No observer variability studies done

What else can go wrong in modeling?

1. The process generating the data is not stable.
2. The model is misspecified with regard to nonlinearities or interactions, or there are predictors missing.
3. The model is misspecified in terms of the transformation of the response variable or the model's distributional assumptions.
4. The model contains discontinuities (e.g., by categorizing continuous predictors or fitting regression shapes with sudden changes) that can be gamed by users.
5. Correlations among subjects are not specified, or the correlation structure is misspecified, resulting in inefficient parameter estimates and overconfident inference.
6. The model is overfitted, resulting in predictions that are too extreme or positive associations that are false.
7. The user of the model relies on predictions obtained by extrapolating to combinations of predictor values well outside the range of the dataset used to develop the model.
8. Accurate and discriminating predictions can lead to behavior changes that make future predictions inaccurate.

Iezzoni [68] lists these dimensions to capture, for patient outcome studies:

1. age
2. sex
3. acute clinical stability
4. principal diagnosis
5. severity of principal diagnosis
6. extent and severity of comorbidities
7. physical functional status
8. psychological, cognitive, and psychosocial functioning
9. cultural, ethnic, and socioeconomic attributes and behaviors
10. health status and quality of life
11. patient attitudes and preferences for outcomes

General aspects to capture in the predictors:

1. baseline measurement of the response variable
2. current status
3. trajectory as of time zero, or past levels of a key variable
4. variables explaining much of the variation in the response
5. more subtle predictors whose distributions strongly differ between levels of the key variable of interest in an observational study
1.5 Choice of the Model

• In biostatistics and epidemiology and most other areas we usually choose the model empirically
• The model must use the data efficiently
• Should model overall structure (e.g., acute vs. chronic)
• Robust models are better
• Should have the correct mathematical structure (e.g., constraints on probabilities)
1.6 Model Uncertainty / Data-Driven Model Specification

• Standard errors, confidence limits, P-values, and R² are wrong if computed as if the model were pre-specified
• Stepwise variable selection is widely used and abused
• The bootstrap can be used to repeat all analysis steps to properly penalize variances, etc.
• Ye [125]: "generalized degrees of freedom" (GDF) for any "data mining" or model selection procedure based on least squares
  – Example: 20 candidate predictors, n = 22, forward stepwise, best 5-variable model: GDF = 14.1
  – Example: CART, 10 candidate predictors, n = 100, 19 nodes: GDF = 76
• See [79] for an approach involving adding noise to Y to improve variable selection
Chapter 2  General Aspects of Fitting Regression Models

2.1 Notation for Multivariable Regression Models

• Weighted sum of a set of independent or predictor variables
• Interpret parameters and state assumptions by linearizing the model with respect to the regression coefficients
• Analysis of variance setups, interaction effects, nonlinear effects
• Examining the 2 regression assumptions

Notation:

  Y            response (dependent) variable
  X            X1, X2, ..., Xp — list of predictors
  β            β0, β1, ..., βp — regression coefficients
  β0           intercept parameter (optional)
  β1, ..., βp  weights or regression coefficients
  Xβ           β0 + β1 X1 + ... + βp Xp, with X0 = 1

Model: the connection between X and Y.
C(Y|X): a property of the distribution of Y given X, e.g. C(Y|X) = E(Y|X) or Prob{Y = 1|X}.
2.2 Model Formulations

General regression model:

  C(Y|X) = g(X).

General linear regression model:

  C(Y|X) = g(Xβ).

Examples:

  C(Y|X) = E(Y|X) = Xβ, with Y|X ~ N(Xβ, σ^2)
  C(Y|X) = Prob{Y = 1|X} = (1 + exp(−Xβ))^−1

Linearize: h(C(Y|X)) = Xβ, where h(u) = g^−1(u).

Example:

  C(Y|X) = Prob{Y = 1|X} = (1 + exp(−Xβ))^−1
  h(u) = logit(u) = log(u / (1 − u))
  h(C(Y|X)) = C′(Y|X)   (the link)

General linear regression model: C′(Y|X) = Xβ.
2.3 Interpreting Model Parameters

Suppose that Xj is linear and does not interact with the other X's:

  C′(Y|X) = Xβ = β0 + β1 X1 + ... + βp Xp
  βj = C′(Y|X1, X2, ..., Xj + 1, ..., Xp) − C′(Y|X1, X2, ..., Xj, ..., Xp)

Drop the ′ from C′ and assume that C(Y|X) is the property of Y that is linearly related to the weighted sum of the X's.
2.3.1 Nominal Predictors

A nominal (polytomous) factor with k levels is coded with k − 1 dummy variables. E.g. T = J, K, L, M:

  C(Y|T = J) = β0
  C(Y|T = K) = β0 + β1
  C(Y|T = L) = β0 + β2
  C(Y|T = M) = β0 + β3

  C(Y|T) = Xβ = β0 + β1 X1 + β2 X2 + β3 X3,

where
  X1 = 1 if T = K, 0 otherwise
  X2 = 1 if T = L, 0 otherwise
  X3 = 1 if T = M, 0 otherwise.

The test for any differences in the property C(Y) between treatments is H0: β1 = β2 = β3 = 0.
2.3.2 Interactions

For X1 and X2, the effect of X1 on Y depends on the level of X2. One way to describe interaction is to add X3 = X1 X2 to the model:

  C(Y|X) = β0 + β1 X1 + β2 X2 + β3 X1 X2.

  C(Y|X1 + 1, X2) − C(Y|X1, X2)
    = β0 + β1 (X1 + 1) + β2 X2 + β3 (X1 + 1) X2
      − [β0 + β1 X1 + β2 X2 + β3 X1 X2]
    = β1 + β3 X2.

Effect of a one-unit increase in X2 on C(Y|X): β2 + β3 X1.

Worse interactions: if X1 is binary, the interaction may take the form of a difference in shape (and/or distribution) of X2 vs. C(Y) depending on whether X1 = 0 or X1 = 1 (e.g. logarithm vs. square root).
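The algebra above can be checked numerically. Below is a minimal sketch (simulated data with hypothetical variable names x1 and x2, not from the course notes) that fits a product interaction with rms and reports the estimated x1 effect at two settings of x2, i.e., β1 + β3 x2:

require(rms)
set.seed(1)
n  <- 200
x1 <- runif(n); x2 <- runif(n)
y  <- 1 + 2*x1 - x2 + 3*x1*x2 + rnorm(n)
dd <- datadist(x1, x2); options(datadist='dd')
f  <- ols(y ~ x1 * x2)            # adds the x1:x2 product term
summary(f, x1=c(0, 1), x2=0.2)    # effect of x1 going 0 -> 1, holding x2 = 0.2
summary(f, x1=c(0, 1), x2=0.8)    # same effect at x2 = 0.8
anova(f)                          # includes the interaction test H0: beta3 = 0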
2.3.3 Example: Inference for a Simple Model

We postulate the model C(Y|age, sex) = β0 + β1 age + β2 (sex = f) + β3 age (sex = f), where sex = f is a dummy indicator variable for sex = female, i.e., the reference cell is sex = male.(a)

The model assumes

1. age is linearly related to C(Y) for males,
2. age is linearly related to C(Y) for females,
3. the interaction between age and sex is simple, and
4. whatever distribution, variance, and independence assumptions are appropriate for the model being considered.

Interpretations of parameters:

  Parameter  Meaning
  β0         C(Y | age = 0, sex = m)
  β1         C(Y | age = x + 1, sex = m) − C(Y | age = x, sex = m)
  β2         C(Y | age = 0, sex = f) − C(Y | age = 0, sex = m)
  β3         C(Y | age = x + 1, sex = f) − C(Y | age = x, sex = f)
               − [C(Y | age = x + 1, sex = m) − C(Y | age = x, sex = m)]

β3 is the difference in slopes (female − male).

(a) You can also think of the last part of the model as being β3 X3, where X3 = age × I[sex = f].

When a high-order effect such as an interaction effect is in the model, be sure to interpret low-order effects by finding out what makes the interaction effect ignorable. In our example, the interaction effect is zero when age = 0 or sex is male.

Hypotheses that are usually inappropriate:

1. H0: β1 = 0: this tests whether age is associated with Y for males
2. H0: β2 = 0: this tests whether sex is associated with Y for zero-year-olds
More useful hypotheses follow. For any hypothesis one needs to

• Write what is being tested
• Translate to the parameters tested
• List the alternative hypothesis
• Not forget what the test is powered to detect
  – A test against a nonzero slope has maximum power when linearity holds
  – If the true relationship is monotonic, a test for non-flatness will have some, but not optimal, power
  – A test against a quadratic (parabolic) shape will have some power to detect a logarithmic shape, but not against a sine wave over many cycles
• It is useful to write e.g. "Ha: age is associated with C(Y), powered to detect a linear relationship"
Most Useful Tests for the Linear age × sex Model

  Null or Alternative Hypothesis                              Mathematical Statement
  Effect of age is independent of sex, or                     H0: β3 = 0
    effect of sex is independent of age, or
    age and sex are additive; age effects are parallel
  age interacts with sex;                                     Ha: β3 ≠ 0
    age modifies the effect of sex;
    sex modifies the effect of age;
    sex and age are non-additive (synergistic)
  age is not associated with Y                                H0: β1 = β3 = 0
  age is associated with Y;                                   Ha: β1 ≠ 0 or β3 ≠ 0
    age is associated with Y for either females or males
  sex is not associated with Y                                H0: β2 = β3 = 0
  sex is associated with Y;                                   Ha: β2 ≠ 0 or β3 ≠ 0
    sex is associated with Y for some value of age
  Neither age nor sex is associated with Y                    H0: β1 = β2 = β3 = 0
  Either age or sex is associated with Y                      Ha: β1 ≠ 0 or β2 ≠ 0 or β3 ≠ 0

Note: The last test is called the global test of no association. If an interaction effect is present, there is both an age and a sex effect. There can also be age or sex effects when the lines are parallel. The global test of association (test of total association) has 3 d.f. instead of 2 (age + sex) because it allows for unequal slopes.
2.4 Review of Composite (Chunk) Tests

• In the model

    y ~ age + sex + weight + waist + tricep

  we may want to jointly test the association between all body measurements and the response, holding age and sex constant.
• This 3 d.f. test may be obtained two ways:
  – Remove the 3 variables and compute the change in SSR or SSE
  – Test H0: β3 = β4 = β5 = 0 using matrix algebra (e.g., anova(fit, weight, waist, tricep) if fit is a fit object created by the R rms package)
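As a concrete illustration, here is a minimal sketch (simulated data; the variable names mirror the formula above but the data are invented) of the second approach; the printed anova output includes the combined 3 d.f. chunk test for weight, waist, and tricep:

require(rms)
set.seed(2)
n <- 300
d <- data.frame(age    = rnorm(n, 50, 10),
                sex    = factor(sample(c('m','f'), n, TRUE)),
                weight = rnorm(n, 80, 15),
                waist  = rnorm(n, 90, 12),
                tricep = rnorm(n, 20, 5))
d$y <- with(d, 0.03*age + 0.02*weight + rnorm(n))
dd  <- datadist(d); options(datadist='dd')
fit <- ols(y ~ age + sex + weight + waist + tricep, data=d)
anova(fit, weight, waist, tricep)   # joint (chunk) test of the 3 coefficients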
2.5 Relaxing Linearity Assumption for Continuous Predictors

2.5.1 Avoiding Categorization

• Relationships are seldom linear, except when predicting one variable from itself measured earlier
• Categorizing continuous predictors into intervals is a disaster [1, 2, 4, 8, 21, 44, 46, 64, 66, 73, 81, 84, 93, 96, 99, 108, 115]
• Some problems caused by this approach:
  1. Estimated values have reduced precision, and associated tests have reduced power
  2. Categorization assumes the relationship between predictor and response is flat within intervals; this is far less reasonable than a linearity assumption in most cases
  3. To make a continuous predictor be more accurately modeled when categorization is used, multiple intervals are required
  4. Because of sample size limitations in the very low and very high range of the variable, the outer intervals (e.g., outer quintiles) will be wide, resulting in significant heterogeneity of subjects within those intervals, and residual confounding
  5. Categorization assumes that there is a discontinuity in response as interval boundaries are crossed. Other than the effect of time (e.g., an instant stock price drop after bad news), there are very few examples in which such discontinuities have been shown to exist.
  6. Categorization only seems to yield interpretable estimates. E.g. the odds ratio for stroke for persons with a systolic blood pressure > 160 mmHg compared to persons with a blood pressure ≤ 160 mmHg → the interpretation of the OR depends on the distribution of blood pressures in the sample (the proportion of subjects > 170, > 180, etc.). If blood pressure is modeled as a continuous variable (e.g., using a regression spline, quadratic, or linear effect) one can estimate the ratio of odds for exact settings of the predictor, e.g., the odds ratio for 200 mmHg compared to 120 mmHg.
  7. Categorization does not condition on full information. When, for example, the risk of stroke is being assessed for a new subject with a known blood pressure (say 162 mmHg), the subject does not report to her physician "my blood pressure exceeds 160" but rather reports 162 mmHg. The risk for this subject will be much lower than that of a subject with a blood pressure of 200 mmHg.
  8. If cutpoints are determined in a way that is not blinded to the response variable, calculation of P-values and confidence intervals requires special simulation techniques; ordinary inferential methods are completely invalid. E.g.: cutpoints chosen by trial and error utilizing Y, even informally → P-values too small and confidence limits not accurate.(b)
  9. Categorization not blinded to Y → biased effect estimates [4, 99]
  10. "Optimal" cutpoints do not replicate over studies. Hollander et al. [66] state that "...the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the 'optimal' cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were included with 12 different cutpoints... Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology."
  11. Disagreements in cutpoints (which are bound to happen whenever one searches for things that do not exist) cause severe interpretation problems. One study may provide an odds ratio for comparing body mass index (BMI) > 30 with BMI ≤ 30, another for comparing BMI > 28 with BMI ≤ 28. Neither of these has a good definition and the two estimates are not comparable.
  12. Cutpoints are arbitrary and manipulatable; cutpoints can be found that can result in both positive and negative associations [115].
  13. If a confounder is adjusted for by categorization, there will be residual confounding that can be explained away by inclusion of the continuous form of the predictor in the model in addition to the categories.
• To summarize: the use of a (single) cutpoint c makes many assumptions, including:
  1. The relationship between X and Y is discontinuous at X = c and only at X = c
  2. c is correctly found as the cutpoint
  3. X vs. Y is flat to the left of c
  4. X vs. Y is flat to the right of c
  5. The choice of c does not depend on the values of other predictors

(b) If a cutpoint is chosen that minimizes the P-value and the resulting P-value is 0.05, the true type I error can easily be above 0.5 [66].
2.5.2 Simple Nonlinear Terms

  C(Y|X1) = β0 + β1 X1 + β2 X1^2.

• H0: model is linear in X1 vs. Ha: model is quadratic in X1 ≡ H0: β2 = 0.
• The test of linearity may be powerful if the true model is not extremely non-parabolic
• Predictions are not accurate in general, as many phenomena are non-quadratic
• Can get more flexible fits by adding powers higher than 2
• But polynomials do not adequately fit logarithmic functions or "threshold" effects, and have unwanted peaks and valleys.
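A minimal sketch (simulated data; the variable name x1 and the logarithmic true shape are illustrative assumptions) of the quadratic fit and the 1 d.f. test of linearity H0: β2 = 0, using pol() from rms:

require(rms)
set.seed(3)
x1 <- runif(200, 0, 10)
y  <- log(x1 + 1) + rnorm(200, sd=0.3)   # true shape is logarithmic
dd <- datadist(x1); options(datadist='dd')
f  <- ols(y ~ pol(x1, 2))   # fits beta1*x1 + beta2*x1^2
anova(f)                    # the "Nonlinear" row is the test of the quadratic term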
2.5.3 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations

Draftsman's spline: a flexible strip of metal or rubber used to trace curves.

Spline function: a piecewise polynomial.

Linear spline function: a piecewise linear function.

• Bilinear regression: the model is β0 + β1 X if X ≤ a, and β2 + β3 X if X > a.
• Problem with this notation: the two lines are not constrained to join
• To force simple continuity: β0 + β1 X + β2 (X − a) × I[X > a] = β0 + β1 X1 + β2 X2, where X2 = (X1 − a) × I[X1 > a].
• The slope is β1 for X ≤ a, and β1 + β2 for X > a.
• β2 is the slope increment as you pass a

More generally: the X-axis is divided into intervals with endpoints a, b, c (the knots), and

  f(X) = β0 + β1 X + β2 (X − a)_+ + β3 (X − b)_+ + β4 (X − c)_+,

where (u)_+ = u if u > 0, and 0 if u ≤ 0. Thus

  f(X) = β0 + β1 X                                          X ≤ a
       = β0 + β1 X + β2 (X − a)                              a < X ≤ b
       = β0 + β1 X + β2 (X − a) + β3 (X − b)                 b < X ≤ c
       = β0 + β1 X + β2 (X − a) + β3 (X − b) + β4 (X − c)    c < X.

[Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5.]

  C(Y|X) = f(X) = Xβ,

where Xβ = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4, and

  X1 = X
  X2 = (X − a)_+
  X3 = (X − b)_+
  X4 = (X − c)_+.

Overall linearity in X can be tested by testing H0: β2 = β3 = β4 = 0.
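A minimal sketch (simulated data; knot locations and variable names are illustrative assumptions) of fitting a linear spline with rms::lsp and testing overall linearity:

require(rms)
set.seed(4)
x <- runif(300, 0, 6)
y <- pmin(x, 3) + rnorm(300, sd=0.5)      # true shape: rises, then flattens
dd <- datadist(x); options(datadist='dd')
f <- ols(y ~ lsp(x, c(1, 3, 5)))          # linear spline, knots at 1, 3, 5
anova(f)    # "Nonlinear" row tests H0: all (X - knot)_+ coefficients = 0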
2.5.4 Cubic Spline Functions

Cubic splines are smooth at the knots (the function and its first and second derivatives agree) — you can't see the joins.

  f(X) = β0 + β1 X + β2 X^2 + β3 X^3
         + β4 (X − a)_+^3 + β5 (X − b)_+^3 + β6 (X − c)_+^3
       = Xβ

with

  X1 = X, X2 = X^2, X3 = X^3,
  X4 = (X − a)_+^3, X5 = (X − b)_+^3, X6 = (X − c)_+^3.

k knots → k + 3 coefficients excluding the intercept.

The X^2 and X^3 terms must be included to allow nonlinearity when X < a.
2.5.5 Restricted Cubic Splines

Stone and Koo [107]: cubic splines are poorly behaved in the tails. Constrain the function to be linear in the tails.

k + 3 → k − 1 parameters [37].

To force linearity when X < a: the X^2 and X^3 terms must be omitted.
To force linearity when X > the last knot: the last two βs are redundant, i.e., they are just combinations of the other βs.

The restricted spline function with k knots t1, ..., tk is given by [37]

  f(X) = β0 + β1 X1 + β2 X2 + ... + β(k−1) X(k−1),

where X1 = X and, for j = 1, ..., k − 2,

  X(j+1) = (X − tj)_+^3 − (X − t(k−1))_+^3 (tk − tj)/(tk − t(k−1))
           + (X − tk)_+^3 (t(k−1) − tj)/(tk − t(k−1)).

Xj is linear in X for X ≥ tk.

require(Hmisc)
x <- rcspline.eval(seq(0, 1, .01),
                   knots=seq(.05, .95, length=5), inclx=TRUE)
xm <- x
xm[xm > .0106] <- NA
matplot(x[,1], xm, type="l", ylim=c(0, .01),
        xlab=expression(X), ylab='', lty=1)
matplot(x[,1], x, type="l",
        xlab=expression(X), ylab='', lty=1)

[Figure 2.2: Restricted cubic spline component variables for k = 5 and knots at X = .05, .275, .5, .725, and .95. The left panel is a y-magnification of the right panel. Fitted functions such as those in Figure 2.3 will be linear combinations of these basis functions as long as knots are at the same locations used here.]

x <- seq(0, 1, length=300)
for(nk in 3:6) {
  set.seed(nk)
  knots <- seq(.05, .95, length=nk)
  xx <- rcspline.eval(x, knots=knots, inclx=TRUE)
  for(i in 1:(nk-1))
    xx[,i] <- (xx[,i] - min(xx[,i])) / (max(xx[,i]) - min(xx[,i]))
  for(i in 1:20) {
    beta  <- 2*runif(nk-1) - 1
    xbeta <- xx %*% beta + 2*runif(1) - 1
    xbeta <- (xbeta - min(xbeta)) / (max(xbeta) - min(xbeta))
    if(i == 1) {
      plot(x, xbeta, type="l", lty=1,
           xlab=expression(X), ylab='', bty="l")
      title(sub=paste(nk, "knots"), adj=0, cex=.75)
      for(j in 1:nk)
        arrows(knots[j], .04, knots[j], -.03,
               angle=20, length=.07, lwd=1.5)
    } else lines(x, xbeta, col=i)
  }
}

[Figure 2.3: Some typical restricted cubic spline functions for k = 3, 4, 5, 6. The y-axis is Xβ. Arrows indicate knots. These curves were derived by randomly choosing values of β subject to standard deviations of fitted functions being normalized.]

Once β0, ..., β(k−1) are estimated, the restricted cubic spline can be restated in the form

  f(X) = β0 + β1 X + β2 (X − t1)_+^3 + β3 (X − t2)_+^3 + ... + β(k+1) (X − tk)_+^3

by computing

  βk     = [β2 (t1 − tk) + β3 (t2 − tk) + ... + β(k−1) (t(k−2) − tk)] / (tk − t(k−1))
  β(k+1) = [β2 (t1 − t(k−1)) + β3 (t2 − t(k−1)) + ... + β(k−1) (t(k−2) − t(k−1))] / (t(k−1) − tk).

A test of linearity in X can be obtained by testing H0: β2 = β3 = ... = β(k−1) = 0.
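A minimal sketch (simulated data; the variable names and k = 5 are illustrative assumptions) of fitting a restricted cubic spline with rms::rcs and obtaining the pooled test of linearity:

require(rms)
set.seed(5)
x <- runif(400)
y <- sin(2*pi*x) + rnorm(400, sd=0.4)
dd <- datadist(x); options(datadist='dd')
f <- ols(y ~ rcs(x, 5))   # k = 5 knots at default quantiles
anova(f)                  # "Nonlinear" row: H0 beta2 = ... = beta(k-1) = 0
plot(Predict(f, x))       # plot the estimated transformation of x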
2.5.6 Choosing Number and Position of Knots

• Knots are specified in advance in regression splines
• Locations are not important in most situations [39, 106]
• Place knots where data exist — at fixed quantiles of the predictor's marginal distribution
• The fit depends more on the choice of k

  k   Quantiles
  3   .10  .5    .90
  4   .05  .35   .65   .95
  5   .05  .275  .5    .725  .95
  6   .05  .23   .41   .59   .77    .95
  7   .025 .1833 .3417 .5    .6583  .8167  .975

For n < 100, replace the outer quantiles with the 5th smallest and 5th largest X [107].

Choice of k:

• Flexibility of fit vs. n and variance
• Usually k = 3, 4, 5. Often k = 4
• Large n (e.g. n ≥ 100) — k = 5
• Small n (< 30, say) — k = 3
• Can use Akaike's information criterion (AIC) [5, 111] to choose k
• This chooses k to maximize the model likelihood ratio χ² − 2k.

See [51] for a comparison of restricted cubic splines, fractional polynomials, and penalized splines.
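A minimal sketch (simulated data; the stats element name 'Model L.R.' in the fit object is an assumption about rms fits) of choosing k by maximizing the model likelihood ratio χ² − 2k:

require(rms)
set.seed(6)
x <- runif(500)
y <- (x - 0.3)^2 + rnorm(500, sd=0.1)
for(k in 3:7) {
  f  <- ols(y ~ rcs(x, k))
  lr <- f$stats['Model L.R.']          # model likelihood ratio chi-square
  cat('k =', k, ' LR chi2 - 2k =', round(lr - 2*k, 1), '\n')
}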
2.5.7 Nonparametric Regression

• Estimate the tendency (mean or median) of Y as a function of X
• Few assumptions
• Especially handy when there is a single X
• The plotted trend line may be the final result of the analysis
• Simplest smoother: the moving average

    X:  1    2    3    5     8
    Y:  2.1  3.8  5.7  11.1  17.2

  Ê(Y | X = 2) = (2.1 + 3.8 + 5.7)/3
  Ê(Y | X = (2 + 3 + 5)/3) = (3.8 + 5.7 + 11.1)/3

  – overlapping windows are OK
  – problem in estimating E(Y) at the outer X-values
  – estimates are very sensitive to bin width
• Moving linear regression is far superior to the moving average (a moving flat line)
• Cleveland's [27] moving linear regression smoother loess (locally weighted least squares) is the most popular smoother. To estimate the central tendency of Y at X = x:
  – take all the data having X values within a suitable interval about x (the default is 2/3 of the data)
  – fit a weighted least squares linear regression within this neighborhood
  – points near x are given the most weight(c)
  – points near the extremes of the interval receive almost no weight
  – loess works much better at the extremes of X than the moving average
  – it provides an estimate at each observed X; other estimates are obtained by linear interpolation
  – an outlier rejection algorithm is built in
• loess works great for binary Y — just turn off outlier detection
• Other popular smoother: Friedman's "super smoother"
• For loess or supsmu the amount of smoothing can be controlled by the analyst
• Another alternative: smoothing splines(d)
• Smoothers are very useful for estimating trends in residual plots

(c) Weight here means something different than a regression coefficient. It means how much a point is emphasized in developing the regression coefficients.
(d) These place knots at all the observed data points but penalize coefficient estimates towards smoothness.
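A minimal sketch (simulated data, hypothetical variable names) of a lowess trend for a binary Y with the outlier-rejection iterations turned off:

set.seed(7)
age <- runif(500, 20, 80)
y   <- rbinom(500, 1, plogis(-5 + 0.08*age))   # binary response
plot(age, y)
lines(lowess(age, y, iter=0), lwd=2)           # iter=0: no outlier rejection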
2.5.8 Advantages of Regression Splines over Other Methods

Regression splines have several advantages [60]:

• Parametric splines can be fitted using any existing regression program
• Regression coefficients are estimated using standard techniques (ML or least squares); formal tests of no overall association, linearity, and additivity, and confidence limits for the estimated regression function, are derived by standard theory.
• The fitted function directly estimates the transformation a predictor should receive to yield linearity in C(Y|X).
• Even when a simple transformation is obvious, the spline function can be used to represent the predictor in the final model (and the d.f. will be correct). Nonparametric methods do not yield a prediction equation.
• Extension to non-additive models. Multi-dimensional nonparametric estimators often require burdensome computations.
2.6 Recursive Partitioning: Tree-Based Models

Breiman, Friedman, Olshen, and Stone [18]: CART (Classification and Regression Trees) — essentially model-free.

Method:

• Find the predictor so that the best possible binary split has the maximum value of some statistic for comparing 2 groups
• Within previously formed subsets, find the best predictor and split maximizing the criterion in the subset
• Proceed in like fashion until < k obs. remain to split
• Summarize Y for the terminal node (e.g., mean, modal category)
• Prune the tree backward until it cross-validates as well as its "apparent" accuracy, or use shrinkage

Advantages/disadvantages of recursive partitioning:

• Does not require a functional form for predictors
• Does not assume additivity — can identify complex interactions
• Can deal with missing data flexibly
• Interactions detected are frequently spurious
• Does not use continuous predictors effectively
• Penalty for overfitting in 3 directions
• Often the tree doesn't cross-validate optimally unless pruned back very conservatively
• Very useful in messy situations or those in which overfitting is not as problematic (confounder adjustment using propensity scores [28]; missing value imputation)

See [7].
2.7 New Directions in Predictive Modeling

The approaches recommended in this course are

• fitting fully pre-specified models without deletion of "insignificant" predictors
• using data reduction methods (masked to Y) to reduce the dimensionality of the predictors and then fitting the number of parameters the data's information content can support
• using shrinkage (penalized estimation) to fit a large model without worrying about the sample size.

The data reduction approach can yield very interpretable, stable models, but there are many decisions to be made when using a two-stage (reduction/model fitting) approach. Newer approaches are evolving, including the following; these handle continuous predictors well, unlike recursive partitioning.

• lasso (shrinkage using the L1 norm, favoring zero regression coefficients) [105, 110]
• elastic net (a combination of the L1 and L2 norms that handles the p > n case better than the lasso) [129]
• adaptive lasso [116, 127]
• a more flexible lasso that differentially penalizes for variable selection and for regression coefficient estimation [92]
• group lasso, to force selection of all or none of a group of related variables (e.g., dummy variables representing a polytomous predictor)
• group lasso-like procedures that also allow for variables within a group to be removed [117]
• adaptive group lasso (Wang & Leng)
• Breiman's nonnegative garrote [124]
• "preconditioning", i.e., model simplification after developing a "black box" predictive model [87]
• sparse principal components analysis to achieve parsimony in data reduction [77, 78, 121, 128]
• bagging, boosting, and random forests [62]

One problem prevents most of these methods from being ready for everyday use: they require scaling predictors before fitting the model. When a predictor is represented by nonlinear basis functions, the scaling recommendations in the literature are not sensible. There are also computational issues and difficulties obtaining hypothesis tests and confidence intervals.

When data reduction is not required, generalized additive models [63, 122] should also be considered.
2.8 Multiple Degree of Freedom Tests of Association

In the model

  C(Y|X) = β0 + β1 X1 + β2 X2 + β3 X2^2,

test H0: β2 = β3 = 0 with 2 d.f. to assess the association between X2 and the outcome.

In the 5-knot restricted cubic spline model

  C(Y|X) = β0 + β1 X + β2 X′ + β3 X′′ + β4 X′′′,

test H0: β1 = ... = β4 = 0.

• Test of association: 4 d.f.
• If insignificant → dangerous to interpret the plot
• What to do if the 4 d.f. test is insignificant, the 3 d.f. test for linearity is insignificant, and the 1 d.f. test is significant after deleting the nonlinear terms?

Grambsch and O'Brien [52] elegantly described the hazards of pretesting:

• Studied quadratic regression
• Showed that the 2 d.f. test of association is nearly optimal even when the regression is linear, if nonlinearity is entertained
• Considered the ordinary regression model E(Y|X) = β0 + β1 X + β2 X^2
• Two ways to test the association between X and Y
• Fit the quadratic model and test for linearity (H0: β2 = 0)
• If the F-test for linearity is significant at the α = 0.05 level → report as the final test of association the 2 d.f. F test of H0: β1 = β2 = 0
• If the test of linearity is insignificant, refit without the quadratic term; the final test of association is the 1 d.f. test, H0: β1 = 0 | β2 = 0
• Showed that the type I error of this two-stage procedure is > α
• A fairly accurate P-value is obtained by instead testing against F with 2 d.f. even at the second stage
• Cause: we are retaining the most significant part of F
• But testing against 2 d.f. can only lose power when compared with the original F test for both βs
• SSR from the quadratic model > SSR from the linear model
2.9 Assessment of Model Fit

2.9.1 Regression Assumptions

The general linear regression model is

  C(Y|X) = Xβ = β0 + β1 X1 + β2 X2 + ... + βk Xk.

Verify linearity and additivity. Special case:

  C(Y|X) = β0 + β1 X1 + β2 X2,

where X1 is binary and X2 is continuous.

[Figure 2.4: Regression assumptions for one binary and one continuous predictor: C(Y) plotted against X2 as two parallel lines, one for X1 = 0 and one for X1 = 1.]

Methods for checking fit:

1. Fit a simple linear additive model and examine residual plots for patterns
   • For OLS: box plots of e stratified by X1, scatterplots of e vs. X2 and Y, with trend curves (want flat central tendency, constant variability)
   • For normality, qqnorm plots of overall and stratified residuals
   Advantage: simplicity
   Disadvantages:
   • Can only compute standard residuals for an uncensored continuous response
   • Subjective judgment of non-randomness
   • Hard to handle interaction
   • Hard to see patterns with large n (trend lines help)
   • Seeing patterns does not lead to corrective action
2. Scatterplot of Y vs. X2 using different symbols according to values of X1
   Advantages: simplicity, can see interaction
   Disadvantages:
   • Scatterplots cannot be drawn for binary, categorical, or censored Y
   • Patterns are difficult to see if relationships are weak or n is large
3. Stratify the sample by X1 and quantile groups (e.g. deciles) of X2; estimate C(Y|X1, X2) for each stratum
   Advantages: simplicity, can see interactions, handles censored Y (if you are careful)
   Disadvantages:
   • Requires large n
   • Does not use the continuous variable effectively (no interpolation)
   • Subgroup estimates have low precision
   • Dependent on binning method
4. Separately for levels of X1, fit a nonparametric smoother relating X2 to Y
   Advantages: all regression aspects of the model can be summarized efficiently with minimal assumptions
   Disadvantages:
   • Does not apply to censored Y
   • Hard to deal with multiple predictors
5. Fit a flexible nonlinear parametric model
   Advantages:
   • One framework for examining the model assumptions, fitting the model, and drawing formal inference
   • d.f. defined and all aspects of statistical inference "work as advertised"
   Disadvantages:
   • Complexity
   • Generally difficult to allow for interactions when assessing patterns of effects

Confidence limits and formal inference can be problematic for methods 1-4.

The restricted cubic spline works well for method 5:

  C(Y|X) = β0 + β1 X1 + β2 X2 + β3 X2′ + β4 X2′′
         = β0 + β1 X1 + f(X2),

where f(X2) = β2 X2 + β3 X2′ + β4 X2′′ is the spline-estimated transformation of X2.

• Plot f(X2) vs. X2
• n large → can fit separate functions by X1
• Test of linearity: H0: β3 = β4 = 0
• Nonlinear → use the transformation suggested by the spline fit, or keep the spline terms
• Tentative transformation g(X2) → check adequacy by expanding g(X2) in a spline function and testing linearity
• Can find transformations by plotting g(X2) vs. f(X2) for a variety of g
• Multiple continuous predictors → expand each using splines
• Example: assess linearity of X2 and X3:

  C(Y|X) = β0 + β1 X1 + β2 X2 + β3 X2′ + β4 X2′′ + β5 X3 + β6 X3′ + β7 X3′′

  Overall test of linearity: H0: β3 = β4 = β6 = β7 = 0, with 4 d.f.
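A minimal sketch (simulated data, hypothetical variable names) of method 5: a binary X1 plus a restricted cubic spline in X2, plotting the estimated f(X2) and testing linearity:

require(rms)
set.seed(8)
x1 <- sample(0:1, 300, TRUE)
x2 <- runif(300, 0, 10)
y  <- x1 + log(x2 + 1) + rnorm(300, sd=0.4)
dd <- datadist(x1, x2); options(datadist='dd')
f <- ols(y ~ x1 + rcs(x2, 4))
plot(Predict(f, x2))   # estimated spline transformation f(X2)
anova(f)               # the X2 "Nonlinear" row tests H0: beta3 = beta4 = 0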
2.9.2 Modeling and Testing Complex Interactions

With X1 binary or linear and X2 continuous:

  C(Y|X) = β0 + β1 X1 + β2 X2 + β3 X2′ + β4 X2′′
           + β5 X1 X2 + β6 X1 X2′ + β7 X1 X2′′

Simultaneous test of linearity and additivity: H0: β3 = ... = β7 = 0.

• 2 continuous variables: could transform separately and form a simple product
• Transformations depend on whether the interaction terms are adjusted for
• Fit interactions of the form X1 f(X2) and X2 g(X1):

  C(Y|X) = β0 + β1 X1 + β2 X1′ + β3 X1′′
           + β4 X2 + β5 X2′ + β6 X2′′
           + β7 X1 X2 + β8 X1 X2′ + β9 X1 X2′′
           + β10 X2 X1′ + β11 X2 X1′′

• Test of additivity is H0: β7 = β8 = ... = β11 = 0, with 5 d.f.
• Test of lack of fit for the simple product interaction with X2 is H0: β8 = β9 = 0
• Test of lack of fit for the simple product interaction with X1 is H0: β10 = β11 = 0

General spline surface:

• Cover the X1 × X2 plane with a grid and fit a patch-wise cubic polynomial in two variables
• Restrict it to be of the form aX1 + bX2 + cX1X2 in the corners
• Uses all (k − 1)² cross-products of restricted cubic spline terms
• See Gray [53, 54, Section 3.2] for penalized splines allowing control of the effective degrees of freedom. See Berhane et al. [12] for a good discussion of tensor splines.

Other issues:

• Y non-censored (especially continuous) → multi-dimensional scatterplot smoother [22]
• Interactions of order > 2: more trouble
• 2-way interactions among p predictors: pooled tests
• p tests, each with p − 1 d.f.

Some types of interactions to pre-specify in clinical studies (see the sketch after this list):

• Treatment × severity of disease being treated
• Age × risk factors
• Age × type of disease
• Measurement × state of a subject during measurement
• Race × disease
• Calendar time × treatment
• Quality × quantity of a symptom
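A minimal sketch (simulated data, hypothetical variable names) of a binary X1 interacting with a spline in X2, giving the structure X1 × f(X2); anova() reports the interaction and nonlinearity chunk tests (rms also provides the %ia% operator for restricted interactions):

require(rms)
set.seed(9)
x1 <- sample(0:1, 400, TRUE)
x2 <- runif(400, 0, 10)
y  <- ifelse(x1 == 1, sqrt(x2), log(x2 + 1)) + rnorm(400, sd=0.3)
dd <- datadist(x1, x2); options(datadist='dd')
f <- ols(y ~ x1 * rcs(x2, 4))
anova(f)   # rows for the x1*x2 interaction, its nonlinear part, and total nonlinearity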
2.9.3 Fitting Ordinal Predictors

• Small number of categories (3-4) → treat as a polytomous factor with dummy variables
• Design matrix for an easy test of adequacy of initial codes → the k original codes plus k − 2 dummies
• More categories → score using a data-driven trend. Later tests then use k − 1 d.f. instead of 1 d.f.
• E.g., compute logit(mortality) vs. category
2.9.4 Distributional Assumptions

• Some models (e.g., logistic): all assumptions are in C(Y|X) = Xβ (implicitly assuming no omitted variables!)
• Linear regression: Y ~ Xβ + ε, ε ~ N(0, σ^2)
• Examine the distribution of residuals
• Some models (Weibull, Cox [31]):
    C(Y|X) = C(Y = y|X) = d(y) + Xβ, with C = log hazard
• Check the form of d(y)
• Show that d(y) does not interact with X
Chapter 3  Multivariable Modeling Strategies

• "Spending d.f.": examining or fitting parameters in models, or examining tables or graphs that utilize Y to tell you how to model variables
• If you wish to preserve statistical properties, you can't retrieve d.f. once they are "spent" (see Grambsch & O'Brien)
• If a scatterplot suggests linearity and you fit a linear model, how many d.f. did you actually spend (i.e., the d.f. that when put into a formula results in accurate confidence limits or P-values)?
• Decide the number of d.f. that can be spent
• Decide where to spend them
• Spend them
3.1 Prespecification of Predictor Complexity Without Later Simplification

• Rarely expect linearity
• Can't always use graphs or other devices to choose the transformation
• If you select from among many transformations, the results are biased
• Need to allow flexible nonlinearity for potentially strong predictors not known to predict linearly
• Once you decide a predictor is "in", you can choose the number of parameters to devote to it using a general association index with Y
• Need a measure of "potential predictive punch"
• The measure needs to mask the analyst to the true form of the regression, to preserve statistical properties
3.1.1 Learning From a Saturated Model

When the effective sample size available is sufficiently large so that a saturated main effects model may be fitted, a good approach to gauging predictive potential is the following.

• Let all continuous predictors be represented as restricted cubic splines with k knots, where k is the maximum number of knots the analyst entertains for the current problem.
• Let all categorical predictors retain their original categories except for pooling of very low prevalence categories (e.g., ones containing < 6 observations).
• Fit this general main effects model.
• Compute the partial χ² statistic for testing the association of each predictor with the response, adjusted for all other predictors. In the case of ordinary regression, convert partial F statistics to χ² statistics or partial R² values.
• Make corrections for chance associations to "level the playing field" for predictors having greatly varying d.f., e.g., subtract the d.f. from the partial χ² (the expected value of a χ²_p statistic is p under H0).
• Make certain that tests of nonlinearity are not revealed, as this would bias the analyst.
• Sort the partial association statistics in descending order.

Commands in the rms package can be used to plot only what is needed. Here is an example for a logistic model.

f <- lrm(y ~ sex + race + rcs(age,5) + rcs(weight,5) +
         rcs(height,5) + rcs(blood.pressure,5))
plot(anova(f))
3.1.2 Using Marginal Generalized Rank Correlations

When collinearities or confounding are not problematic, a quicker approach based on pairwise measures of association can be useful. This approach will not have numerical problems (e.g., a singular covariance matrix) and is based on:

• a 2 d.f. generalization of Spearman ρ — the R² based on rank(X) and rank(X)² vs. rank(Y)
• ρ² can detect U-shaped relationships
• For categorical X, ρ² is the R² from dummy variables regressed against rank(Y); this is tightly related to the Wilcoxon–Mann–Whitney–Kruskal–Wallis rank test for group differences(a)
• Sort variables by descending order of ρ²
• Specify the number of knots for continuous X, and combine infrequent categories of categorical X, based on ρ²

Allocating d.f. based on partial tests of association or on sorted ρ² is a fair procedure because

• We already decided to keep the variable in the model no matter what ρ² or χ² values are seen
• ρ² and χ² do not reveal the degree of nonlinearity; a high value may be due solely to a strong linear effect
• A low ρ² or χ² for a categorical variable might lead to collapsing the most disparate categories

Initial simulations show the procedure to be conservative. Note that one can move from simpler to more complex models but not the other way round.

(a) This test statistic does not inform the analyst of which groups are different from one another.
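A minimal sketch (simulated data, hypothetical variable names) of computing and plotting the 2 d.f. generalized Spearman ρ² with Hmisc::spearman2:

require(Hmisc)
set.seed(10)
n  <- 300
x1 <- runif(n); x2 <- runif(n)
x3 <- factor(sample(letters[1:4], n, TRUE))
y  <- (x1 - 0.5)^2 + 0.3*x2 + rnorm(n, sd=0.2)   # U-shaped in x1
s  <- spearman2(y ~ x1 + x2 + x3, p=2)           # p=2: rank(X) and rank(X)^2
plot(s)   # sorted adjusted rho^2; use this to allocate knots/d.f.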
3.2 Checking Assumptions of Multiple Predictors Simultaneously

• Sometimes failure to adjust for other variables gives the wrong transformation of an X, or the wrong significance of interactions
• Sometimes it is unwieldy to deal simultaneously with all predictors at each stage → assess regression assumptions separately for each predictor
3.3 Variable Selection

• Series of potential predictors with no prior knowledge
• ↑ exploration → ↑ shrinkage (overfitting)
• Summary of the problem: E(β̂ | β̂ "significant") ≠ β [24]
• Biased R², β̂, and standard errors; P-values too small
• F and χ² statistics do not have the claimed distribution [52]
• Will result in residual confounding if variable selection is used to find confounders [56]
• Derksen and Keselman [36] found that in stepwise analyses the final model represented noise 0.20–0.74 of the time, and the final model usually contained fewer than 1/2 of the actual number of authentic predictors. Also:
  1. "The degree of correlation between the predictor variables affected the frequency with which authentic predictor variables found their way into the final model.
  2. The number of candidate predictor variables affected the number of noise variables that gained entry to the model.
  3. The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model.
  4. The population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted by the total number of candidate predictor variables rather than the number of variables in the final model."
• Global test with p d.f. insignificant → stop

Variable selection methods [57]:

• Forward selection, backward elimination
• Stopping rule: "residual χ²" with d.f. = number of candidates remaining at the current step
• Test for significance, or use Akaike's information criterion (AIC [5]), here χ² − 2 × d.f.
• Better to use subject matter knowledge!
• No currently available stopping rule was developed for stepwise selection; they were developed only for comparing 2 pre-specified models [16, Section 1.3]
• Roecker [95] studied forward selection (FS), all possible subsets selection (APS), and full fits
• APS is more likely to select smaller, less accurate models than FS
• Neither is as accurate as the full model fit unless more than 1/2 of the candidate variables are redundant or unnecessary
• Step-down is usually better than forward selection [80] and can be used efficiently with maximum likelihood estimation [74]
• It is fruitless to try different stepwise methods to look for agreement [120]
• The bootstrap can help decide between the full and a reduced model
• Full model fits give meaningful confidence intervals with standard formulas; confidence intervals after stepwise selection do not [3, 16, 67]
• Data reduction (grouping variables) can help
• Using the bootstrap to select important variables for inclusion in the final model [98] is problematic [6]
• It is not logical that a population regression coefficient would be exactly zero just because its estimate was "insignificant"
3.4 Overfitting and Limits on Number of Predictors

• Concerned with avoiding overfitting
• Assume the typical problem in medicine, epidemiology, and the social sciences, in which the signal:noise ratio is small (higher ratios allow for more aggressive modeling)
• p should be < m/15 [58, 59, 88, 89, 101, 114]
• p = number of parameters in the full model, or the number of candidate parameters in a stepwise analysis

Table 3.1: Limiting Sample Sizes for Various Response Variables

  Type of Response Variable    Limiting Sample Size m
  Continuous                   n (total sample size)
  Binary                       min(n1, n2) (a)
  Ordinal (k categories)       n − (1/n²) Σ_{i=1}^{k} n_i³ (b)
  Failure (survival) time      number of failures (c)

• A narrowly distributed predictor → even higher n needed
• p includes all variables screened for association with the response, including interactions
• Univariable screening (graphs, crosstabs, etc.) in no way reduces the multiple comparison problems of model building [109]

(a) If one considers the power of a two-sample binomial test compared with a Wilcoxon test if the response could be made continuous and the proportional odds assumption holds, the effective sample size for a binary response is 3 n1 n2 / n ≈ 3 min(n1, n2) if n1/n is near 0 or 1 [119, Eq. 10, 15]. Here n1 and n2 are the marginal frequencies of the two response levels [89].
(b) Based on the power of a proportional odds model two-sample test when the marginal cell sizes for the response are n1, ..., nk, compared with all cell sizes equal to unity (response is continuous) [119, Eq. 3]. If all cell sizes are equal, the relative efficiency of having k response categories compared to a continuous response is 1 − 1/k² [119, Eq. 14]; e.g., a 5-level response is almost as efficient as a continuous one if proportional odds holds across category cutoffs.
(c) This is approximate, as the effective sample size may sometimes be boosted somewhat by censored observations, especially for non-proportional hazards methods such as Wilcoxon-type tests [11].
3.5 Shrinkage

• Slope of the calibration plot; regression to the mean
• Statistical estimation procedure — "pre-shrunk" models
• Aren't regression coefficients OK because they're unbiased?
• The problem is in how we use the coefficient estimates
• Consider 20 samples of size n = 50 from U(0, 1)
• Compute the group means and plot them in ascending order
• This is equivalent to fitting an intercept and 19 dummies using least squares
• The result generalizes to general problems in plotting Ŷ vs. Xβ̂

set.seed(123)
n <- 50
y <- runif(20*n)
group <- rep(1:20, each=n)
ybar <- tapply(y, group, mean)
ybar <- sort(ybar)
plot(1:20, ybar, type='n', axes=FALSE, ylim=c(.3,.7),
     xlab='Group', ylab='Group Mean')
lines(1:20, ybar)
points(1:20, ybar, pch=20, cex=.5)
axis(2)
axis(1, at=1:20, labels=FALSE)
for(j in 1:20) axis(1, at=j, labels=names(ybar)[j])
abline(h=.5, col=gray(.85))

[Figure 3.1: Sorted means from 20 samples of size 50 from a uniform [0,1] distribution. The reference line at 0.5 depicts the true population value of all of the means.]

• Prevent shrinkage by using pre-shrinkage
• Spiegelhalter [103]: variable selection is arbitrary; better prediction usually results from fitting all candidate variables and using shrinkage
• Shrinkage is closer to that expected from the full model fit than to that based on the number of significant variables [29]
• Ridge regression [75, 111]
• Penalized MLE [53, 61, 112]
• Heuristic shrinkage parameter of van Houwelingen and le Cessie [111, Eq. 77]:

    γ̂ = (model χ² − p) / model χ²

• For OLS: γ̂ = [(n − p − 1)/(n − 1)] × R²_adj / R², where R²_adj = 1 − (1 − R²)(n − 1)/(n − p − 1)
• p is close to the number of candidate variables
• Copas [29, Eq. 8.5] adds 2 to the numerator
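A minimal sketch (simulated data) of computing the heuristic shrinkage estimate from a logistic fit; the stats element names 'Model L.R.' and 'd.f.' are assumptions about the lrm fit object:

require(rms)
set.seed(11)
n <- 200
X <- matrix(rnorm(n*10), ncol=10)
d <- data.frame(y = rbinom(n, 1, plogis(0.5*X[,1] - 0.4*X[,2])), X)
f <- lrm(y ~ ., data=d)
lr    <- f$stats['Model L.R.']
p     <- f$stats['d.f.']
gamma <- (lr - p) / lr
gamma   # values well below 1 indicate substantial expected overfitting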
3.6 Collinearity

• Occurs when at least 1 predictor can be predicted well from the others
• Can be a blessing (data reduction, transformations)
• ↑ s.e. of β̂, ↓ power
• This is appropriate → asking too much of the data [25, p. 173]
• Variables compete in variable selection, and the chosen one is arbitrary
• Does not affect the joint influence of a set of highly correlated variables (use multiple d.f. tests)
• Does not at all affect predictions on the model construction sample
• Does not affect predictions on new data [85, pp. 379-381] if
  1. Extreme extrapolation is not attempted
  2. The new data have the same type of collinearities as the original data
• Example: LDL and total cholesterol — a problem only if more inconsistent in new data
• Example: age and age² — no problem
• One way to quantify collinearity for each predictor: variance inflation factors (VIF)
• General approach (maximum likelihood) — transform the information matrix to correlation form; VIF = diagonal of its inverse [35, 118]
• See Belsley [9, pp. 28-30] for problems with VIF
• Easy approach: SAS PROC VARCLUS [97], the S varclus function, or other clustering techniques: group highly correlated variables
• Can score each group (e.g., first principal component, PC1 [34]); summary scores are not collinear
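A minimal sketch (simulated data, hypothetical variable names) of computing variance inflation factors from a fit with rms::vif:

require(rms)
set.seed(12)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd=0.2)     # nearly collinear with x1
x3 <- rnorm(n)
y  <- x1 + x3 + rnorm(n)
f  <- ols(y ~ x1 + x2 + x3)
vif(f)    # large values for x1 and x2 flag the collinearity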
3.7 Data Reduction

• Unless n >> p, the model is unlikely to validate
• Data reduction: ↓ p
• Use the literature to eliminate unimportant variables
• Eliminate variables whose distributions are too narrow
• Eliminate candidate predictors that are missing in a large number of subjects, especially if those same predictors are likely to be missing for future applications of the model
• Use a statistical data reduction method such as incomplete principal components regression, nonlinear generalizations of principal components such as principal surfaces, sliced inverse regression, variable clustering, or ordinary cluster analysis on a measure of similarity between variables
3.7.1 Redundancy Analysis

• Remove variables that have poor distributions
  – E.g., categorical variables with fewer than 2 categories having at least 20 observations
• Use flexible parametric additive models to determine how well each variable can be predicted from the remaining variables
• Variables are dropped in a stepwise fashion, removing the most predictable variable at each step
• The remaining variables are used to predict
• The process continues until no variable still in the list of predictors can be predicted with an R² or adjusted R² greater than a specified threshold, or until dropping the variable with the highest R² (adjusted or ordinary) would cause a variable that was dropped earlier to no longer be predicted at the threshold from the now smaller list of predictors
• R/S function redun in the Hmisc package
• Related to principal variables [82] but faster
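A minimal sketch (simulated data, hypothetical variable names) of redundancy analysis with Hmisc::redun; the R² cutoff shown is an illustrative assumption:

require(Hmisc)
set.seed(13)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd=0.2)   # nearly redundant given x1 and x2
x4 <- runif(n)
redun(~ x1 + x2 + x3 + x4, r2=0.8)   # flags x3 as predictable from the others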
3.7.2 Variable Clustering

• Goal: separate variables into groups
  – variables within a group are correlated with each other
  – variables are not correlated with non-group members
• Score each dimension, and stop trying to separate the effects of factors measuring the same phenomenon
• Variable clustering [34, 97] (oblique-rotation PC analysis) → separate variables so that the first PC is representative of each group
• Can also do hierarchical cluster analysis on a similarity matrix based on squared Spearman or Pearson correlations, or more generally, Hoeffding's D [65].
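A minimal sketch (simulated data, hypothetical variable names) of variable clustering with Hmisc::varclus using squared Spearman correlations as the similarity measure:

require(Hmisc)
set.seed(14)
n  <- 300
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd=0.5)     # x1, x2 form one cluster
x3 <- rnorm(n); x4 <- x3 + rnorm(n, sd=0.5)     # x3, x4 form another
v  <- varclus(~ x1 + x2 + x3 + x4, similarity='spearman')
plot(v)   # dendrogram of variable clusters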
3.7.3 Transformation and Scaling Variables Without Using Y

• Reduce p by estimating transformations using associations with other predictors
• Purely categorical predictors — correspondence analysis [26, 33, 55, 76, 83]
• Mixture of qualitative and continuous variables: qualitative principal components
• Maximum total variance (MTV) of Young, Takane, and de Leeuw [83, 126]:
  1. Compute PC1 of the variables using the correlation matrix
  2. Use regression (with splines, dummies, etc.) to predict PC1 from each X — expand each Xj and regress it separately on PC1 to get working transformations
  3. Recompute PC1 on the transformed Xs
  4. Repeat 3-4 times until the variation explained by PC1 plateaus and the transformations stabilize
• Maximum generalized variance (MGV) method of Sarle [72, pp. 1267-1268]:
  1. Predict each variable from (current transformations of) all other variables
  2. For each variable, expand it into linear and nonlinear terms or dummies, and compute the first canonical variate
  3. For example, if there are only two variables X1 and X2 represented as quadratic polynomials, solve for a, b, c, d such that aX1 + bX1² has maximum correlation with cX2 + dX2²
  4. The goal is to transform each variable so that it is most similar to predictions from the other transformed variables
  5. Does not rely on PCs or variable clustering
• MTV (PC-based instead of canonical variates) and MGV are implemented in SAS PROC PRINQUAL [72]:
  1. Allows flexible transformations, including monotonic splines
  2. Does not allow restricted cubic splines, so it may be unstable unless monotonicity is assumed
  3. Allows simultaneous imputation but often yields wild estimates
3.7.4 Simultaneous Transformation and Imputation

S transcan function for data reduction & imputation:
• Initialize missings to medians (or the most frequent category)
• Initialize transformations to the original variables
• Take each variable in turn as Y
• Exclude observations missing on Y
• Expand Y (spline or dummy variables)
• Score (transform) Y using the first canonical variate
• Missing Y → predict the canonical variate from the Xs
• The imputed values can optionally be shrunk to avoid overfitting for small n or large p
• Constrain imputed values to be in the range of non-imputed ones
• Imputations on the original scale
  1. Continuous → back-solve with linear interpolation
  2. Categorical → classification tree (most frequent category) or match to the category whose canonical score is closest to the one predicted
• Multiple imputation: bootstrap or approximate Bayesian bootstrap
  1. Sample residuals multiple times (default M = 5)
  2. They are on the "optimally" transformed scale
  3. Back-transform
  4. fit.mult.impute works with aregImpute and transcan output to easily get imputation-corrected variances and an average β̂
• Option to insert constants as imputed values (ignored during transformation estimation); helpful when a lab value may be missing because the patient returned to normal
• Imputations and transformed values may be easily obtained for new data
• The S function Function will create a series of S functions that transform each predictor
• Example: n = 415 acutely ill patients
  1. Relate heart rate to mean arterial blood pressure
  2. Two blood pressures missing
  3. Heart rate not monotonically related to blood pressure
  4. See Figure 3.2
require(Hmisc)
getHdata(support)            # Get data frame from web site
heart.rate     <- support$hrt
blood.pressure <- support$meanbp
blood.pressure[400:401]

Mean Arterial Blood Pressure Day 3
[1] 151 136

blood.pressure[400:401] <- NA        # Create two missings
d <- data.frame(heart.rate, blood.pressure)
par(pch=46)
w <- transcan(~ heart.rate + blood.pressure, transformed=TRUE,
              imputed=TRUE, show.na=TRUE, data=d)

Convergence criterion: 2.901 0.035 0.007
Convergence in 4 iterations
R2 achieved in predicting each variable:

    heart.rate blood.pressure
         0.259          0.259

Adjusted R2:

    heart.rate blood.pressure
         0.254          0.253

w$imputed$blood.pressure

     400      401
132.4057 109.7741

plot(heart.rate, blood.pressure)
t <- w$transformed
plot(t[,'heart.rate'], t[,'blood.pressure'],
     xlab='Transformed hr', ylab='Transformed bp')
spe <- round(c(spearman(heart.rate, blood.pressure),
               spearman(t[,'heart.rate'], t[,'blood.pressure'])), 2)
ACE (Alternating Conditional Expectation) of Breiman and Friedman 17
  1. Uses the nonparametric "super smoother" 48
  2. Allows monotonicity constraints and categorical variables
  3. Does not handle missing data
• These methods find marginal transformations
• Check the adequacy of transformations using Y
  1. Graphical
  2. Nonparametric smoothers (X vs. Y)
Figure 3.2: Transformations fitted using transcan. Tick marks indicate the two imputed values for blood pressure. The lower left plot contains raw data (Spearman ρ = −0.02); the lower right is a scatterplot of the corresponding transformed values (ρ = −0.13). Data courtesy of the SUPPORT study 70.
  3. Expand the original variable using a spline; test the additional predictive information over the original transformation
3.7.5 Simple Scoring of Variable Clusters

• Try to score groups of transformed variables with PC1
• Reduces d.f. by pre-transforming variables and by combining multiple variables
• Later you may want to break a group apart, but delete all variables in groups whose summary scores do not add significant information
• Sometimes simplify a cluster score by finding a subset of its constituent variables which predict it with high R².

Series of dichotomous variables:
• Construct X1 = 0-1 according to whether any variable is positive
• Construct X2 = number of positives
• Test whether the original variables add to X1 or X2
3.7.6 Simplifying Cluster Scores

3.7.7 How Much Data Reduction Is Necessary?

Using expected shrinkage to guide data reduction:
• Fit the full model with all candidates, p d.f., LR = likelihood ratio χ²
• Compute γ̂
• If γ̂ < 0.9, consider a shrunken estimator from the whole model, or data reduction (again not using Y)
• Let q = regression d.f. for the reduced model
• Assume the best case: the discarded dimensions had no association with Y
• Expected loss in LR is p − q
• New shrinkage: [LR − (p − q) − q] / [LR − (p − q)]
• Requiring the new shrinkage to be ≥ 0.9 and solving for q → q ≤ (LR − p)/9
• Under these assumptions, there is no hope unless the original LR > p + 9
• No χ² lost by dimension reduction → q ≤ LR/10

Example:
• Binary logistic model, 45 events on 150 subjects
• 10:1 rule → analyze 4.5 d.f. total
• Analyst wishes to include age, sex, and 10 others
• Not known if age is linear or if age and sex are additive
• 4 knots → 3 + 1 + 1 d.f. for age and sex if the interaction is restricted to be linear
• Full model with 15 d.f. has LR = 50
• Expected shrinkage factor (50 − 15)/50 = 0.7
• LR > 15 + 9 = 24 → reduction may help
• Reduction to q = (50 − 15)/9 ≈ 4 d.f. necessary
• Have to assume age is linear and reduce the other 10 variables to 1 d.f.
• Separate hypothesis tests intended → use the full model, adjust for multiple comparisons
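A minimal sketch of the arithmetic in this example (values taken from the bullets above):

LR <- 50; p <- 15
(LR - p)/LR        # expected shrinkage factor: 0.7
(LR - p)/9         # largest q (about 3.9 d.f.) keeping expected shrinkage >= 0.9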
Summary of Some Data Reduction Methods

Goal: Group predictors so that each group represents a single dimension that can be summarized with a single score
  Reasons: ↓ d.f. arising from multiple predictors; make PC1 a more reasonable summary
  Methods: Variable clustering (subject matter knowledge; group predictors to maximize the proportion of variance explained by PC1 of each group; hierarchical clustering using a matrix of similarity measures between predictors)

Goal: Transform predictors
  Reasons: ↓ d.f. due to nonlinear and dummy variable components; allows predictors to be optimally combined; makes PC1 a more reasonable summary; use in a customized model for imputing missing values on each predictor
  Methods: Maximum total variance on a group of related predictors; canonical variates on the total set of predictors

Goal: Score a group of predictors
  Reasons: ↓ d.f. for the group to unity
  Methods: PC1; simple point scores

Goal: Multiple dimensional scoring of all predictors
  Reasons: ↓ d.f. for all predictors combined
  Methods: Principal components 1, 2, ..., k (k < p) computed from all transformed predictors
3.8 Overly Influential Observations

• Every observation should influence the fit
• Major results should not rest on 1 or 2 observations
• Overly influential observations → ↑ variance of predictions
• Also affects variable selection

Reasons for influence:
• Too few observations for the complexity of the model (see Sections 3.7, 3.3)
• Data transcription or entry errors
• Extreme values of a predictor
  1. Sometimes the subject is so atypical it should be removed from the dataset
  2. Sometimes truncate measurements where the data density ends
  3. Example: n = 4000, 2000 deaths, white blood count range 500-100,000; .05, .95 quantiles = 2755, 26700
  4. Linear spline function fit
  5. Sensitive to WBC > 60000 (n = 16)
  6. Predictions stable if WBC truncated to 40000 (n = 46 above 40000)
• Disagreements between predictors and response: ignore unless there are extreme values or another explanation
• Example: n = 8000, one extreme predictor value not on the straight-line relationship with the other (X, Y) → χ² = 36 for H0: linearity
Statistical measures:
• Leverage: capacity to be influential (not necessarily influential). Diagonals of the "hat matrix" H = X(X′X)⁻¹X′ measure how an observation predicts its own response 10
• hii > 2(p + 1)/n may signal a high-leverage point 10
• DFBETAS: change in β̂ upon deletion of each observation, scaled by s.e.
• DFFIT: change in Xβ̂ upon deletion of each observation
• DFFITS: DFFIT standardized by the s.e. of β̂
• Some classify an observation as overly influential when |DFFITS| > 2√((p + 1)/(n − p − 1)) 10
• Others examine the entire distribution for "outliers"
• No substitute for careful examination of the data 23,102
• Maximum likelihood estimation requires 1-step approximations
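A minimal sketch of screening for overly influential observations with the rms which.influence function (hypothetical model; the fit must store the design matrix and response via x=TRUE, y=TRUE):

require(rms)
f <- ols(y ~ x1 + x2, data=d, x=TRUE, y=TRUE)
w <- which.influence(f, cutoff=0.3)   # observations changing any coefficient
show.influence(w, d)                  # by more than 0.3 s.e. (DFBETAS units)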
3.9 Comparing Two Models

• Level playing field (independent datasets, same number of candidate d.f., careful bootstrapping)
• Criteria:
  1. calibration
  2. discrimination
  3. face validity
  4. measurement errors in required predictors
  5. use of continuous predictors (which are usually better defined than categorical ones)
  6. omission of "insignificant" variables that nonetheless make sense as risk factors
  7. simplicity (though this is less important with the availability of computers)
  8. lack of fit for specific types of subjects
• If the goal is to rank-order: ignore calibration
• Otherwise, dismiss a model having poor calibration
• Good calibration → compare discrimination (e.g., R² 86, model χ², Somers' Dxy, Spearman's ρ, area under the ROC curve)
• Worthwhile to compare models on a measure not used to optimize either model, e.g., mean absolute error or median absolute error if using OLS
• Rank measures may not give enough credit to extreme predictions → model χ², R², examine extremes of the distribution of Ŷ
• Examine differences in predicted values from the two models
• See 90, 91 for discussions and examples of low power for testing differences in ROC areas.
3.10 Summary: Possible Modeling Strategies

Greenland 56 discusses many important points:
• Stepwise variable selection on confounders leaves important confounders uncontrolled
• Shrinkage is far superior to variable selection
• Variable selection does more damage to confidence interval widths than to point estimates
• Claims about unbiasedness of ordinary MLEs are misleading because they assume the model is correct and is the only model entertained
• "models need to be complex to capture uncertainty about the relations ... an honest uncertainty assessment requires parameters for all effects that we know may be present. This advice is implicit in an antiparsimony principle often attributed to L.J. Savage 'All models should be as big as an elephant' (see Draper, 1995)"
Global strategies:
• Use a method known not to work well (e.g., stepwise variable selection without penalization; recursive partitioning), document how poorly the model performs (e.g., using the bootstrap), and use the model anyway
• Develop a black-box model that performs poorly and is difficult to interpret (e.g., does not incorporate penalization)
• Develop a black-box model that performs well and is difficult to interpret
• Develop interpretable approximations to the black box
• Develop an interpretable model (e.g., give priority to additive effects) that performs well and is likely to perform equally well on future data from the same stream
Preferred strategy in a nutshell:
• Decide how many d.f. can be spent
• Decide where to spend them
• Spend them
• Don't reconsider, especially if inference is needed
3.10.1 Developing Predictive Models

1. Assemble accurate, pertinent data and lots of it, with wide distributions for X.
2. Formulate good hypotheses: specify relevant candidate predictors and possible interactions. Don't use Y to decide which X's to include.
3. Characterize subjects with missing Y. Delete such subjects only in rare circumstances 32. For certain models it is effective to multiply impute Y.
4. Characterize and impute missing X. In most cases use multiple imputation based on X and Y.
5. For each predictor specify the complexity or degree of nonlinearity that should be allowed (more for important predictors or for large n) (Section 3.1).
6. Do data reduction if needed (pre-transformations, combinations), or use penalized estimation 61.
7. Use the entire sample in model development.
8. Can do highly structured testing to simplify the "initial" model:
   (a) Test the entire group of predictors with a single P-value
   (b) Make each continuous predictor have the same number of knots, and select the number that optimizes AIC
   (c) Test the combined effects of all nonlinear terms with a single P-value
9. Make tests of linearity of effects in the model only to demonstrate to others that such effects are often statistically significant. Don't remove individual insignificant effects from the model.
10. Check additivity assumptions by testing pre-specified interaction terms. Use a global test and either keep all or delete all interactions.
11. Check to see if there are overly influential observations.
12. Check distributional assumptions and choose a different model if needed.
13. Do limited backwards step-down variable selection if parsimony is more important than accuracy 103. But confidence limits, etc., must account for variable selection (e.g., bootstrap).
14. This is the "final" model.
15. Interpret the model graphically and by computing predicted values and appropriate test statistics. Compute pooled tests of association for collinear predictors.
16. Validate this model for calibration and discrimination ability, preferably using bootstrapping.
17. Shrink parameter estimates if there is overfitting but no further data reduction is desired (unless shrinkage is built into the estimation).
18. When missing values were imputed, adjust the final variance-covariance matrix for imputation. Do this as early as possible because it will affect other findings.
19. When all steps of the modeling strategy can be automated, consider using Faraway's method 45 to penalize for the randomness inherent in the multiple steps.
20. Develop simplifications to the final model as needed.
3.10.2 Developing Models for Effect Estimation

1. Less need for parsimony; even less need to remove insignificant variables from the model (otherwise CLs too narrow)
2. Careful consideration of interactions; their inclusion forces estimates to be conditional and raises variances
3. If the variable of interest is mostly the one that is missing, multiple imputation is less valuable
4. Complexity of the main variable is specified by prior beliefs; a compromise between variance and bias
5. Don't penalize terms for the variable of interest
6. Model validation less necessary
3.10.3 Developing Models for Hypothesis Testing

1. Virtually the same as the previous strategy
2. Interactions require tests of effect at varying values of another variable, or "main effect + interaction" joint tests (e.g., is treatment effective for either sex, allowing the effects to be different)
3. Validation may help quantify overadjustment
Chapter 4

Describing, Resampling, Validating, and Simplifying the Model

4.1 Describing the Fitted Model

4.1.1 Interpreting Effects
• Regression coefficients, if 1 d.f. per factor and no interactions
• Not standardized regression coefficients
• Many programs print meaningless estimates such as the effect of increasing age² by one unit, holding age constant
• Need to account for nonlinearity and interaction, and use meaningful ranges
• For monotonic relationships, estimate Xβ̂ at quartiles of continuous variables, separately for various levels of interacting factors
• Subtract estimates, anti-log, e.g., to get inter-quartile-range odds or hazards ratios. Base C.L. on the s.e. of the difference.
• Plot the effect of each predictor on Xβ̂ or some transformation of Xβ̂. See also 69.
• Nomogram
• Use a regression tree to approximate the full model
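For the inter-quartile-range effects bullet above, a minimal sketch using the rms summary function on a hypothetical logistic fit (a datadist must be set so the quartiles are known; see Chapter 5 for the full worked example):

require(rms)
dd <- datadist(d);  options(datadist='dd')
f <- lrm(y ~ rcs(age,4) + sex, data=d)
summary(f)          # default: inter-quartile-range effects, shown as odds ratios
plot(summary(f))    # graphical display with confidence limits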
4.1.2 Indexes of Model Performance

Error measures:
• Central tendency of prediction errors
  – Mean absolute prediction error: mean |Y − Ŷ|
  – Mean squared prediction error
    * Binary Y: Brier score (quadratic proper scoring rule)
  – Logarithmic proper scoring rule (average log-likelihood)
• Discrimination measures
  – Pure discrimination: rank correlation of (Ŷ, Y)
    * Spearman ρ, Kendall τ, Somers' Dxy
    * Y binary → Dxy = 2 × (C − ½), where C = concordance probability = area under the receiver operating characteristic curve ∝ Wilcoxon-Mann-Whitney statistic
  – Mostly discrimination: R²
    * R²adj: overfitting-corrected if the model was pre-specified
  – The Brier score can be decomposed into discrimination and calibration components
  – Discrimination measures based on variation in Ŷ
    * regression sum of squares
    * g-index
• Calibration measures
  – calibration-in-the-large: average Ŷ vs. average Y
  – high-resolution calibration curve (calibration-in-the-small)
  – calibration slope and intercept
  – maximum absolute calibration error
  – mean absolute calibration error
  – 0.9 quantile of calibration error
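A minimal sketch relating some of the indexes above, using the Hmisc somers2 function on hypothetical predicted probabilities p and a 0/1 outcome y:

require(Hmisc)
somers2(p, y)      # returns C (ROC area), Dxy = 2*(C - 0.5), n, Missing
mean((p - y)^2)    # Brier score (mean squared prediction error)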
g-Index

• Based on Gini's mean difference
  – mean over all possible i ≠ j of |Zi − Zj|
  – an interpretable, robust, highly efficient measure of variation
• g = Gini's mean difference of Xiβ̂ = Ŷ
• Example: Y = systolic blood pressure; g = 11 mmHg is the typical difference in Ŷ
• Independent of censoring, etc.
• For models in which the anti-log of a difference in Ŷ represents a meaningful ratio (odds ratios, hazard ratios, ratio of medians): gr = exp(g)
• For models in which Ŷ can be turned into a probability estimate (e.g., logistic regression): gp = Gini's mean difference of P̂
• These g-indexes represent, e.g., "typical" odds ratios, "typical" risk differences
• Can define a partial g
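A minimal sketch of the g-index computed directly with Hmisc's GiniMd, assuming lp holds the linear predictor Xβ̂ from a fitted logistic model:

require(Hmisc)
g <- GiniMd(lp)       # Gini's mean difference of the linear predictor
exp(g)                # g_r: a "typical" odds ratio
GiniMd(plogis(lp))    # g_p: on the probability scale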
4.2 The Bootstrap

• If we know the population model, we can use simulation or analytic derivations to study the behavior of a statistical estimator
• Suppose Y has cumulative distribution function F(y) = Prob{Y ≤ y}
• We have a sample of size n from F(y): Y1, Y2, ..., Yn
• Steps:
  1. Repeatedly simulate samples of size n from F
  2. Compute the statistic of interest
  3. Study its behavior over B repetitions
• Example: 1000 samples, 1000 sample medians, compute their sample variance
• F unknown → estimate by the empirical distribution function
  Fn(y) = (1/n) Σ_{i=1}^{n} I(Yi ≤ y),
  where I(w) is 1 if w is true, 0 otherwise.
• Example: a sample of size n = 30 from a normal distribution with mean 100 and SD 10
set.seed(6)
x   <- rnorm(30, 100, 20)
xs  <- seq(50, 150, length=150)
cdf <- pnorm(xs, 100, 20)
plot(xs, cdf, type='l', ylim=c(0,1),
     xlab=expression(x),
     ylab=expression(paste("Prob[", X <= x, "]")))
lines(ecdf(x), cex=.5)
• Fn corresponds to a density function placing probability 1/n at each observed data point (k/n if a point is duplicated k times)
• Pretend that F ≡ Fn
• Sampling from Fn ≡ sampling with replacement from the observed data Y1, ..., Yn
Figure 4.1: Empirical and population cumulative distribution functions
• Large n → 1 − e⁻¹ ≈ 0.632 of the original data points are selected at least once in each bootstrap sample
• Some observations are not selected; others are selected more than once
• Efron's bootstrap → a general-purpose technique for estimating properties of estimators without assuming or knowing the distribution of the data, F
• Take B samples of size n with replacement; choose B so that the summary measure of the individual statistics ≈ the summary if B = ∞
• The bootstrap is based on the distribution of observed differences between a resampled parameter estimate and the original estimate, telling us about the distribution of unobservable differences between the original estimate and the unknown parameter
Example: for the data (1, 5, 6, 7, 8, 9), obtain a 0.80 confidence interval for the population median, and an estimate of the population expected value of the sample median (the latter only to estimate the bias in the original estimate of the median).
options(digits=3)
y <- c(2,5,6,7,8,9,10,11,12,13,14,19,20,21)
y <- c(1,5,6,7,8,9)
set.seed(17)
n   <- length(y)
n2  <- n/2
n21 <- n2 + 1
B   <- 400
M   <- double(B)
plot(0, 0, xlim=c(0,B), ylim=c(3,9),
     xlab="Bootstrap Samples Used",
     ylab="Mean and 0.1, 0.9 Quantiles", type="n")
for(i in 1:B) {
  s <- sample(1:n, n, replace=TRUE)
  x <- sort(y[s])
  m <- .5*(x[n2] + x[n21])
  M[i] <- m
  if(i <= 20) {
    w <- as.character(x)
    cat(w, "& &", sprintf('%.1f', m),
        if(i < 20) "\\\\\n" else "\\\\\\hline\n",
        file='~/doc/rms/validate/tab.tex', append=i > 1)
  }
  points(i, mean(M[1:i]), pch=46)
  if(i >= 10) {
    q <- quantile(M[1:i], c(.1, .9))
    points(i, q[1], pch=46, col='blue')
    points(i, q[2], pch=46, col='blue')
  }
}
table(M)

M           1    3  3.5    4  4.5    5  5.5    6  6.5    7  7.5    8  8.5    9
Frequency   6   10    7    8   22    3   43   75   59   66   47   42   11    1

hist(M, nclass=length(unique(M)), xlab="", main="")
First 20 samples:

Bootstrap Sample   Sample Median
1 6 6 7 8 9        6.5
1 5 5 5 6 8        5.0
5 7 8 9 9 9        8.5
7 7 7 8 8 9        7.5
1 5 7 7 9 9        7.0
1 5 6 6 7 8        6.0
7 8 8 8 8 8        8.0
5 5 5 7 9 9        6.0
1 5 5 7 7 9        6.0
1 5 5 7 7 8        6.0
1 1 5 5 7 7        5.0
1 1 5 5 7 8        5.0
1 5 5 7 7 8        6.0
1 5 6 7 8 8        6.5
1 5 6 7 9 9        6.5
6 6 7 7 8 9        7.0
1 5 7 8 8 9        7.5
6 6 8 9 9 9        8.5
1 1 5 5 6 9        5.0
1 6 8 9 9 9        8.5
• The histogram tells us whether we can assume normality for the bootstrap medians or need to use quantiles of the medians to construct the C.L.
Figure 4.2: Estimating properties of the sample median using the bootstrap

• Need high B for quantiles, low B for variance (but see [14])
4.3 Model Validation

4.3.1 Introduction

• External validation (best: another country at another time); also validates sampling and measurements
• Internal
  – apparent (evaluate fit on the same data used to create the fit)
  – data splitting
  – cross-validation
  – bootstrap: get an overfitting-corrected accuracy index
• The best way to make a model fit the data well is to discard much of the data
• Predictions on another dataset will be inaccurate
• Need an unbiased assessment of predictive accuracy
4.3.2 Which Quantities Should Be Used in Validation?

• OLS: R² is one good measure for quantifying the drop-off in predictive ability
• Example: n = 10, p = 9; apparent R² = 1 but R² will be close to zero on new subjects
• Example: n = 20, p = 10; apparent R² = 0.9, R² on new data 0.7, R²adj = 0.79
• Adjusted R² solves much of the bias problem, assuming p in its formula is the largest number of parameters ever examined against Y
• Few other adjusted indexes exist
• Also need to validate models with phantom d.f.
• Cross-validation or the bootstrap can provide an unbiased estimate of any index; the bootstrap has higher precision
• Two main types of quantities to validate:
  1. Calibration or reliability: ability to make unbiased estimates of the response (Ŷ vs. Y)
  2. Discrimination: ability to separate responses.
     OLS: R², g-index; binary logistic model: ROC area, equivalent to the rank correlation between the predicted probability of the event and the 0/1 event
• Unbiased validation is nearly always necessary, to detect overfitting
Data-S
plitting
�Splitdata
into
trainingandtest
sets
�Interestingto
compare
indexof
accuracy
intrain-
ingandtest
�Freezeparametersfrom
training
CHAPTER
4.
DESCRIB
ING,RESAMPLIN
G,VALID
ATIN
G,AND
SIM
PLIF
YIN
GTHE
MODEL
111
�Makesure
youallow
R2=
1−
SSE/S
ST
for
test
sampleto
be<
0
�Don’t
compute
ordinary
R2on
Xβ
vs.Y;this
allowsforlinearrecalibration
aXβ+bvs.Y
�Testsamplemustbelargeenough
toobtain
very
accurate
assessmentof
accuracy
�Trainingsampleiswhat’sleft
�Example:
overallsam
plen=300,training
sample
n=200,
developmodel,freeze
β,predicton
test
sample(n
=100),R2=1−
∑
(Yi−
Xiβ)2
∑
(Yi−
Y)2
.
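A minimal sketch of the calculation above (hypothetical data frame d with 300 rows, response y and predictors x1, x2):

train <- d[1:200, ];  test <- d[201:300, ]
f    <- lm(y ~ x1 + x2, data=train)      # freeze beta-hat from training
pred <- predict(f, newdata=test)
1 - sum((test$y - pred)^2) / sum((test$y - mean(test$y))^2)   # test-sample R2; may be < 0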
• Disadvantages of data splitting:
  1. Costly in ↓ n 16,95
  2. Requires the decision to split at the beginning of the analysis
  3. Requires a larger held-out sample than cross-validation
  4. Results vary if split again
  5. Does not validate the final model (from the recombined data)
  6. Not helpful in getting CLs corrected for variable selection
4.3.4 Improvements on Data-Splitting: Resampling

• No sacrifice in sample size
• Works when the modeling process is automated
• Bootstrap excellent for studying the arbitrariness of variable selection 98
• Cross-validation solves many problems of data splitting 40,100,111,123
• Example of cross-validation:
  1. Split the data at random into 10 tenths
  2. Leave out 1/10 of the data at a time
  3. Develop the model on the remaining 9/10, including any variable selection, pre-testing, etc.
  4. Freeze the coefficients, evaluate on the held-out 1/10
  5. Average R² over the 10 repetitions
• Drawbacks:
  1. Choice of the number of groups and repetitions
  2. Doesn't show the full variability of variable selection
  3. Does not validate the full model
  4. Lower precision than the bootstrap
  5. Need to do 50 repeats of 10-fold cross-validation to ensure adequate precision
• Randomization method
  1. Randomly permute Y
  2. Optimism = performance of the fitted model compared to what is expected by chance
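A minimal sketch, assuming an rms fit stored with x=TRUE, y=TRUE: the rms validate function implements both the cross-validation and the randomization approaches just described:

require(rms)
f <- ols(y ~ x1 + x2, data=d, x=TRUE, y=TRUE)
validate(f, method='crossvalidation', B=10)   # one 10-fold cross-validation
validate(f, method='randomization',   B=40)   # permute Y to estimate optimism expected by chance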
4.3.5 Validation Using the Bootstrap

• Estimate the optimism of the final whole-sample fit without holding out data
• From the original X and Y, select a sample of size n with replacement
• Derive the model from the bootstrap sample
• Apply it to the original sample
• The simple bootstrap uses the average of indexes computed on the original sample
• Estimated optimism = difference in indexes
• Repeat about B = 100 times, get the average expected optimism
• Subtract the average optimism from the apparent index in the final model
• Example: n = 1000; we have developed a final model that is hopefully ready to publish. Call the estimates from this final model β̂.
  – the final model has apparent R² (R²app) = 0.4
  – how inflated is R²app?
  – get resamples of size 1000 with replacement from the original 1000
  – for each resample compute R²boot = apparent R² in the bootstrap sample
  – freeze these coefficients (call them β̂boot), apply to the original (whole) sample (Xorig, Yorig) to get R²orig = R²(Xorig β̂boot, Yorig)
  – optimism = R²boot − R²orig
  – average over B = 100 optimisms to get the estimated optimism
  – R² overfitting-corrected = R²app − optimism
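A minimal sketch of the optimism loop above, done by hand for Dxy (a rank index that is easy to compute at frozen coefficients); validate(f, B=100) automates this for R², slope, and several other indexes at once. The data frame d with 0/1 response y and predictors x1, x2 is hypothetical:

require(rms); require(Hmisc)
f <- lrm(y ~ x1 + x2, data=d)
Dxy.app <- somers2(predict(f, d), d$y)['Dxy']        # apparent Dxy
opt <- numeric(100)
for(i in 1:100) {
  j <- sample(nrow(d), replace=TRUE)
  g <- lrm(y ~ x1 + x2, data=d[j, ])                 # refit on bootstrap sample
  Dxy.boot <- somers2(predict(g, d[j, ]), d$y[j])['Dxy']
  Dxy.orig <- somers2(predict(g, d),      d$y   )['Dxy']   # frozen coefficients, original data
  opt[i]   <- Dxy.boot - Dxy.orig
}
Dxy.app - mean(opt)                                  # overfitting-corrected Dxy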
• This estimates the unconditional (not conditional on X) distribution of R², etc. [45, p. 217]
• Conditional estimates would require assuming the very model one is trying to validate
• Efron's ".632" method may perform better (reduce bias further) for small n 40, [41, p. 253], 42
The bootstrap is also useful for assessing calibration in addition to discrimination:
• Fit C(Y|X) = Xβ on the bootstrap sample
• Re-fit C(Y|X) = γ0 + γ1 Xβ̂ on the same data
• γ̂0 = 0, γ̂1 = 1
• Test data (the original dataset): re-estimate γ0, γ1
• γ̂1 < 1 if overfit; γ̂0 > 0 to compensate
• γ̂1 quantifies overfitting and is useful for improving calibration 103
• Use Efron's method to estimate the optimism in (0, 1); estimate (γ0, γ1) by subtracting the optimism from (0, 1)
• See also Copas 30 and van Houwelingen and le Cessie [111, p. 1318]

See [47] for warnings about the bootstrap, and [40] for variations on the bootstrap to reduce bias.

Use the bootstrap to choose between full and reduced models:
• Bootstrap estimate of accuracy for the full model
• Repeat, using the chosen stopping rule for each resample
• The full fit usually outperforms the reduced model 103
• Stepwise modeling often reduces optimism, but this is not offset by the loss of information from deleting marginal variables
Method           Apparent Rank Correlation   Over-Optimism   Bias-Corrected
                 of Predicted vs. Observed                   Correlation
Full Model              0.50                      0.06            0.44
Stepwise Model          0.47                      0.05            0.42

In this example, stepwise modeling lost a possible 0.50 − 0.47 = 0.03 of predictive discrimination. The full model fit will especially be an improvement when:
1. The stepwise selection deleted several variables which were almost significant.
2. These marginal variables have some real predictive value, even if it is slight.
3. There is no small set of extremely dominant variables that would be easily found by stepwise selection.
Other issues:
• See [111] for many interesting ideas
• Faraway 45 shows how the bootstrap is used to penalize for choosing transformations for Y, outlier and influence checking, variable selection, etc., simultaneously
• Brownstone [20, p. 74] feels that "theoretical statisticians have been unable to analyze the sampling properties of [usual multi-step modeling strategies] under realistic conditions" and concludes that the modeling strategy must be completely specified and then bootstrapped to get consistent estimates of variances and other sampling properties
• See Blettner and Sauerbrei 13 and Chatfield 24 for more interesting examples of problems resulting from data-driven analyses.
4.4 Simplifying the Final Model by Approximating It

4.4.1 Difficulties Using Full Models

• Predictions are conditional on all variables; standard errors ↑ when predicting for a low-frequency category
• Collinearity
• Can average predictions over categories to marginalize, ↓ s.e.
4.4.2 Approximating the Full Model

• The full model is the gold standard
• Approximate it to any desired degree of accuracy
• If approximating with a tree, the best cross-validating tree will have 1 obs./node
• Can use least squares to approximate the model by predicting Ŷ = Xβ̂
• When the original model was also fit using least squares, the coefficients of the approximate model fitted against Ŷ ≡ the coefficients of the subset of variables fitted against Y (as in stepwise)
• Model approximation still has some advantages:
  1. Uses an unbiased estimate of σ from the full fit
  2. The stopping rule is less arbitrary
  3. Inheritance of shrinkage
• If the estimates from the full model are β̂ and the approximate model is based on a subset T of the predictors X, the coefficients of the approximate model are Wβ̂, where W = (T′T)⁻¹T′X
• The variance matrix of the reduced coefficients is WVW′
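A minimal sketch of the matrix algebra above (X = full-model design matrix, Z = the columns for the retained subset T, beta = full-model coefficients, V = their covariance matrix; all objects are hypothetical):

W        <- solve(t(Z) %*% Z, t(Z) %*% X)   # W = (T'T)^{-1} T'X
beta.apx <- W %*% beta                      # coefficients of the approximate model
V.apx    <- W %*% V %*% t(W)                # their variance matrix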
4.5 How Do We Break Bad Habits?

• Insist on validation of predictive models and discoveries
• Show collaborators that split-sample validation is not appropriate unless the number of subjects is huge
  – Split more than once and see volatile results
  – Calculate a confidence interval for the predictive accuracy in the test dataset and show that it is very wide
• Run a simulation study with no real associations and show that associations are easy to find
• Analyze the collaborator's data after randomly permuting the Y vector and show some positive findings
• Show that alternative explanations are easy to posit
  – The importance of a risk factor may disappear if 5 "unimportant" risk factors are added back to the model
  – Omitted main effects can explain apparent interactions
Chapter 5

S Software

S allows interaction spline functions, a wide variety of predictor parameterizations, a wide variety of models, a unifying model formula language, and model validation by resampling.

S is comprehensive:
• Easy to write S functions for new models → a wide variety of modern regression models implemented (trees, nonparametric, ACE, AVAS, survival models for multiple events)
• Designs can be generated for any model → all handle "class" variables, interactions, nonlinear expansions
• Single S objects (e.g., the fit object) can be self-documenting → automatic hypothesis tests, predictions for new data
• Superior graphics
• Classes and generic functions
5.1 The S Modeling Language

The S statistical modeling language:

response ~ terms

y ~ age + sex                   # age + sex main effects
y ~ age + sex + age:sex         # add second-order interaction
y ~ age*sex                     # second-order interaction +
                                # all main effects
y ~ (age + sex + pressure)^2
                                # age+sex+pressure+age:sex+age:pressure...
y ~ (age + sex + pressure)^2 - sex:pressure
                                # all main effects and all 2nd order
                                # interactions except sex:pressure
y ~ (age + race)*sex            # age+race+sex+age:sex+race:sex
y ~ treatment*(age*race + age*sex)   # no interact. with race,sex
sqrt(y) ~ sex*sqrt(age) + race  # functions, with dummy variables generated if
                                # race is an S factor (classification) variable
y ~ sex + poly(age,2)           # poly generates orthogonal polynomials
race.sex <- interaction(race,sex)
y ~ age + race.sex              # for when you want dummy variables for
                                # all combinations of the factors
The formula for a regression model is given to a modeling function, e.g.

lrm(y ~ rcs(x,4))

is read "use a logistic regression model to model y as a function of x, representing x by a restricted cubic spline with 4 default knots"[a].

The update function re-fits a model with changes in terms or data:

f  <- lrm(y ~ rcs(x,4) + x2 + x3)
f2 <- update(f, subset=sex=="male")
f3 <- update(f, .~. - x2)          # remove x2 from model
f4 <- update(f, .~. + rcs(x5,5))   # add rcs(x5,5) to model
f5 <- update(f, y2 ~ .)            # same terms, new response var.
5.2 User-Contributed Functions

• S is a high-level object-oriented language.
• S-Plus (UNIX, Linux, Microsoft Windows)
• R (UNIX, Linux, Mac, Windows)
• Multitude of user-contributed functions freely available
• International community of users

[a] lrm and rcs are in the rms package.

Some S functions:
• See Venables and Ripley
• Hierarchical clustering: hclust
• Principal components: princomp, prcomp
• Canonical correlation: cancor
• Nonparametric transform-both-sides additive models: ace, avas
• Parametric transform-both-sides additive models: areg, areg.boot (Hmisc package in R, S-Plus)
• Rank correlation methods: rcorr, hoeffd, spearman2 (Hmisc)
• Variable clustering: varclus (Hmisc)
• Single imputation: transcan (Hmisc)
• Multiple imputation: aregImpute (Hmisc)
• Restricted cubic splines: rcspline.eval (Hmisc)
• Re-state a restricted spline in simpler form: rcspline.restate (Hmisc)
5.3 The rms Package

• datadist function to compute predictor distribution summaries

y ~ sex + lsp(age,c(20,30,40,50,60)) +
    sex %ia% lsp(age,c(20,30,40,50,60))

E.g. restrict the age × cholesterol interaction to be of the form AF(B) + BG(A):

y ~ lsp(age,30) + rcs(cholesterol,4) +
    lsp(age,30) %ia% rcs(cholesterol,4)

Special fitting functions by Harrell simplify the procedures described in these notes:
Table 5.1: rms Fitting Functions

Function   Purpose                                               Related S Functions
ols        Ordinary least squares linear model                   lm
lrm        Binary and ordinal logistic regression model;         glm
           has options for penalized MLE
psm        Accelerated failure time parametric survival models   survreg
cph        Cox proportional hazards regression                   coxph
bj         Buckley-James censored least squares model            survreg, lm
Glm        rms version of glm                                    glm
Gls        rms version of gls                                    gls (nlme package)
Rq         rms version of rq                                     rq (quantreg package)
Table 5.2: rms Transformation Functions

Function   Purpose                                               Related S Functions
asis       No post-transformation (seldom used explicitly)       I
rcs        Restricted cubic splines                              ns
pol        Polynomial using standard notation                    poly
lsp        Linear spline
catg       Categorical predictor (seldom used explicitly)        factor
scored     Ordinal categorical variables                         ordered
matrx      Keep variables as a group for anova and fastbw        matrix
strat      Non-modeled stratification factors (used for cph      strata
           only)
Function            Purpose                                                    Related Functions
print               Print parameters and statistics of fit
coef                Fitted regression coefficients
formula             Formula used in the fit
specs               Detailed specifications of fit
vcov                Fetch covariance matrix
logLik              Fetch maximized log-likelihood
AIC                 Fetch AIC with option to put on chi-square basis
lrtest              Likelihood ratio test for two nested models
univarLR            Compute all univariable LR χ2
robcov              Robust covariance matrix estimates
bootcov             Bootstrap covariance matrix estimates and bootstrap
                    distributions of estimates
pentrace            Find optimum penalty factors by tracing effective AIC
                    for a grid of penalties
effective.df        Print effective d.f. for each type of variable in model,
                    for penalized fit or pentrace result
summary             Summary of effects of predictors
plot.summary        Plot continuously shaded confidence bars for results of
                    summary
anova               Wald tests of most meaningful hypotheses
plot.anova          Graphical depiction of anova
contrast            General contrasts, C.L., tests
gendata             Easily generate predictor combinations
predict             Obtain predicted values or design matrix
Predict             Obtain predicted values and confidence limits easily,
                    varying a subset of predictors with others set at
                    default values
plot.Predict        Plot effects of predictors
fastbw              Fast backward step-down variable selection                step
residuals (resid)   Residuals, influence stats from fit
sensuc              Sensitivity analysis for unmeasured confounder
which.influence     Which observations are overly influential                 residuals
latex               LaTeX representation of fitted model                      Function
Function            S function analytic representation of Xβ̂ from a          latex
                    fitted regression model
Function            Purpose                                                    Related Functions
Hazard              S function analytic representation of a fitted hazard
                    function (for psm)
Survival            S function analytic representation of fitted survival
                    function (for psm, cph)
Quantile            S function analytic representation of fitted function
                    for quantiles of survival time (for psm, cph)
Mean                S function analytic representation of fitted function
                    for mean survival time or for ordinal logistic
nomogram            Draws a nomogram for the fitted model                     latex, plot
survest             Estimate survival probabilities (psm, cph)                survfit
survplot            Plot survival curves (psm, cph)                           plot.survfit
validate            Validate indexes of model fit using resampling
val.prob            External validation of a probability model                lrm
val.surv            External validation of a survival model                   calibrate
calibrate           Estimate calibration curve using resampling               val.prob
vif                 Variance inflation factors for fitted model
naresid             Bring elements corresponding to missing data back into
                    predictions and residuals
naprint             Print summary of missing values
impute              Impute missing values                                     aregImpute
Example:
• treat: categorical variable with levels "a", "b", "c"
• num.diseases: ordinal variable, 0-4
• age: continuous; restricted cubic spline
• cholesterol: continuous (3 missings; use median); log(cholesterol+10)
• Allow treat × cholesterol interaction
• Program to fit the logistic model, test all effects in the design, estimate effects (e.g. inter-quartile-range odds ratios), and plot estimated transformations
require(rms)    # make new functions available
ddist <- datadist(cholesterol, treat, num.diseases, age)
# Could have used ddist <- datadist(data.frame.name)
options(datadist="ddist")         # defines data dist. to rms
cholesterol <- impute(cholesterol)
fit <- lrm(y ~ treat + scored(num.diseases) + rcs(age) +
           log(cholesterol+10) + treat:log(cholesterol+10))
describe(y ~ treat + scored(num.diseases) + rcs(age))
# or use describe(formula(fit)) for all variables used in fit
# describe function (in Hmisc) gets simple statistics on variables
# fit <- robcov(fit)   # Would make all statistics that follow
                       # use a robust covariance matrix
                       # would need x=T, y=T in lrm()
specs(fit)                        # Describe the design characteristics
anova(fit)
anova(fit, treat, cholesterol)    # Test these 2 by themselves
plot(anova(fit))                  # Summarize anova graphically
summary(fit)                      # Estimate effects using default ranges
plot(summary(fit))                # Graphical display of effects with C.I.
summary(fit, treat="b", age=60)   # Specify reference cell and adjustment val
summary(fit, age=c(50,70))        # Estimate effect of increasing age from
                                  # 50 to 70
summary(fit, age=c(50,60,70))     # Increase age from 50 to 70, adjust to
                                  # 60 when estimating effects of other
                                  # factors
# If had not defined datadist, would have to define ranges for all var.

# Estimate and test treatment (b-a) effect averaged over 3 cholesterols
contrast(fit, list(treat='b', cholesterol=c(150,200,250)),
              list(treat='a', cholesterol=c(150,200,250)),
         type='average')
# See the help file for contrast.rms for several examples of
# how to obtain joint tests of multiple contrasts.

p <- Predict(fit, age=seq(20,80,length=100), treat, conf.int=FALSE)
plot(p)                  # Plot relationship between age and log
                         # odds, separate curve for each treat,
                         # no C.I.
plot(p, ~ age | treat)   # Same but 2 panels
bplot(Predict(fit, age, cholesterol, np=50))
# 3-dimensional perspective plot for age,
# cholesterol, and log odds using default
# ranges for both variables
plot(Predict(fit, num.diseases, fun=function(x) 1/(1+exp(-x)),
             conf.int=.9), ylab="Prob")
# Plot estimated probabilities instead of
# log odds
# Again, if no datadist were defined, would have to tell plot all limits
logit <- predict(fit, expand.grid(treat="b", num.dis=1:3, age=c(20,40,60),
                                  cholesterol=seq(100,300,length=10)))
# Could also obtain list of predictor settings interactively
logit <- predict(fit, gendata(fit, nobs=12))

# Since age doesn't interact with anything, we can quickly and
# interactively try various transformations of age, taking the spline
# function of age as the gold standard.  We are seeking a linearizing
# transformation.
ag <- 10:80
logit <- predict(fit, expand.grid(treat="a", num.dis=0, age=ag,
                 cholesterol=median(cholesterol)), type="terms")[,"age"]
# Note: if age interacted with anything, this would be the age
# "main effect" ignoring interaction terms
# Could also use
# logit <- Predict(f, age=ag, ...)$yhat,
# which allows evaluation of the shape for any level of interacting
# factors.  When age does not interact with anything, the result from
# predict(f, ..., type="terms") would equal the result from
# Predict if all other terms were ignored
# Could also specify
# logit <- predict(fit, gendata(fit, age=ag, cholesterol=...))
# Un-mentioned variables set to reference values

plot(ag^.5,  logit)    # try square root vs. spline transform.
plot(ag^1.5, logit)    # try 1.5 power
latex(fit)             # invokes latex.lrm, creates fit.tex
# Draw a nomogram for the model fit
plot(nomogram(fit))
# Compose S function to evaluate linear predictors analytically
g <- Function(fit)
g(treat='b', cholesterol=260, age=50)
# Letting num.diseases default to reference value
To examine interactions in a simpler way, you may want to group age into tertiles:

age.tertile <- cut2(age, g=3)
# For automatic ranges later, add age.tertile to datadist input
fit <- lrm(y ~ age.tertile * rcs(cholesterol))
5.4 Other Functions

• supsmu: Friedman's "super smoother"
• lowess: Cleveland's scatterplot smoother
• glm: generalized linear models (see Glm)
• gam: generalized additive models
• rpart: like original CART with surrogate splits for missings, censored data extension (Atkinson & Therneau)
• validate.rpart: in rms; validates recursive partitioning with respect to certain accuracy indexes
• loess: multi-dimensional scatterplot smoother

f <- loess(y ~ age * pressure)
plot(f)                            # cross-sectional plots
ages      <- seq(20, 70, length=40)
pressures <- seq(80, 200, length=40)
pred <- predict(f, expand.grid(age=ages, pressure=pressures))
persp(ages, pressures, pred)       # 3-d plot
Chapter 6

Logistic Model Case Study: Survival of Titanic Passengers

Data source: The Titanic Passenger List edited by Michael A. Findlay, originally published in Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, and expanded with the help of the Internet community. The original html files were obtained from Philip Hind (1999) (http://atschool.eduweb.co.uk/phind). The dataset was compiled and interpreted by Thomas Cason. It is available in R, S-Plus, and Excel formats from biostat.mc.vanderbilt.edu/DataSets under the name titanic3.
6.1 Descriptive Statistics

require(rms)
getHdata(titanic3)              # get dataset from web site
units(titanic3$age) <- 'years'
# List of names of variables to analyze
v <- c('pclass','survived','age','sex','sibsp','parch')
latex(describe(titanic3[,v]), file='')
titanic3[, v]
6 Variables   1309 Observations

pclass
   n  missing  unique
1309        0       3
1st (323, 25%), 2nd (277, 21%), 3rd (709, 54%)

survived: Survived
   n  missing  unique   Sum   Mean
1309        0       2   500  0.382

age: Age [years]
   n  missing  unique   Mean  .05  .10  .25  .50  .75  .90  .95
1046      263      98  29.88    5   14   21   28   39   50   57
lowest : 0.1667 0.3333 0.4167 0.6667 0.7500
highest: 70.5000 71.0000 74.0000 76.0000 80.0000

sex
   n  missing  unique
1309        0       2
female (466, 36%), male (843, 64%)

sibsp: Number of Siblings/Spouses Aboard
   n  missing  unique    Mean
1309        0       7  0.4989
            0    1   2   3   4   5   8
Frequency 891  319  42  20  22   6   9
%          68   24   3   2   2   0   1

parch: Number of Parents/Children Aboard
   n  missing  unique   Mean
1309        0       8  0.385
             0    1    2   3   4   5   6   9
Frequency 1002  170  113   8   6   6   2   2
%           77   13    9   1   0   0   0   0
dd <- datadist(titanic3[,v])   # describe distributions of variables to rms
options(datadist='dd')
attach(titanic3[,v])
options(digits=2)
s <- summary(survived ~ age + sex + pclass + cut2(sibsp,0:3) + cut2(parch,0:3))
latex(s, file='', label='titanic-summary.table')   # create LaTeX code for Table
Table 6.1: Survived   N=1309

                                        N   survived
Age [years]
  [0.167,22.0)                        290     0.43
  [22.000,28.5)                       246     0.39
  [28.500,40.0)                       265     0.42
  [40.000,80.0]                       245     0.39
  Missing                             263     0.28
sex
  female                              466     0.73
  male                                843     0.19
pclass
  1st                                 323     0.62
  2nd                                 277     0.43
  3rd                                 709     0.26
Number of Siblings/Spouses Aboard
  0                                   891     0.35
  1                                   319     0.51
  2                                    42     0.45
  [3,8]                                57     0.16
Number of Parents/Children Aboard
  0                                  1002     0.34
  1                                   170     0.59
  2                                   113     0.50
  [3,9]                                24     0.29
Overall                              1309     0.38
plot(s, main='', subtitles=FALSE)   # convert table to dot plot (Figure 6.1)

Show 4-way relationships after collapsing levels. Suppress estimates based on < 25 passengers.
agec        <- ifelse(age < 21, 'child', 'adult')
sibsp.parch <- paste(ifelse(sibsp==0, 'no sib/spouse', 'sib/spouse'),
                     ifelse(parch==0, 'no parent/child', 'parent/child'),
                     sep='/')
g <- function(y) if(length(y) < 25) NA else mean(y)
s <- summarize(survived, llist(agec, sex, pclass, sibsp.parch), g)
# llist, summarize, Dotplot in Hmisc package
require(lattice)   # trellis for S-Plus
## To remove color background from strip labels do the following:
## ltheme <- canonical.theme(color=FALSE)
## ltheme$strip.background$col <- "transparent"
Figure 6.1: Univariable summaries of Titanic survival
## lattice.options(default.theme = ltheme)   ## set as default
i <- s$agec != 'NA'
print(Dotplot(pclass ~ survived | sibsp.parch*agec, groups=sex[i],
              data=s, subset=i, pch=c(1,4), col=c(1,1),
              xlab='Proportion Surviving',
              par.strip.text=list(cex=.6)))   # Figure 6.2
Key(.07)
6.2 Exploring Trends with Nonparametric Regression

# Figure 6.3
plsmo(age, survived, datadensity=TRUE)
plsmo(age, survived, group=sex, datadensity=TRUE)
plsmo(age, survived, group=pclass, datadensity=TRUE)
plsmo(age, survived, group=interaction(pclass,sex), datadensity=TRUE,
      lty=c(1,1,1,2,2,2))
# Figure 6.4
plsmo(age, survived, group=cut2(sibsp,0:2), datadensity=TRUE)
plsmo(age, survived, group=cut2(parch,0:2), datadensity=TRUE)
Figure 6.2: Multi-way summary of Titanic survival
6.3 Binary Logistic Model with Casewise Deletion of Missing Values

First fit a model that is saturated with respect to age, sex, and pclass. There is insufficient variation in sibsp and parch to fit complex interactions or nonlinearities.

f1 <- lrm(survived ~ sex*pclass*rcs(age,5) + rcs(age,5)*(sibsp + parch))
latex(anova(f1), file='', label='titanic-anova3')   # Table 6.2

The three-way interactions and parch are clearly insignificant, so drop them:

f <- lrm(survived ~ (sex + pclass + rcs(age,5))^2 + rcs(age,5)*sibsp)
print(f, latex=TRUE)

Logistic Regression Model

lrm(formula = survived ~ (sex + pclass + rcs(age, 5))^2 + rcs(age, 5) * sibsp)
Figure 6.3: Nonparametric regression (loess) estimates of the relationship between age and the probability of surviving the Titanic. The top left panel shows unstratified estimates. The top right panel depicts relationships stratified by sex. The bottom left and right panels show respectively estimates stratified by class and by the cross-classification of sex and class of the passenger. Tick marks are drawn at actual age values for each stratum.
Figure 6.4: Relationship between age and survival stratified by the number of siblings or spouses on board (left panel) or by the number of parents or children of the passenger on board (right panel)
Table
6.2:
Wald
Statisticsforsurvived
χ2
d.f.
P
sex(Factor+
Higher
Order
Factors)
187.15
15
<0.0001
AllInteractions
59.74
14
<0.0001
pclass
(Factor+
Higher
Order
Factors)
100.10
20
<0.0001
AllInteractions
46.51
18
0.0003
age(Factor+
Higher
Order
Factors)
56.20
32
0.0052
AllInteractions
34.57
28
0.1826
Nonlinear(Factor+
Higher
Order
Factors)
28.66
24
0.2331
sibsp
(Factor+
Higher
Order
Factors)
19.67
50.0014
AllInteractions
12.13
40.0164
parch(Factor+
Higher
Order
Factors)
3.51
50.6217
AllInteractions
3.51
40.4761
sex×
pclass
(Factor+
Higher
Order
Factors)
42.43
10
<0.0001
sex×
age(Factor+
Higher
Order
Factors)
15.89
12
0.1962
Nonlinear(Factor+
Higher
Order
Factors)
14.47
90.1066
NonlinearInteraction:f(A,B
)vs.AB
4.17
30.2441
pclass×
age(Factor+
Higher
Order
Factors)
13.47
16
0.6385
Nonlinear(Factor+
Higher
Order
Factors)
12.92
12
0.3749
NonlinearInteraction:f(A,B
)vs.AB
6.88
60.3324
age×
sibsp
(Factor+
Higher
Order
Factors)
12.13
40.0164
Nonlinear
1.76
30.6235
NonlinearInteraction:f(A,B
)vs.AB
1.76
30.6235
age×
parch(Factor+
Higher
Order
Factors)
3.51
40.4761
Nonlinear
1.80
30.6147
NonlinearInteraction:f(A,B
)vs.AB
1.80
30.6147
sex×
pclass×
age(Factor+
Higher
Order
Factors)
8.34
80.4006
Nonlinear
7.74
60.2581
TOTAL
NONLIN
EAR
28.66
24
0.2331
TOTAL
INTERACTIO
N75.61
30
<0.0001
TOTAL
NONLIN
EAR
+IN
TERACTIO
N79.49
33
<0.0001
TOTAL
241.93
39
<0.0001
Frequencies of Missing Values Due to Each Variable
survived   sex   pclass   age   sibsp
       0     0        0   263       0
ModelLikelihood
Discrim
ination
RankDiscrim
.Ratio
Test
Indexes
Indexes
Obs
1046
LRχ2
553.87
R2
0.555
C0.878
0619
d.f.
26g
2.427
Dxy
0.756
1427
Pr(>
χ2)<
0.0001
g r11.325
γ0.758
max|deriv|6×10
−6
g p0.365
τ a0.366
Brier
0.130
Coef
S.E.
WaldZ
Pr(>|Z|)
Intercept
3.3075
1.8427
1.79
0.0727
sex=
male
-1.1478
1.0878
-1.06
0.2914
pclass=
2nd
6.7309
3.9617
1.70
0.0893
pclass=
3rd
-1.6437
1.8299
-0.90
0.3691
age
0.0886
0.1346
0.66
0.5102
age’
-0.7410
0.6513
-1.14
0.2552
age”
4.9264
4.0047
1.23
0.2186
age”’
-6.6129
5.4100
-1.22
0.2216
sibsp
-1.0446
0.3441
-3.04
0.0024
sex=
male*pclass=
2nd
-0.7682
0.7083
-1.08
0.2781
sex=
male*pclass=
3rd
2.1520
0.6214
3.46
0.0005
sex=
male*age
-0.2191
0.0722
-3.04
0.0024
sex=
male*age’
1.0842
0.3886
2.79
0.0053
sex=
male*age”
-6.5578
2.6511
-2.47
0.0134
sex=
male*age”’
8.3716
3.8532
2.17
0.0298
pclass=
2nd*age
-0.5446
0.2653
-2.05
0.0401
pclass=
3rd*age
-0.1634
0.1308
-1.25
0.2118
pclass=
2nd*age’
1.9156
1.0189
1.88
0.0601
pclass=
3rd*age’
0.8205
0.6091
1.35
0.1780
pclass=
2nd*age”
-8.9545
5.5027
-1.63
0.1037
pclass=
3rd*age”
-5.4276
3.6475
-1.49
0.1367
pclass=
2nd*age”’
9.3926
6.9559
1.35
0.1769
pclass=
3rd*age”’
7.5403
4.8519
1.55
0.1202
age*sibsp
0.0357
0.0340
1.05
0.2933
age’
*sibsp
-0.0467
0.2213
-0.21
0.8330
age”
*sibsp
0.5574
1.6680
0.33
0.7382
age”’*sibsp
-1.1937
2.5711
-0.46
0.6425
latex(anova(f), file='', label='titanic-anova2')   # Table 6.3
Table
6.3:
Wald
Statisticsforsurvived
χ2
d.f.
P
sex(Factor+
Higher
Order
Factors)
199.42
7<
0.0001
AllInteractions
56.14
6<
0.0001
pclass
(Factor+
Higher
Order
Factors)
108.73
12
<0.0001
AllInteractions
42.83
10
<0.0001
age(Factor+
Higher
Order
Factors)
47.04
20
0.0006
AllInteractions
24.51
16
0.0789
Nonlinear(Factor+
Higher
Order
Factors)
22.72
15
0.0902
sibsp
(Factor+
Higher
Order
Factors)
19.95
50.0013
AllInteractions
10.99
40.0267
sex×
pclass
(Factor+
Higher
Order
Factors)
35.40
2<
0.0001
sex×
age(Factor+
Higher
Order
Factors)
10.08
40.0391
Nonlinear
8.17
30.0426
NonlinearInteraction:f(A,B
)vs.AB
8.17
30.0426
pclass×
age(Factor+
Higher
Order
Factors)
6.86
80.5516
Nonlinear
6.11
60.4113
NonlinearInteraction:f(A,B
)vs.AB
6.11
60.4113
age×
sibsp
(Factor+
Higher
Order
Factors)
10.99
40.0267
Nonlinear
1.81
30.6134
NonlinearInteraction:f(A,B
)vs.AB
1.81
30.6134
TOTAL
NONLIN
EAR
22.72
15
0.0902
TOTAL
INTERACTIO
N67.58
18
<0.0001
TOTAL
NONLIN
EAR
+IN
TERACTIO
N70.68
21
<0.0001
TOTAL
253.18
26
<0.0001
Show the many effects of the predictors.

p <- Predict(f, age, pclass, sex, fun=plogis)
plot(p, adj.subtitle=FALSE)                      # Fig. 6.5
# To take control of panel vs groups assignment use:
# plot(p, ~ age | sex, groups='pclass', adj.subtitle=FALSE)
plot(Predict(f, sibsp, age=c(10,15,20,50), conf.int=FALSE))   # Fig. 6.6
Note that children having many siblings apparently had lower survival. Married adults had slightly higher survival than unmarried ones.

Validate the model using the bootstrap to check overfitting, ignoring two very insignificant pooled
Figure 6.5: Effects of predictors on the probability of survival of Titanic passengers, estimated for zero siblings or spouses. Lines for females are black; males are drawn using grayscale.
Figure 6.6: Effect of number of siblings and spouses on the log odds of surviving, for third class males. Numbers next to lines are ages in years.
tests.
f <- update(f, x=TRUE, y=TRUE)
# x=TRUE, y=TRUE adds raw data to fit object so can bootstrap
set.seed(131)                 # so can replicate re-samples
latex(validate(f, B=80), digits=2, size='Ssize')
Index        Original   Training   Test      Optimism   Corrected    n
             Sample     Sample     Sample               Index
Dxy            0.76       0.77      0.74       0.03       0.72       80
R2             0.55       0.58      0.53       0.05       0.50       80
Intercept      0.00       0.00     −0.09       0.09      −0.09       80
Slope          1.00       1.00      0.86       0.14       0.86       80
Emax           0.00       0.00      0.05       0.05       0.05       80
D              0.53       0.56      0.49       0.07       0.46       80
U              0.00       0.00      0.01      −0.01       0.01       80
Q              0.53       0.56      0.49       0.08       0.45       80
B              0.13       0.12      0.13      −0.01       0.14       80
g              2.43       2.79      2.38       0.40       2.02       80
gp             0.37       0.37      0.35       0.02       0.35       80
cal <- calibrate(f, B=80)   # Figure 6.7
plot(cal)

n=1046   Mean absolute error=0.012   Mean squared error=0.00018
0.9 Quantile of absolute error=0.018

But there is a moderate problem with missing data.
6.4 Examining Missing Data Patterns

na.patterns <- naclus(titanic3)
require(rpart)              # Recursive partitioning package
who.na <- rpart(is.na(age) ~ sex + pclass + survived + sibsp + parch,
                minbucket=15)
naplot(na.patterns, 'na per var')
plot(na.patterns)
options(digits=5)
plot(who.na, margin=.1);  text(who.na)   # Figure 6.8
Figure 6.7: Bootstrap overfitting-corrected loess nonparametric calibration curve for the casewise-deletion model
plot(summary(is.na(age) ~ sex + pclass + survived + sibsp + parch))   # Figure 6.9
m <- lrm(is.na(age) ~ sex * pclass + survived + sibsp + parch)
print(m, latex=TRUE)
LogisticRegressionModel
lrm(formula=
is.na(age)
~sex*
pclass+survived
+sibsp+
parch)
ModelLikelihood
Discrim
ination
RankDiscrim
.Ratio
Test
Indexes
Indexes
Obs
1309
LRχ2
114.99
R2
0.133
C0.703
FALSE
1046
d.f.
8g
1.015
Dxy
0.406
TRUE
263
Pr(>
χ2)<
0.0001
g r2.759
γ0.452
max|deriv|5×10
−6
g p0.126
τ a0.131
Brier
0.148
CHAPTER
6.
LOGISTIC
MODELCASE
STUDY:SURVIV
ALOFTIT
ANIC
PASSENGERS
144
Figure 6.8: Patterns of missing data. Upper left panel shows the fraction of observations missing on each predictor. Upper right panel depicts a hierarchical cluster analysis of missingness combinations. The similarity measure shown on the Y-axis is the fraction of observations for which both variables are missing. Lower left panel shows the result of recursive partitioning for predicting is.na(age). The rpart function found only strong patterns according to passenger class.
Figure 6.9: Univariable descriptions of proportion of passengers with missing age
                            Coef     S.E.    Wald Z   Pr(>|Z|)
Intercept                 -2.2030   0.3641   -6.05    <0.0001
sex=male                   0.6440   0.3953    1.63     0.1033
pclass=2nd                -1.0079   0.6658   -1.51     0.1300
pclass=3rd                 1.6124   0.3596    4.48    <0.0001
survived                  -0.1806   0.1828   -0.99     0.3232
sibsp                      0.0435   0.0737    0.59     0.5548
parch                     -0.3526   0.1253   -2.81     0.0049
sex=male * pclass=2nd      0.1347   0.7545    0.18     0.8583
sex=male * pclass=3rd     -0.8563   0.4214   -2.03     0.0422
latex(anova(m), file='', label='titanic-anova.na')    # Table 6.4
pclass and parch are the important predictors of missing age.
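The same message is visible directly in the raw data; a minimal sketch (not part of the original code):

# Fraction of missing ages by passenger class and by number of parents/children aboard
with(titanic3, tapply(is.na(age), pclass, mean))
with(titanic3, tapply(is.na(age), parch, mean))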
Table 6.4: Wald Statistics for is.na(age)

                                               χ2    d.f.   P
sex (Factor+Higher Order Factors)             5.61    3     0.1324
  All Interactions                            5.58    2     0.0614
pclass (Factor+Higher Order Factors)         68.43    4    <0.0001
  All Interactions                            5.58    2     0.0614
survived                                      0.98    1     0.3232
sibsp                                         0.35    1     0.5548
parch                                         7.92    1     0.0049
sex × pclass (Factor+Higher Order Factors)    5.58    2     0.0614
TOTAL                                        82.90    8    <0.0001
6.5 Single Conditional Mean Imputation
First try: conditional mean imputation. The default spline transformation for age caused the distribution of imputed values to be much different from non-imputed ones; constrain to linear.
xtrans ← transcan(~ I(age) + sex + pclass + sibsp + parch,
                  imputed=TRUE, pl=FALSE, pr=FALSE, data=titanic3)
summary(xtrans)
transcan(x = ~I(age) + sex + pclass + sibsp + parch, imputed = TRUE,
    pr = FALSE, pl = FALSE, data = titanic3)

Iterations: 5

R2 achieved in predicting each variable:

  age   sex pclass  sibsp  parch
0.258 0.078  0.244  0.241  0.288

Adjusted R2:

  age   sex pclass  sibsp  parch
0.254 0.074  0.240  0.238  0.285

Coefficients of canonical variates for predicting each (row) variable
          age    sex  pclass  sibsp  parch
age             0.89   -6.13  -1.81  -2.77
sex      0.02           0.56  -0.10  -0.71
pclass  -0.08   0.26          -0.07  -0.25
sibsp   -0.02  -0.04   -0.07          0.87
parch   -0.03  -0.29   -0.22   0.75

Summary of imputed values

age
    n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
  263       0     24 28.41 16.76 21.66 26.17 28.04 28.04 42.92 42.92

lowest :  7.563  9.425 14.617 16.479 16.687
highest: 33.219 34.749 38.588 41.058 42.920

Starting estimates for imputed values:

age sex pclass sibsp parch
 28   2      3     0     0
# Look at mean imputed values by sex, pclass and observed means
# age.i is age, filled in with conditional mean estimates
age.i ← impute(xtrans, age, data=titanic3)
i ← is.imputed(age.i)
tapply(age.i[i], list(sex[i], pclass[i]), mean)
           1st    2nd    3rd
female  39.137 31.357 22.926
male    42.920 33.219 26.715
tapply(age, list(sex, pclass), mean, na.rm=TRUE)
           1st    2nd    3rd
female  37.038 27.499 22.185
male    41.029 30.815 25.962
dd ← datadist(dd, age.i)
f.si ← lrm(survived ~ (sex + pclass + rcs(age.i,5))^2 + rcs(age.i,5)*sibsp)
print(f.si, coefs=FALSE, latex=TRUE)
Logistic Regression Model

lrm(formula = survived ~ (sex + pclass + rcs(age.i, 5))^2 +
    rcs(age.i, 5) * sibsp)
Table 6.5: Wald Statistics for survived

                                                     χ2    d.f.   P
sex (Factor+Higher Order Factors)                  245.53    7   <0.0001
  All Interactions                                  52.80    6   <0.0001
pclass (Factor+Higher Order Factors)               112.02   12   <0.0001
  All Interactions                                  36.77   10    0.0001
age.i (Factor+Higher Order Factors)                 49.25   20    0.0003
  All Interactions                                  25.53   16    0.0610
  Nonlinear (Factor+Higher Order Factors)           19.86   15    0.1772
sibsp (Factor+Higher Order Factors)                 21.74    5    0.0006
  All Interactions                                  12.25    4    0.0156
sex × pclass (Factor+Higher Order Factors)          30.25    2   <0.0001
sex × age.i (Factor+Higher Order Factors)            8.95    4    0.0622
  Nonlinear                                          5.63    3    0.1308
  Nonlinear Interaction : f(A,B) vs. AB              5.63    3    0.1308
pclass × age.i (Factor+Higher Order Factors)         6.04    8    0.6427
  Nonlinear                                          5.44    6    0.4882
  Nonlinear Interaction : f(A,B) vs. AB              5.44    6    0.4882
age.i × sibsp (Factor+Higher Order Factors)         12.25    4    0.0156
  Nonlinear                                          2.04    3    0.5639
  Nonlinear Interaction : f(A,B) vs. AB              2.04    3    0.5639
TOTAL NONLINEAR                                     19.86   15    0.1772
TOTAL INTERACTION                                   66.83   18   <0.0001
TOTAL NONLINEAR + INTERACTION                       69.48   21   <0.0001
TOTAL                                              305.58   26   <0.0001
                         Model Likelihood      Discrimination   Rank Discrim.
                         Ratio Test            Indexes          Indexes
Obs          1309        LR χ2     641.01      R2     0.526     C      0.861
 0            809        d.f.          26      g      2.227     Dxy    0.722
 1            500        Pr(> χ2) <0.0001      gr     9.272     γ      0.728
max|deriv| 4×10⁻⁴                               gp     0.346     τa     0.341
                                                Brier  0.133
p1 ← Predict(f,    age,   pclass, sex, fun=plogis)
p2 ← Predict(f.si, age.i, pclass, sex, fun=plogis)
p  ← rbind('Casewise Deletion'=p1, 'Single Imputation'=p2,
           rename=c(age.i='age'))    # creates .set. variable
plot(p, ~ age | pclass*.set., groups='sex',
     ylab='Probability of Surviving', adj.subtitle=FALSE)    # Figure 6.10
latex(anova(f.si), file='', label='titanic-anova.si')    # Table 6.5
Figure 6.10: Predicted probability of survival for males from fit using casewise deletion (left panel) and single conditional mean imputation (right panel). sibsp is set to zero for these predicted values.
6.6 Multiple Imputation
The following uses aregImpute with predictive mean matching. By default, aregImpute does not transform age when it is being predicted from the other variables. Four knots are used to transform age when used to impute other variables (not needed here as no other missings were present).
set.seed(17)    # so can reproduce random aspects
mi ← aregImpute(~ age + sex + pclass + sibsp + parch + survived,
                n.impute=5, nk=4, pr=FALSE)
mi
Multiple Imputation using Bootstrap and PMM

aregImpute(formula = ~age + sex + pclass + sibsp + parch + survived,
    n.impute = 5, nk = 4, pr = FALSE)

n: 1309    p: 6    Imputations: 5    nk: 4

Number of NAs:
     age      sex   pclass    sibsp    parch survived
     263        0        0        0        0        0

         type d.f.
age         s    1
sex         c    1
pclass      c    2
sibsp       s    2
parch       s    2
survived    l    1

Transformation of Target Variables Forced to be Linear

R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
  age
0.344
# The 5 imputations for the first 10 passengers
# having missing age
mi$imputed$age[1:10, ]
    [,1] [,2] [,3] [,4] [,5]
16  28.5 60.0 32.5   46   71
38  26.0 26.0 29.0   49   51
41  47.0 62.0 47.0   55   42
47  45.0 47.0 17.0   46   39
60  39.0 27.0 42.0   39   18
70  39.0 39.0 23.0   30   41
71  29.0 42.0 47.0   47   61
75  46.0 28.5 32.5   17   36
81  47.0 48.0 30.0   55   40
107 62.0 50.0 23.0   33   17
Show the distribution of imputed (black) and actual ages (gray).
plot(mi)
Ecdf(age, add=TRUE, col='gray', lwd=2, subtitles=FALSE)    # Figure 6.11
Figure 6.11: Distributions of imputed and actual ages for the Titanic dataset
Fit logistic models for 5 completed datasets and print the ratio of imputation-corrected variances to average ordinary variances.
Table 6.6: Wald Statistics for survived

                                                     χ2    d.f.   P
sex (Factor+Higher Order Factors)                  236.24    7   <0.0001
  All Interactions                                  52.20    6   <0.0001
pclass (Factor+Higher Order Factors)               109.82   12   <0.0001
  All Interactions                                  37.09   10    0.0001
age (Factor+Higher Order Factors)                   49.09   20    0.0003
  All Interactions                                  22.73   16    0.1211
  Nonlinear (Factor+Higher Order Factors)           21.38   15    0.1251
sibsp (Factor+Higher Order Factors)                 23.68    5    0.0003
  All Interactions                                  11.00    4    0.0266
sex × pclass (Factor+Higher Order Factors)          33.48    2   <0.0001
sex × age (Factor+Higher Order Factors)              9.22    4    0.0559
  Nonlinear                                          7.18    3    0.0663
  Nonlinear Interaction : f(A,B) vs. AB              7.18    3    0.0663
pclass × age (Factor+Higher Order Factors)           3.66    8    0.8861
  Nonlinear                                          3.27    6    0.7739
  Nonlinear Interaction : f(A,B) vs. AB              3.27    6    0.7739
age × sibsp (Factor+Higher Order Factors)           11.00    4    0.0266
  Nonlinear                                          1.90    3    0.5925
  Nonlinear Interaction : f(A,B) vs. AB              1.90    3    0.5925
TOTAL NONLINEAR                                     21.38   15    0.1251
TOTAL INTERACTION                                   65.11   18   <0.0001
TOTAL NONLINEAR + INTERACTION                       68.89   21   <0.0001
TOTAL                                              302.90   26   <0.0001
f.mi ← fit.mult.impute(survived ~ (sex + pclass + rcs(age,5))^2 +
                         rcs(age,5)*sibsp,
                       lrm, mi, data=titanic3, pr=FALSE)
latex(anova(f.mi), file='', label='titanic-anova.mi')    # Table 6.6
The Wald χ2 for age is reduced by accounting for imputation but is increased by using patterns of association with survival status to impute missing age.
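A sketch for making this comparison numerically (it assumes the fits f, f.si, and f.mi created above; the grep on row names is an illustrative convenience, not part of the original notes):

fits <- list('Casewise Deletion'=f, 'Single Imputation'=f.si,
             'Multiple Imputation'=f.mi)
# Extract the rows of each Wald ANOVA table pertaining to age (or age.i)
lapply(fits, function(fit) {
  a <- anova(fit)
  a[grep('^age', rownames(a)), , drop=FALSE]
})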
Show estimated effects of age by classes.
p1 ← Predict(f.si, age.i, pclass, sex, fun=plogis)
p2 ← Predict(f.mi, age,   pclass, sex, fun=plogis)
p  ← rbind('Single Imputation'=p1, 'Multiple Imputation'=p2,
           rename=c(age.i='age'))
plot(p, ~ age | pclass*.set., groups='sex',
     ylab='Probability of Surviving', adj.subtitle=FALSE)    # Figure 6.12
Figure 6.12: Predicted probability of survival for males from fit using single conditional mean imputation again (left panel) and multiple random draw imputation (right panel). Both sets of predictions are for sibsp=0.
6.7 Summarizing the Fitted Model
Show odds ratios for changes in predictor values.
s ← summary(f.mi, age=c(1,30), sibsp=0:1)
# override default ranges for 3 variables
plot(s, log=TRUE, main='')    # Figure 6.13
Adjusted to: sex=male  pclass=3rd  age=28  sibsp=0
Figure 6.13: Odds ratios for some predictor settings
Get predicted values for certain types of passengers.
phat ← predict(f.mi,
               combos ← expand.grid(age=c(2,21,50), sex=levels(sex),
                                    pclass=levels(pclass), sibsp=0),
               type='fitted')
# Can also use Predict(f.mi, age=c(2,21,50), sex, pclass,
#                      sibsp=0, fun=plogis)$yhat
options(digits=1)
data.frame(combos, phat)
   age    sex pclass sibsp phat
1    2 female    1st     0 0.98
2   21 female    1st     0 0.98
3   50 female    1st     0 0.97
4    2   male    1st     0 0.88
5   21   male    1st     0 0.46
6   50   male    1st     0 0.27
7    2 female    2nd     0 1.00
8   21 female    2nd     0 0.90
9   50 female    2nd     0 0.83
10   2   male    2nd     0 1.00
11  21   male    2nd     0 0.08
12  50   male    2nd     0 0.04
13   2 female    3rd     0 0.84
14  21 female    3rd     0 0.57
15  50 female    3rd     0 0.37
16   2   male    3rd     0 0.89
17  21   male    3rd     0 0.14
18  50   male    3rd     0 0.05
options(digits=5)
We can also get predicted values by creating an S function that will evaluate the model on demand.
pred.logit ← Function(f.mi)
# Note: if don't define sibsp to pred.logit, defaults to 0
# normally just type the function name to see its body
latex(pred.logit, file='', type='Sinput', size='small')
pred.logit ← function(sex = "male", pclass = "3rd", age = 28, sibsp = 0)
{
  3.5810728 - 1.2694669 * (sex == "male") + 5.227106 * (pclass == "2nd") -
    1.7471648 * (pclass == "3rd") + 0.072213655 * age -
    0.00021294639 * pmax(age - 4, 0)^3 + 0.0015984839 * pmax(age - 21, 0)^3 -
    0.0023265999 * pmax(age - 28, 0)^3 + 0.0010212127 * pmax(age - 36.15, 0)^3 -
    8.0150336e-05 * pmax(age - 56, 0)^3 - 1.1339431 * sibsp +
    (sex == "male") * (-0.46284486 * (pclass == "2nd") +
                        2.0884806 * (pclass == "3rd")) +
    (sex == "male") * (-0.22398928 * age + 0.0003578076 * pmax(age - 4, 0)^3 -
      0.002354863 * pmax(age - 21, 0)^3 + 0.0032067241 * pmax(age - 28, 0)^3 -
      0.0013085171 * pmax(age - 36.15, 0)^3 + 9.8848428e-05 * pmax(age - 56, 0)^3) +
    (pclass == "2nd") * (-0.4600114 * age + 0.00052411339 * pmax(age - 4, 0)^3 -
      0.0025239553 * pmax(age - 21, 0)^3 + 0.0026577424 * pmax(age - 28, 0)^3 -
      0.00067164981 * pmax(age - 36.15, 0)^3 + 1.3749304e-05 * pmax(age - 56, 0)^3) +
    (pclass == "3rd") * (-0.14784979 * age + 0.00021831279 * pmax(age - 4, 0)^3 -
      0.001437761 * pmax(age - 21, 0)^3 + 0.0020012161 * pmax(age - 28, 0)^3 -
      0.00085968161 * pmax(age - 36.15, 0)^3 + 7.7913743e-05 * pmax(age - 56, 0)^3) +
    sibsp * (0.045169115 * age - 2.90579e-05 * pmax(age - 4, 0)^3 +
      0.00025289589 * pmax(age - 21, 0)^3 - 0.00048983359 * pmax(age - 28, 0)^3 +
      0.00032115845 * pmax(age - 36.15, 0)^3 - 5.5162848e-05 * pmax(age - 56, 0)^3)
}
# Run the newly created function
plogis(pred.logit(age=c(2,21,50), sex='male', pclass='3rd'))

[1] 0.886318 0.135294 0.054266
A nomogram could be used to obtain predicted values manually, but this is not feasible when so many interaction terms are present.
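For completeness, this is how such a nomogram would be requested from rms (a sketch only, left unevaluated here because the many interaction terms in f.mi make the resulting diagram impractical):

# plot(nomogram(f.mi, fun=plogis, funlabel='Probability of Surviving'))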
R/S-Plus Software Used

Package   Purpose                   Functions
Hmisc     Miscellaneous functions   summary, plsmo, naclus, llist, latex,
                                    summarize, Dotplot, describe, dataRep
Hmisc     Imputation                transcan, impute, fit.mult.impute, aregImpute
rms       Modeling                  datadist, lrm, rcs
          Model presentation        plot, summary, nomogram, Function
          Model validation          validate, calibrate
rpart*    Recursive partitioning    rpart

* Written by Atkinson & Therneau
Chapter 7

Case Study in Parametric Survival Modeling and Model Approximation
Data source: Random sample of 1000 patients from Phases I & II of SUPPORT (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment, funded by the Robert Wood Johnson Foundation). See [70]. The dataset is available from http://biostat.mc.vanderbilt.edu/DataSets.
• Analyze acute disease subset of SUPPORT (acute respiratory failure, multiple organ system failure, coma); the shape of the survival curves is different between acute and chronic disease categories
• Patients had to survive until day 3 of the study to qualify
• Baseline physiologic variables measured during day 3
7.1 Descriptive Statistics
Create a variable acute to flag categories of interest; print univariable descriptive statistics.
require(rms)
getHdata(support)    # Get data frame from web site
acute ← support$dzclass %in% c('ARF/MOSF', 'Coma')
latex(describe(support[acute,]), file='')
support[acute,]

35 Variables     537 Observations

age : Age
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    537       0    529  60.7 28.49 35.22 47.93 63.67 74.49 81.54 85.56
lowest : 18.04 18.41 19.76 20.30 20.31, highest: 91.62 91.82 91.93 92.74 95.51

death : Death at any time up to NDI date: 31DEC94
      n missing unique  Sum   Mean
    537       0      2  356 0.6629

sex
      n missing unique
    537       0      2
female (251, 47%), male (286, 53%)
hospdead : Death in Hospital
      n missing unique  Sum   Mean
    537       0      2  201 0.3743

slos : Days from Study Entry to Discharge
      n missing unique  Mean .05 .10 .25  .50  .75  .90  .95
    537       0     85 23.44 4.0 5.0 9.0 15.0 27.0 47.4 68.2
lowest :   3   4   5   6   7, highest: 145 164 202 236 241

d.time : Days of Follow-Up
      n missing unique  Mean .05 .10 .25 .50 .75  .90  .95
    537       0    340 446.1   4   6  16 182 724 1421 1742
lowest :    3    4    5    6    7, highest: 1977 1979 1982 2011 2022

dzgroup
      n missing unique
    537       0      3
ARF/MOSF w/Sepsis (391, 73%), Coma (60, 11%), MOSF w/Malig (86, 16%)

dzclass
      n missing unique
    537       0      2
ARF/MOSF (477, 89%), Coma (60, 11%)

num.co : number of comorbidities
      n missing unique  Mean
    537       0      7 1.525

            0   1   2  3  4  5  6
Frequency 111 196 133 51 31 10  5
%          21  36  25  9  6  2  1

edu : Years of Education
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
    411     126     22 12.03   7   8  10  12  14  16  17
lowest :  0  1  2  3  4, highest: 17 18 19 20 22

income
      n missing unique
    335     202      4
under $11k (158, 47%), $11-$25k (79, 24%), $25-$50k (63, 19%), >$50k (35, 10%)
scoma : SUPPORT Coma Score based on Glasgow D3
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
    537       0     11 19.24   0   0   0   0  37  55 100

            0  9 26 37 41 44 55 61 89 94 100
Frequency 301 50 44 19 17 43 11  6  8  6  32
%          56  9  8  4  3  8  2  1  1  1   6

charges : Hospital Charges
      n missing unique  Mean   .05   .10   .25   .50    .75    .90    .95
    517      20    516 86652 11075 15180 27389 51079 100904 205562 283411
lowest :   3448   4432   4574   5555   5849
highest: 504660 538323 543761 706577 740010

totcst : Total RCC cost
      n missing unique  Mean  .05  .10   .25   .50   .75    .90    .95
    471      66    471 46360 6359 8449 15412 29308 57028 108927 141569
lowest :      0   2071   2522   3191   3325
highest: 269057 269131 338955 357919 390460

totmcst : Total micro-cost
      n missing unique  Mean  .05  .10   .25   .50   .75   .90    .95
    331     206    328 39022 6131 8283 14415 26323 54102 87495 111920
lowest :      0   1562   2478   2626   3421
highest: 144234 154709 198047 234876 271467

avtisst : Average TISS, Days 3-25
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    536       1    205 29.83 12.46 14.50 19.62 28.00 39.00 47.17 50.37
lowest :  4.000  5.667  8.000  9.000  9.500
highest: 58.500 59.000 60.000 61.000 64.000

race
      n missing unique
    535       2      5

          white black asian other hispanic
Frequency   417    84     4     8       22
%            78    16     1     1        4

meanbp : Mean Arterial Blood Pressure Day 3
      n missing unique  Mean  .05  .10  .25  .50   .75   .90   .95
    537       0    109 83.28 41.8 49.0 59.0 73.0 111.0 124.4 135.0
lowest :   0  20  27  30  32, highest: 155 158 161 162 180
wblc : White Blood Cell Count Day 3
      n missing unique  Mean    .05    .10    .25     .50     .75     .90     .95
    532       5    241  14.1 0.8999 4.5000 7.9749 12.3984 18.1992 25.1891 30.1873
lowest :  0.05000  0.06999  0.09999  0.14999  0.19998
highest: 51.39844 58.19531 61.19531 79.39062 100.00000

hrt : Heart Rate Day 3
      n missing unique Mean .05 .10 .25 .50 .75 .90 .95
    537       0    111  105  51  60  75 111 126 140 155
lowest :   0  11  30  36  40, highest: 189 193 199 232 300

resp : Respiration Rate Day 3
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
    537       0     45 23.72   8  10  12  24  32  39  40
lowest :  0  4  6  7  8, highest: 48 49 52 60 64

temp : Temperature (celcius) Day 3
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    537       0     61 37.52 35.50 35.80 36.40 37.80 38.50 39.09 39.50
lowest : 32.50 34.00 34.09 34.90 35.00, highest: 40.20 40.59 40.90 41.00 41.20

pafi : PaO2/(.01*FiO2) Day 3
      n missing unique  Mean   .05    .10    .25    .50    .75    .90    .95
    500      37    357 227.2 86.99 105.08 137.88 202.56 290.00 390.49 433.31
lowest :  45.00  48.00  53.33  54.00  55.00
highest: 574.00 595.12 640.00 680.00 869.38

alb : Serum Albumin Day 3
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    346     191     34 2.668 1.700 1.900 2.225 2.600 3.100 3.400 3.800
lowest : 1.100 1.200 1.300 1.400 1.500, highest: 4.100 4.199 4.500 4.699 4.800

bili : Bilirubin Day 3
      n missing unique  Mean    .05    .10    .25    .50    .75    .90     .95
    386     151     88 2.678 0.3000 0.4000 0.6000 0.8999 2.0000 6.5996 13.1743
lowest :  0.09999  0.19998  0.29999  0.39996  0.50000
highest: 22.59766 30.00000 31.50000 35.00000 39.29688
crea : Serum creatinine Day 3
      n missing unique  Mean    .05    .10    .25    .50    .75    .90    .95
    537       0     84 2.232 0.6000 0.7000 0.8999 1.3999 2.5996 5.2395 7.3197
lowest :  0.3  0.4  0.5  0.6  0.7, highest: 10.4 10.6 11.2 11.6 11.8

sod : Serum sodium Day 3
      n missing unique  Mean .05 .10 .25 .50 .75 .90 .95
    537       0     38 138.1 129 131 134 137 142 147 150
lowest : 118 120 121 126 127, highest: 156 157 158 168 175

ph : Serum pH (arterial) Day 3
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    500      37     49 7.416 7.270 7.319 7.380 7.420 7.470 7.510 7.529
lowest : 6.960 6.989 7.069 7.119 7.130, highest: 7.560 7.569 7.590 7.600 7.659

glucose : Glucose Day 3
      n missing unique  Mean  .05  .10   .25   .50   .75   .90   .95
    297     240    179 167.7 76.0 89.0 106.0 141.0 200.0 292.4 347.2
lowest :  30  42  52  55  68, highest: 446 468 492 576 598

bun : BUN Day 3
      n missing unique  Mean  .05   .10   .25   .50   .75   .90    .95
    304     233    100 38.91 8.00 11.00 16.75 30.00 56.00 79.70 100.70
lowest :   1   3   4   5   6, highest: 123 124 125 128 146

urine : Urine Output Day 3
      n missing unique Mean  .05   .10    .25    .50    .75    .90    .95
    303     234    262 2095 20.3 364.0 1156.5 1870.0 2795.0 4008.6 4817.5
lowest :    0    5    8   15   20, highest: 6865 6920 7360 7560 7750

adlp : ADL Patient Day 3
      n missing unique  Mean
    104     433      8 1.577

           0  1  2  3  4  5  6  7
Frequency 51 19  7  6  4  7  8  2
%         49 18  7  6  4  7  8  2
adls : ADL Surrogate Day 3
      n missing unique Mean
    392     145      8 1.86

            0  1  2  3  4  5  6  7
Frequency 185 68 22 18 17 20 39 23
%          47 17  6  5  4  5 10  6

sfdm2
      n missing unique
    468      69      5
no(M2 and SIP pres) (134, 29%), adl>=4 (>=5 if sur) (78, 17%),
SIP>=30 (30, 6%), Coma or Intub (5, 1%), <2 mo. follow-up (221, 47%)

adlsc : Imputed ADL Calibrated to Surrogate
      n missing unique  Mean   .05   .10   .25   .50   .75   .90   .95
    537       0    144 2.119 0.000 0.000 0.000 1.839 3.375 6.000 6.000
lowest : 0.0000 0.4948 0.4948 1.0000 1.1667
highest: 5.7832 6.0000 6.3398 6.4658 7.0000
# Show patterns of missing data
plot(naclus(support[acute,]))    # Figure 7.1
Show associations between predictors using a general non-monotonic measure of dependence (Hoeffding D).

ac ← support[acute,]
ac$dzgroup ← ac$dzgroup[drop=TRUE]    # Remove unused levels
attach(ac)
vc ← varclus(~ age + sex + dzgroup + num.co + edu + income + scoma + race +
               meanbp + wblc + hrt + resp + temp + pafi + alb + bili +
               crea + sod + ph + glucose + bun + urine + adlsc,
             sim='hoeffding')
plot(vc)    # Figure 7.2
7.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model
dd ← datadist(ac)    # describe distributions of variables to rms
options(datadist='dd')
Figure 7.1: Cluster analysis showing which predictors tend to be missing on the same patients
Figure 7.2: Hierarchical clustering of potential predictors using Hoeffding D as a similarity measure. Categorical predictors are automatically expanded into dummy variables.
# Generate right-censored survival time variable
years ← d.time/365.25
units(years) ← 'Year'
S ← Surv(years, death)
# Show normal inverse Kaplan-Meier estimates
# stratified by dzgroup
survplot(survfit(S ~ dzgroup), conf='none', fun=qnorm, logt=TRUE)    # Figure 7.3
More stringent assessment of log-normal assumptions: check distribution of residuals from an adjusted model:
f ← psm(S ~ dzgroup + rcs(age,5) + rcs(meanbp,5),
        dist='lognormal', y=TRUE)    # dist='gaussian' for S+
r ← resid(f)
survplot(r, dzgroup, label.curve=FALSE)
survplot(r, age, label.curve=FALSE)
survplot(r, meanbp, label.curve=FALSE)
random.number ← runif(length(age))
survplot(r, random.number, label.curve=FALSE)    # Figure 7.4
Figure 7.3: Φ⁻¹(S_KM(t)) stratified by dzgroup. Linearity and semi-parallelism indicate a reasonable fit to the log-normal accelerated failure time model with respect to one predictor. The fit for dzgroup is not great but overall fit is good.
Remove from consideration predictors that are missing in >0.2 of the patients. Many of these were only collected for the second phase of SUPPORT.
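A sketch for listing which candidate predictors exceed this threshold (support and acute are as defined at the start of this chapter):

frac.na <- sapply(support[acute, ], function(x) mean(is.na(x)))
names(frac.na)[frac.na > 0.2]    # predictors missing in more than 20% of patients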
Of those variables to be included in the model, find which ones have enough potential predictive power to justify allowing for nonlinear relationships or multiple categories, which spend more d.f. For each variable compute Spearman ρ2 based on multiple linear regression of rank(x), rank(x)2 and the survival time,
Figure 7.4: Kaplan-Meier estimates of distributions of normalized, right-censored residuals from the fitted log-normal survival model. Residuals are stratified by important variables in the model (by quartiles of continuous variables), plus a random variable to depict the natural variability (in the lower right plot). Theoretical standard Gaussian distributions of residuals are shown with a thick solid line. The upper left plot is with respect to disease group.
truncating survival time at the shortest follow-up for survivors (356 days). This rids the data of censoring but creates many ties at 356 days.
shortest.follow.up ← min(d.time[death==0], na.rm=TRUE)
d.timet ← pmin(d.time, shortest.follow.up)
w ← spearman2(d.timet ~ age + num.co + scoma + meanbp + hrt + resp +
              temp + crea + sod + adlsc + wblc + pafi + ph + dzgroup +
              race, p=2)
plot(w, main='')    # Figure 7.5
Figure 7.5: Generalized Spearman ρ2 rank correlation between predictors and truncated survival time
A better approach is to use the complete information in the failure and censoring times by computing Somers' Dxy rank correlation allowing for censoring.

w ← rcorrcens(S ~ age + num.co + scoma + meanbp + hrt + resp + temp +
              crea + sod + adlsc + wblc + pafi + ph + dzgroup + race)
plot(w, main='')    # Figure 7.6
Figure 7.6: Somers' Dxy rank correlation between predictors and original survival time. For dzgroup or race, the correlation coefficient is the maximum correlation from using a dummy variable to represent the most frequent or one to represent the second most frequent category.
# Compute number of missing values per variable
sapply(llist(age, num.co, scoma, meanbp, hrt, resp, temp, crea, sod, adlsc,
             wblc, pafi, ph), function(x) sum(is.na(x)))
   age num.co  scoma meanbp    hrt   resp   temp   crea    sod  adlsc
     0      0      0      0      0      0      0      0      0      0
  wblc   pafi     ph
     5     37     37
# Can also do naplot(naclus(support[acute,]))
# Can also use the Hmisc naclus and naplot functions to do this
# Impute missing values with normal or modal values
wblc.i ← impute(wblc, 9)
pafi.i ← impute(pafi, 333.3)
ph.i   ← impute(ph, 7.4)
race2  ← race
levels(race2) ← list(white='white', other=levels(race)[-1])
race2[is.na(race2)] ← 'white'
dd ← datadist(dd, wblc.i, pafi.i, ph.i, race2)
Do a formal redundancy analysis using more than pairwise associations, and allow for non-monotonic transformations in predicting each predictor from all other predictors. This analysis requires missing values to be imputed so as to not greatly reduce the sample size.
redun(~ crea + age + sex + dzgroup + num.co + scoma + adlsc + race2 +
        meanbp + hrt + resp + temp + sod + wblc.i + pafi.i + ph.i, nk=4)
Redundancy Analysis

redun(formula = ~crea + age + sex + dzgroup + num.co + scoma + adlsc +
    race2 + meanbp + hrt + resp + temp + sod + wblc.i + pafi.i + ph.i,
    nk = 4)

n: 537   p: 16   nk: 4

Number of NAs: 0

Transformation of target variables forced to be linear

R2 cutoff: 0.9   Type: ordinary

R2 with which each variable can be predicted from all other variables:

  crea    age    sex dzgroup num.co  scoma  adlsc  race2
 0.133  0.246  0.132   0.451  0.147  0.418  0.153  0.151
meanbp    hrt   resp    temp    sod wblc.i pafi.i   ph.i
 0.178  0.258  0.131   0.197  0.135  0.093  0.143  0.171

No redundant variables
Better approach to gauging predictive potential and allocating d.f.:

• Allow all continuous variables to have the maximum number of knots entertained, in a log-normal survival model
• Must use imputation to avoid losing data
• Fit a "saturated" main effects model
• Makes full use of censored data
• Had to limit to 4 knots, force scoma to be linear, and omit ph.i to avoid singularity
k ← 4
f ← psm(S ~ rcs(age,k) + sex + dzgroup + pol(num.co,2) + scoma +
          pol(adlsc,2) + race + rcs(meanbp,k) + rcs(hrt,k) + rcs(resp,k) +
          rcs(temp,k) + rcs(crea,3) + rcs(sod,k) + rcs(wblc.i,k) +
          rcs(pafi.i,k), dist='lognormal')
plot(anova(f))    # Figure 7.7
Figure 7.7: Partial χ2 statistics for association of each predictor with response from saturated main effects model, penalized for d.f.
• Figure 7.7 properly blinds the analyst to the form of effects (tests of linearity).
• Fit a log-normal survival model with number of parameters corresponding to nonlinear effects determined from Figure 7.7. For the most promising predictors, five knots can be allocated, as there are fewer singularity problems once less promising predictors are simplified.
f ← psm(S ~ rcs(age,5) + sex + dzgroup + num.co + scoma + pol(adlsc,2) +
          race2 + rcs(meanbp,5) + rcs(hrt,3) + rcs(resp,3) + temp +
          rcs(crea,4) + sod + rcs(wblc.i,3) + rcs(pafi.i,4),
        dist='lognormal')    # 'gaussian' for S+
print(f, latex=TRUE)
Parametric Survival Model: Log Normal Distribution

psm(formula = S ~ rcs(age, 5) + sex + dzgroup + num.co + scoma +
    pol(adlsc, 2) + race2 + rcs(meanbp, 5) + rcs(hrt, 3) + rcs(resp, 3) +
    temp + rcs(crea, 4) + sod + rcs(wblc.i, 3) + rcs(pafi.i, 4),
    dist = "lognormal")

                      Model Likelihood      Discrimination
                      Ratio Test            Indexes
Obs         537       LR χ2     236.83      R2     0.594
Events      356       d.f.          30      g      1.959
σ        2.2308       Pr(> χ2) <0.0001      gr     7.095
                         Coef       S.E.    Wald Z   Pr(>|Z|)
(Intercept)            -5.6883     3.7851   -1.50     0.1329
age                    -0.0148     0.0309   -0.48     0.6322
age'                   -0.0412     0.1078   -0.38     0.7024
age''                   0.1670     0.5594    0.30     0.7653
age'''                 -0.2099     1.3707   -0.15     0.8783
sex=male               -0.0737     0.2181   -0.34     0.7354
dzgroup=Coma           -2.0676     0.4062   -5.09    <0.0001
dzgroup=MOSF w/Malig   -1.4664     0.3112   -4.71    <0.0001
num.co                 -0.1917     0.0858   -2.23     0.0255
scoma                  -0.0142     0.0044   -3.25     0.0011
adlsc                  -0.3735     0.1520   -2.46     0.0140
adlsc^2                 0.0442     0.0243    1.82     0.0691
race2=other             0.2979     0.2658    1.12     0.2624
meanbp                  0.0702     0.0210    3.34     0.0008
meanbp'                -0.3080     0.2261   -1.36     0.1732
meanbp''                0.8438     0.8556    0.99     0.3241
meanbp'''              -0.5715     0.7707   -0.74     0.4584
hrt                    -0.0171     0.0069   -2.46     0.0140
hrt'                    0.0064     0.0063    1.02     0.3090
resp                    0.0454     0.0230    1.97     0.0483
resp'                  -0.0851     0.0291   -2.93     0.0034
temp                    0.0523     0.0834    0.63     0.5308
crea                   -0.4585     0.6727   -0.68     0.4955
crea'                 -11.5176    19.0027   -0.61     0.5444
crea''                 21.9840    31.0113    0.71     0.4784
sod                     0.0044     0.0157    0.28     0.7792
wblc.i                  0.0746     0.0331    2.25     0.0242
wblc.i'                -0.0880     0.0377   -2.34     0.0195
pafi.i                  0.0169     0.0055    3.07     0.0021
pafi.i'                -0.0569     0.0239   -2.38     0.0173
pafi.i''                0.1088     0.0482    2.26     0.0239
Log(scale)              0.8024     0.0401   19.99    <0.0001
7.3 Summarizing the Fitted Model
• Plot the shape of the effect of each predictor on log survival time.
• All effects centered: can be placed on common scale
• Wald χ2 statistics, penalized for d.f., plotted in descending order
Table 7.2: Wald Statistics for S

                     χ2    d.f.   P
age                 15.99    4     0.0030
  Nonlinear          0.23    3     0.9722
sex                  0.11    1     0.7354
dzgroup             45.69    2    <0.0001
num.co               4.99    1     0.0255
scoma               10.58    1     0.0011
adlsc                8.28    2     0.0159
  Nonlinear          3.31    1     0.0691
race2                1.26    1     0.2624
meanbp              27.62    4    <0.0001
  Nonlinear         10.51    3     0.0147
hrt                 11.83    2     0.0027
  Nonlinear          1.04    1     0.3090
resp                11.10    2     0.0039
  Nonlinear          8.56    1     0.0034
temp                 0.39    1     0.5308
crea                33.63    3    <0.0001
  Nonlinear         21.27    2    <0.0001
sod                  0.08    1     0.7792
wblc.i               5.47    2     0.0649
  Nonlinear          5.46    1     0.0195
pafi.i              15.37    3     0.0015
  Nonlinear          6.97    2     0.0307
TOTAL NONLINEAR     60.48   14    <0.0001
TOTAL              261.47   30    <0.0001
plot(Predict(f, ref.zero=TRUE))    # Figure 7.8
latex(anova(f), file='', label='support-anovat')    # Table 7.2
plot(anova(f))    # Figure 7.9
options(digits=3)
plot(summary(f), log=TRUE, main='')    # Figure 7.10
7.4 Internal Validation of the Fitted Model Using the Bootstrap
Validate indexes describing the fitted model.
Figure 7.8: Effect of each predictor on log survival time. Predicted values have been centered so that predictions at predictor reference values are zero. Pointwise 0.95 confidence bands are also shown. As all Y-axes have the same scale, it is easy to see which predictors are strongest.
Figure 7.9: Contribution of variables in predicting survival time in log-normal model
Figure 7.10: Estimated survival time ratios for default settings of predictors. For example, when age changes from its lower quartile to the upper quartile (47.9y to 74.5y), median survival time decreases by more than half. Different shaded areas of bars indicate different confidence levels, ranging from 0.7 to 0.99.
# First add data to model fit so bootstrap can re-sample
# from the data
g ← update(f, x=TRUE, y=TRUE)
set.seed(717)
latex(validate(g, B=120, dxy=TRUE), digits=2, size='Ssize')
Index      Original  Training  Test     Optimism  Corrected   n
           Sample    Sample    Sample             Index
Dxy          0.49      0.51      0.46     0.05      0.43      120
R2           0.59      0.66      0.54     0.12      0.47      120
Intercept    0.00      0.00     -0.06     0.06     -0.06      120
Slope        1.00      1.00      0.90     0.10      0.90      120
D            0.48      0.55      0.42     0.13      0.35      120
U            0.00      0.00     -0.01     0.01     -0.01      120
Q            0.48      0.55      0.43     0.12      0.36      120
g            1.96      2.06      1.86     0.19      1.76      120
• From Dxy and R2 there is a moderate amount of overfitting.
• Slope shrinkage factor (0.90) is not troublesome.
• Almost unbiased estimate of future predictive discrimination on similar patients is the corrected Dxy of 0.43.
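The corrected column is simply the apparent (original-sample) index minus the bootstrap estimate of optimism; for example, for Dxy, using the rounded values displayed above (so the table's 0.43 is reproduced only approximately):

0.49 - 0.05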
Validate predicted 1-year survival probabilities. Use a smooth approach that does not require binning [71] and use less precise Kaplan-Meier estimates obtained by stratifying patients by the predicted probability, with at least 60 patients per group.
set.seed(717)
cal ← calibrate(g, u=1, B=120)
plot(cal, subtitles=FALSE)
cal ← calibrate(g, cmethod='KM', u=1, m=60, B=120, pr=FALSE)
plot(cal, add=TRUE)    # Figure 7.11
Figure 7.11: Bootstrap validation of calibration curve. Dots represent apparent calibration accuracy; × are bootstrap estimates corrected for overfitting, based on binning predicted survival probabilities and computing Kaplan-Meier estimates. Black curve is the estimated observed relationship using hare and the blue curve is the overfitting-corrected hare estimate. The gray-scale line depicts the ideal relationship.
7.5 Approximating the Full Model
The fitted log-normal model is perhaps too complex for routine use and for routine data collection. Let us develop a simplified model that can predict the predicted values of the full model with high accuracy (R2 = 0.96). The simplification is done using a fast backward stepdown against the full model predicted values.
Z ← predict(f)    # X*beta hat
a ← ols(Z ~ rcs(age,5) + sex + dzgroup + num.co + scoma + pol(adlsc,2) +
          race2 + rcs(meanbp,5) + rcs(hrt,3) + rcs(resp,3) + temp +
          rcs(crea,4) + sod + rcs(wblc.i,3) + rcs(pafi.i,4),
        sigma=1)
# sigma=1 is used to prevent sigma hat from being zero when
# R2=1.0 since we start out by approximating Z with all
# component variables
fastbw(a, aics=10000)    # fast backward stepdown
 Deleted Chi-Sq d.f. P      Residual d.f. P       AIC     R2
 sod       0.43   1  0.512      0.43   1  0.5117   -1.57  1.000
 sex       0.57   1  0.451      1.00   2  0.6073   -3.00  0.999
 temp      2.20   1  0.138      3.20   3  0.3621   -2.80  0.998
 race2     6.81   1  0.009     10.01   4  0.0402    2.01  0.994
 wblc.i   29.52   2  0.000     39.53   6  0.0000   27.53  0.976
 num.co   30.84   1  0.000     70.36   7  0.0000   56.36  0.957
 resp     54.18   2  0.000    124.55   9  0.0000  106.55  0.924
 adlsc    52.46   2  0.000    177.00  11  0.0000  155.00  0.892
 pafi.i   66.78   3  0.000    243.79  14  0.0000  215.79  0.851
 scoma    78.07   1  0.000    321.86  15  0.0000  291.86  0.803
 hrt      83.17   2  0.000    405.02  17  0.0000  371.02  0.752
 age      68.08   4  0.000    473.10  21  0.0000  431.10  0.710
 crea    314.47   3  0.000    787.57  24  0.0000  739.57  0.517
 meanbp  403.04   4  0.000   1190.61  28  0.0000 1134.61  0.270
 dzgroup 441.28   2  0.000   1631.89  30  0.0000 1571.89  0.000

Approximate Estimates after Deleting Factors

        Coef    S.E.   Wald Z  P
[1,] -0.5928 0.04315  -13.74   0

Factors in Final Model

None
f.approx ← ols(Z ~ dzgroup + rcs(meanbp,5) + rcs(crea,4) + rcs(age,5) +
                 rcs(hrt,3) + scoma + rcs(pafi.i,4) + pol(adlsc,2) +
                 rcs(resp,3), x=TRUE)
f.approx$stats
       n Model L.R.   d.f.    R2     g Sigma
 537.000   1688.225 23.000 0.957 1.915 0.370
• Estimate variance–covariance matrix of the coefficients of reduced model
• This covariance matrix does not include the scale parameter
V ← vcov(f, regcoef.only=TRUE)     # var(full model)
X ← g$x                            # full model design
x ← f.approx$x                     # approx. model design
w ← solve(t(x) %*% x, t(x)) %*% X  # contrast matrix
v ← w %*% V %*% t(w)
Compare variance estimates (diagonals of v) with variance estimates from a reduced model that is fitted against the actual outcomes.
f.sub ← psm(S ~ dzgroup + rcs(meanbp,5) + rcs(crea,4) + rcs(age,5) +
              rcs(hrt,3) + scoma + rcs(pafi.i,4) + pol(adlsc,2) +
              rcs(resp,3), dist='lognormal')    # 'gaussian' for S+
diag(v)/diag(vcov(f.sub, regcoef.only=TRUE))
           Intercept         dzgroup=Coma dzgroup=MOSF w/Malig
               0.981                0.979                0.979
              meanbp              meanbp'             meanbp''
               0.977                0.979                0.979
           meanbp'''                 crea                crea'
               0.979                0.979                0.979
              crea''                  age                 age'
               0.979                0.982                0.981
               age''               age'''                  hrt
               0.981                0.980                0.978
                hrt'                scoma               pafi.i
               0.976                0.979                0.980
             pafi.i'             pafi.i''                adlsc
               0.980                0.980                0.981
             adlsc^2                 resp                resp'
               0.981                0.978                0.977
Table 7.3: Wald Statistics for Z

                     χ2    d.f.   P
dzgroup             55.94    2    <0.0001
meanbp              29.87    4    <0.0001
  Nonlinear          9.84    3     0.0200
crea                39.04    3    <0.0001
  Nonlinear         24.37    2    <0.0001
age                 18.12    4     0.0012
  Nonlinear          0.34    3     0.9517
hrt                  9.87    2     0.0072
  Nonlinear          0.40    1     0.5289
scoma                9.85    1     0.0017
pafi.i              14.01    3     0.0029
  Nonlinear          6.66    2     0.0357
adlsc                9.71    2     0.0078
  Nonlinear          2.87    1     0.0904
resp                 9.65    2     0.0080
  Nonlinear          7.13    1     0.0076
TOTAL NONLINEAR     58.08   13    <0.0001
TOTAL              252.32   23    <0.0001
The ratios ranged from 0.978 to 0.982.
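A sketch for summarizing the spread of these ratios directly (v and f.sub are as computed above):

range(diag(v) / diag(vcov(f.sub, regcoef.only=TRUE)))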
f.approx$var ← v
latex(anova(f.approx, test='Chisq', ss=FALSE), file='',
      label='suport.anovaa')
Equation for simplified model:
# Typeset mathematical form of approximate model
latex(f.approx, file='')
E(Z) = Xβ, where

Xβ = -2.51
     - 1.94 {Coma} - 1.75 {MOSF w/Malig}
     + 0.068 meanbp - 3.08×10⁻⁵ (meanbp - 41.8)³₊ + 7.9×10⁻⁵ (meanbp - 61)³₊
       - 4.91×10⁻⁵ (meanbp - 73)³₊ + 2.61×10⁻⁶ (meanbp - 109)³₊ - 1.7×10⁻⁶ (meanbp - 135)³₊
     - 0.553 crea - 0.229 (crea - 0.6)³₊ + 0.45 (crea - 1.1)³₊ - 0.233 (crea - 1.94)³₊
       + 0.0131 (crea - 7.32)³₊
     - 0.0165 age - 1.13×10⁻⁵ (age - 28.5)³₊ + 4.05×10⁻⁵ (age - 49.5)³₊ - 2.15×10⁻⁵ (age - 63.7)³₊
       - 2.68×10⁻⁵ (age - 72.7)³₊ + 1.9×10⁻⁵ (age - 85.6)³₊
     - 0.0136 hrt + 6.09×10⁻⁷ (hrt - 60)³₊ - 1.68×10⁻⁶ (hrt - 111)³₊ + 1.07×10⁻⁶ (hrt - 140)³₊
     - 0.0135 scoma
     + 0.0161 pafi.i - 4.77×10⁻⁷ (pafi.i - 88)³₊ + 9.11×10⁻⁷ (pafi.i - 167)³₊
       - 5.02×10⁻⁷ (pafi.i - 276)³₊ + 6.76×10⁻⁸ (pafi.i - 426)³₊
     - 0.3693 adlsc + 0.0409 adlsc²
     + 0.0394 resp - 9.11×10⁻⁵ (resp - 10)³₊ + 0.000176 (resp - 24)³₊ - 8.5×10⁻⁵ (resp - 39)³₊

and {c} = 1 if subject is in group c, 0 otherwise; (x)₊ = x if x > 0, 0 otherwise.
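A quick sanity check of the approximation (a sketch; Z and f.approx are as computed above): the squared correlation between the full model's linear predictor and the approximate model's predictions should be near the R2 of roughly 0.96 quoted earlier.

cor(Z, predict(f.approx))^2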
Nomogram for predicting median and mean survival time, based on approximate model:
# Derive S functions that express mean and quantiles
# of survival time for specific linear predictors
# analytically
expected.surv ← Mean(f)
quantile.surv ← Quantile(f)
latex(expected.surv, file='', type='Sinput')

expected.surv ← function(lp = NULL, parms = 0.802352037606488)
{
    names(parms) ← NULL
    exp(lp + exp(2 * parms)/2)
}
latex(quantile.surv, file='', type='Sinput')

quantile.surv ← function(q = 0.5, lp = NULL, parms = 0.802352037606488)
{
    names(parms) ← NULL
    f ← function(lp, q, parms) lp + exp(parms) * qnorm(q)
    names(q) ← format(q)
    drop(exp(outer(lp, q, FUN = f, parms = parms)))
}
median.surv ← function(x) quantile.surv(lp=x)
# Improve variable labels for the nomogram
f.approx ← Newlabels(f.approx,
                     c('Disease Group', 'Mean Arterial BP', 'Creatinine',
                       'Age', 'Heart Rate', 'SUPPORT Coma Score',
                       'PaO2/(.01*FiO2)', 'ADL', 'Resp. Rate'))
nom ← nomogram(f.approx,
               pafi.i=c(0, 50, 100, 200, 300, 500, 600, 700, 800, 900),
               fun=list('Median Survival Time'=median.surv,
                        'Mean Survival Time'=expected.surv),
               fun.at=c(.1, .25, .5, 1, 2, 5, 10, 20, 40))
plot(nom, cex.var=1, cex.axis=.75, lmgp=.25)    # Figure 7.12
Figure 7.12: Nomogram for predicting median and mean survival time, based on approximation of full model
S Packages and Functions Used

Packages  Purpose                   Functions
Hmisc     Miscellaneous functions   describe, ecdf, naclus, varclus, llist,
                                    spearman2, describe, impute, latex
rms       Modeling                  datadist, psm, rcs, ols, fastbw
          Model presentation        survplot, Newlabels, Function,
                                    Mean, Quantile, nomogram
          Model validation          validate, calibrate

Note: All packages are available from CRAN
Bibliography
[1]D.G.Altman.Categorisingcontinuouscovariates
(letterto
theeditor).
BritJCancer,64:975,1991.
[26]
[2]D.G.Altman.Suboptimal
analysisusing‘optimal’cutpoints.BritJCancer,78:556–557,1998.
[26]
[3]D.G.Altman
andP.K.Andersen.Bootstrap
investigationofthestability
ofaCox
regressionmodel.StatMed,
8:771–783,1989.
[68]
[4]D.G.Altman,B.Lausen,W.Sauerbrei,andM.Schumacher.Dangersofusing‘optimal’cutpoints
intheevaluation
ofprognostic
factors.
JNat
CancerInst,86:829–835,1994.
[26,28]
[5]A.C.Atkinson.A
note
onthegeneralized
inform
ationcriterionforchoiceofamodel.Biometrika,67:413–418,
1980.
[39,67]
[6]P.C.Austin.Bootstrap
modelselectionhad
similarperform
ance
forselectingauthenticandnoisevariablescompared
tobackw
ardvariable
elim
ination:asimulationstudy.
JClin
Epi,61:1009–1017,2008.
[68]
[7]P.C.Austin,J.
V.Tu,andD.S.Lee.Logisticregressionhad
superiorperform
ance
compared
withregressiontrees
forpredictingin-hospital
mortalityin
patients
hospitalized
withheart
failure.JClin
Epi,63:1145–1155,2010.
[45]
[8]H.Belcher.Theconceptofresidual
confoundingin
regressionmodelsandsomeapplications.
StatMed,11:1747–
1758,1992.
[26]
[9]D.A.Belsley.ConditioningDiagnostics:
Collinearity
andWeakDatain
Regression.Wiley,New
York,
1991.
[74]
[10]D.A.Belsley,E.Kuh,andR.E.Welsch.
RegressionDiagnostics:
IdentifyingInfluential
DataandSources
of
Collinearity.Wiley,New
York,
1980.
[89,90]
[11]J.
K.Benedetti,P.Liu,H.N.Sather,J.
Seinfeld,andM.A.Epton.Effective
sample
size
fortestsofcensored
survival
data.
Biometrika,69:343–349,1982.
[69]
[12]K.Berhane,
M.Hauptm
ann,andB.Langholz.Usingtensorproduct
splines
inmodelingexposure–time–response
relationships:
Applicationto
theColoradoPlateau
Uranium
Minerscohort.
StatMed,27:5484–5496,2008.
[57]
[13]M.Blettner
andW.Sauerbrei.Influence
ofmodel-buildingstrategiesontheresultsofacase-controlstudy.
Stat
Med,12:1325–1338,1993.
[118]
[14]J.
G.Booth
andS.Sarkar.
Monte
Carlo
approxim
ationofbootstrap
variances.
Am
Statistician,52:354–357,1998.
[108]
[15]R.Bordley.
Statistical
decisionmakingwithoutmath.Chance,20(3):39–44,2007.
[8]
[16]L.Breim
an.Thelittlebootstrap
andother
methodsfordim
ensionalityselectionin
regression:X-fixedprediction
error.
JAm
StatAssoc,
87:738–754,1992.
[67,68,111]
[17]L.Breim
anandJ.
H.Friedman.Estim
atingoptimal
transformationsformultiple
regressionandcorrelation(w
ith
discussion).
JAm
StatAssoc,
80:580–619,1985.
[82]
[18]L.Breim
an,J.
H.Friedman,R.A.Olshen,andC.J.
Stone.
ClassificationandRegressionTrees.Wadsw
orth
and
Brooks/Cole,PacificGrove,CA,1984.
[43]
[19]W.M.BriggsandR.Zaretzki.Theskill
plot:
Agraphical
techniqueforevaluatingcontinuousdiagnostictests(w
ith
discussion).
Biometrics,64:250–261,2008.
[8]
[20]D.Brownstone.
Regressionstrategies.
InProceedingsofthe20th
Sym
posium
ontheInterfacebetweenComputer
Science
andStatistics,pages
74–79,Washington,DC,1988.American
Statistical
Association.
[118]
[21]P.Buettner,C.Garbe,
andI.Guggenmoos-Holzmann.Problemsin
definingcutoffpoints
ofcontinuousprognostic
factors:
Example
oftumor
thickn
essin
prim
arycutaneousmelanoma.
JClin
Epi,50:1201–1210,1997.
[26]
[22]J.
M.Cham
bersandT.J.
Hastie,
editors.
Statistical
Modelsin
S.Wadsw
orth
andBrooks/Cole,PacificGrove,CA,
1992.
[57]
[23]C.Chatfield.Avoidingstatisticalpitfalls
(withdiscussion).
Statistical
Sci,6:240–268,1991.
[90]
[24]C.Chatfield.Modeluncertainty,dataminingandstatisticalinference
(withdiscussion).
JRoy
StatSocA,158:419–
466,1995.
[65,118]
[25]S.Chatterjee
andB.Price.RegressionAnalysisby
Example.Wiley,New
York,
secondedition,1991.
[73]
[26]A.Ciampi,J.
Thiffault,J.-P.Nakache,andB.Asselain.Stratificationby
stepwiseregression,correspondence
analysis
andrecursivepartition.CompStatDataAnalysis,1986:185–204,1986.
[77]
[27]W.S.Cleveland.Robust
locally
weightedregressionandsm
oothingscatterplots.JAm
StatAssoc,
74:829–836,
1979.
[41]
[28]E.F.CookandL.Goldman.Asymmetricstratification:Anoutlineforan
efficientmethodforcontrollingconfounding
incohortstudies.
Am
JEpi,127:626–639,1988.
[45]
[29]J.
B.Copas.Regression,predictionandshrinkage(w
ithdiscussion).
JRoy
StatSocB,45:311–354,1983.
[71,72]
[30]J.
B.Copas.Cross-validationshrinkageofregressionpredictors.JRoy
StatSocB,49:175–183,1987.
[116]
[31]D.R.Cox.Regressionmodelsandlife-tables(w
ithdiscussion).
JRoy
StatSocB,34:187–220,1972.
[59]
[32]S.L.Crawford,S.L.Tennstedt,andJ.
B.McK
inlay.
Acomparisonofanalyticmethodsfornon-random
missingness
ofoutcomedata.
JClin
Epi,48:209–219,1995.
[94]
[33]N.J.
CrichtonandJ.
P.Hinde.
Correspondence
analysisas
ascreeningmethodforindicants
forclinical
diagnosis.
StatMed,8:1351–1362,1989.
[77]
[34]R.B.D’Agostino,A.J.
Belanger,E.W.Markson,M.Kelly-H
ayes,andP.A.Wolf.Developmentofhealthrisk
appraisalfunctionsin
thepresence
ofmultiple
indicators:
TheFramingham
Studynursinghomeinstitutionalization
model.StatMed,14:1757–1770,1995.
[74,76]
[35]C.E.Davis,J.
E.Hyde,
S.I.Bangdiwala,
andJ.
J.Nelson.Anexam
ple
ofdependencies
amongvariablesin
aconditional
logisticregression.In
S.MoolgavkarandR.Prentice,editors,
ModernStatistical
Methodsin
Chronic
Disease
Epidem
iology,
pages
140–147.Wiley,New
York,
1986.
[74]
[36]S.Derksen
andH.J.
Keselman.Backw
ard,forwardandstepwiseautomated
subsetselectionalgorithms:
Frequency
ofobtainingauthenticandnoisevariables.
British
JMathStatPsych,45:265–282,1992.
[66]
[37]T.F.Devlin
andB.J.
Weeks.Splinefunctionsforlogisticregressionmodeling.In
ProceedingsoftheEleventh
Annual
SASUsers
GroupInternational
Conference,pages
646–651,Cary,NC,1986.SASInstitute,Inc.
[35]
[38]W.D.Dupont.
Statistical
ModelingforBiomedical
Researchers.
Cam
bridgeUniversity
Press,Cam
bridge,
UK,
secondedition,2008.
[192]
[39]S.Durrleman
andR.Sim
on.Flexible
regressionmodelswithcubic
splines.StatMed,8:551–561,1989.
[38]
[40]B.Efron.
Estim
atingtheerrorrate
ofapredictionrule:Im
provementoncross-validation.
JAm
StatAssoc,
78:316–331,1983.
[112,115,116]
[41]B.EfronandR.Tibshirani.AnIntroductionto
theBootstrap.Chapman
andHall,New
York,
1993.
[115]
[42]B.EfronandR.Tibshirani.
Improvements
oncross-validation:The.632+
bootstrap
method.JAm
StatAssoc,
92:548–560,1997.
[115]
[43]J.
Fan
andR.A.Levine.
Toam
nio
ornotto
amnio:That
isthedecisionforBayes.Chance,20(3):26–32,2007.[8]
[44]D.FaraggiandR.Sim
on.
Asimulationstudyofcross-validationforselectingan
optimal
cutpointin
univariate
survival
analysis.StatMed,15:2203–2213,1996.
[26]
[45]J.
J.Faraw
ay.Thecost
ofdataanalysis.JCompGraphStat,1:213–229,1992.
[97,115,117]
[46]V.Fedorov,
F.Mannino,andR.Zhang.Consequencesofdichotomization.Pharm
Stat,8:50–61,2009.
[7,26]
[47]D.Freedman,W.Navidi,andS.Peters.
OntheIm
pactofVariableSelectionin
FittingRegressionEquations,pages
1–16.Lecture
Notesin
EconomicsandMathem
atical
Systems.Springer-Verlag,New
York,
1988.
[116]
[48]J.
H.Friedman.Avariablespan
smoother.TechnicalReport5,Lab
oratoryforComputationalStatistics,Departm
ent
ofStatistics,Stanford
University,1984.
[82]
[49]M.H.GailandR.M.Pfeiffer.Oncriteria
forevaluatingmodelsofabsolute
risk.Biostatistics,6(2):227–239,2005.
[8]
[50]T.GneitingandA.E.Raftery.Strictlyproper
scoringrules,prediction,andestimation.JAm
StatAssoc,
102:359–
378,2007.
[8]
[51]U.S.Govindarajulu,D.Spiegelman,S.W.Thurston,B.Ganguli,
andE.A.Eisen.Comparingsm
oothingtechniques
inCox
modelsforexposure-response
relationships.
StatMed,26:3735–3752,2007.
[39]
[52]P.M.GrambschandP.C.O’Brien.Theeff
ectsoftransformationsandprelim
inarytestsfornon-linearity
inregression.
StatMed,10:697–709,1991.
[48,66]
[53]R.J.
Gray.
Flexible
methodsforanalyzingsurvival
datausingsplines,withapplicationsto
breast
cancerprognosis.
JAm
StatAssoc,
87:942–951,1992.
[56,72]
[54]R.J.
Gray.
Spline-based
testsin
survival
analysis.Biometrics,50:640–652,1994.
[56]
[55]M.J.
Greenacre.Correspondence
analysis
ofmultivariate
categorical
databy
weightedleast-squares.Biometrika,
75:457–467,1988.
[77]
[56]S.Greenland.When
should
epidem
iologicregressionsuse
random
coeffi
cients?Biometrics,56:915–921,2000.[66,
92]
[57]F.E.Harrell.
TheLOGIST
Procedure.In
SUGISupplementalLibrary
Users
Guide,
pages
269–293.SASInstitute,
Inc.,Cary,NC,Version5edition,1986.
[67]
[58]F.E.Harrell,
K.L.Lee,R.M.Califf,D.B.Pryor,andR.A.Rosati.Regressionmodelingstrategiesforim
proved
prognostic
prediction.StatMed,3:143–152,1984.
[69]
[59]F.E.Harrell,
K.L.Lee,D.B.Matchar,andT.A.Reichert.Regressionmodelsforprognosticprediction:Advantages,
problems,andsuggestedsolutions.
CancerTreatmentReports,69:1071–1077,1985.
[69]
[60]F.E.Harrell,
K.L.Lee,andB.G.Pollo
ck.Regressionmodelsin
clinical
studies:
Determiningrelationshipsbetween
predictors
andresponse.JNat
CancerInst,80:1198–1202,1988.
[42]
[61]F.E.Harrell,
P.A.Margolis,S.Gove,K.E.Mason,E.K.Mulholland,D.Lehmann,L.Muhe,
S.Gatchalian,and
H.F.Eichenwald.Developmentofaclinicalpredictionmodelforan
ordinaloutcome:
TheWorldHealthOrganization
ARIMulticentreStudyofclinical
signsandetiologic
agents
ofpneumonia,sepsis,
andmeningitisin
younginfants.
StatMed,17:909–944,1998.
[72,95]
[62]T.Hastie,
R.Tibshirani,andJ.
H.Friedman.TheElements
ofStatistical
Learning.
Springer,New
York,
second
edition,2008.ISBN-10:0387848576;ISBN-13:978-0387848570.
[47]
[63]T.J.
HastieandR.J.
Tibshirani.
Generalized
AdditiveModels.
Chapman
&Hall/CRC,Boca
Raton,FL,1990.
ISBN
9780412343902.
[47]
[64]S.G.HilsenbeckandG.M.Clark.Practical
p-valueadjustmentforoptimallyselected
cutpoints.StatMed,15:103–
112,1996.
[26]
[65]W.Hoeff
ding.Anon-param
etrictest
ofindependence.AnnMathStat,19:546–557,1948.
[77]
[66]N.Hollander,W.Sauerbrei,andM.Schumacher.Confidence
intervalsfortheeff
ectofaprognostic
factor
after
selectionofan
‘optimal’cutpoint.
StatMed,23:1701–1713,2004.
[26,28]
[67]C.M.HurvichandC.L.Tsai.
Theim
pactofmodel
selectiononinference
inlinearregression.Am
Statistician,
44:214–217,1990.
[68]
[68]L.I.Iezzoni.Dim
ensionsofrisk.In
L.I.Iezzoni,editor,RiskAdjustmentforMeasuringHealthOutcomes,chapter2,
pages
29–118.FoundationoftheAmerican
CollegeofHealthcare
Executives,AnnArbor,MI,1994.
[13]
[69]J.
Karvanen
andF.E.Harrell.
Visualizingcovariates
inproportional
hazardsmodel.StatMed,28:1957–1966,2009.
PMID
19378282.
[100]
[70]W.A.Knaus,
F.E.Harrell,
J.Lynn,L.Goldman,R.S.Phillips,
A.F.Connors,
N.V.Daw
son,W.J.
Fulkerson,
R.M.Califf,N.Desbiens,
P.Layde,
R.K.Oye,P.E.Bellamy,
R.B.Hakim
,andD.P.Wagner.TheSUPPORT
prognostic
model:Objectiveestimates
ofsurvival
forseriouslyill
hospitalized
adults.
AnnIntMed,122:191–203,
1995.
[83,157]
[71]C.Kooperberg,C.J.
Stone,
andY.K.Truong.Hazardregression.JAm
StatAssoc,
90:78–94,1995.
[177]
[72]W.F.Kuhfeld.ThePRINQUALprocedure.In
SAS/STAT
9.2
User’sGuide.
SASPublishing,CaryNC,second
edition,2009.
[78]
[73]B.LausenandM.Schumacher.Evaluatingtheeff
ectofoptimized
cutoffvalues
intheassessmentofprognostic
factors.
CompStatDataAnalysis,1996.
[26]
[74]J.
F.Law
less
andK.Singhal.Efficientscreeningofnonnormal
regressionmodels.
Biometrics,34:318–327,1978.
[68]
[75]S.le
CessieandJ.
C.vanHouw
elingen.Ridgeestimatorsin
logisticregression.ApplStat,41:191–201,1992.
[72]
[76]A.Leclerc,D.Luce,F.Lert,
J.F.Chastang,andP.Logeay.
Correspondance
analysis
andlogisticmodelling:
Complementary
use
intheanalysisofahealthsurvey
amongnurses.StatMed,7:983–995,1988.
[77]
[77]S.Lee,J.
Z.Huang,andJ.
Hu.
Sparse
logisticprincipal
components
analysis
forbinarydata.
AnnApplStat,
4(3):1579–1601,2010.
[47]
[78]C.LengandH.Wang.Ongeneraladaptive
sparse
principalcomponentanalysis.JCompGraphStat,18(1):201–215,
2009.
[47]
[79]X.Luo,L.A.Stfanski,andD.D.Boos.
Tuningvariable
selectionproceduresby
addingnoise.
Technometrics,
48:165–175,2006.
[15]
[80]N.Mantel.Whystepdow
nproceduresin
variable
selection.Technometrics,12:621–625,1970.
[68]
[81]S.E.MaxwellandH.D.Delaney.Bivariate
mediansplitsandspuriousstatisticalsignificance.PsychologicalBulletin,
113:181–190,1993.
[26]
[82]G.P.McC
abe.
Principal
variables.
Technometrics,26:137–144,1984.
[76]
[83]G.Michailidis
andJ.
deLeeuw
.TheGifi
system
ofdescriptive
multivariate
analysis.Statistical
Sci,13:307–336,
1998.
[77,77]
[84] B. K. Moser and L. P. Coombs. Odds ratios for a continuous outcome variable without dichotomizing. Stat Med, 23:1843–1860, 2004. [26]
[85] R. H. Myers. Classical and Modern Regression with Applications. PWS-Kent, Boston, 1990. [73]
[86] N. J. D. Nagelkerke. A note on a general definition of the coefficient of determination. Biometrika, 78:691–692, 1991. [91]
[87] D. Paul, E. Bair, T. Hastie, and R. Tibshirani. "Preconditioning" for feature selection and regression in high-dimensional problems. Ann Stat, 36(4):1595–1619, 2008. [47]
[88] P. Peduzzi, J. Concato, A. R. Feinstein, and T. R. Holford. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epi, 48:1503–1510, 1995. [69]
[89] P. Peduzzi, J. Concato, E. Kemper, T. R. Holford, and A. R. Feinstein. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epi, 49:1373–1379, 1996. [69]
[90] N. Peek, D. G. T. Arts, R. J. Bosman, P. H. J. van der Voort, and N. F. de Keizer. External validation of prognostic models for critically ill patients required substantial sample sizes. J Clin Epi, 60:491–501, 2007. [92]
[91] M. J. Pencina, R. B. D'Agostino Sr, R. B. D'Agostino Jr, and R. S. Vasan. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med, 27:157–172, 2008. [92]
[92] P. Radchenko and G. M. James. Variable inclusion and shrinkage algorithms. J Am Stat Assoc, 103(483):1304–1315, 2008. [46]
[93] D. R. Ragland. Dichotomizing continuous outcome variables: Dependence of the magnitude of association and statistical power on the cutpoint. Epidemiology, 3:434–440, 1992. [26]
[94] B. M. Reilly and A. T. Evans. Translating clinical research into clinical practice: Impact of using prediction rules to make decisions. Ann Int Med, 144:201–209, 2006. [10]
[95] E. B. Roecker. Prediction error and its estimation for subset-selected models. Technometrics, 33:459–468, 1991. [67, 111]
[96] P. Royston, D. G. Altman, and W. Sauerbrei. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med, 25:127–141, 2006. [26]
[97] W. S. Sarle. The VARCLUS procedure. In SAS/STAT User's Guide, volume 2, chapter 43, pages 1641–1659. SAS Institute, Inc., Cary NC, fourth edition, 1990. [74, 76]
[98] W. Sauerbrei and M. Schumacher. A bootstrap resampling procedure for model building: Application to the Cox regression model. Stat Med, 11:2093–2109, 1992. [68, 112]
[99] G. Schulgen, B. Lausen, J. Olsen, and M. Schumacher. Outcome-oriented cutpoints in quantitative exposure. Am J Epi, 120:172–184, 1994. [26, 28]
[100] J. Shao. Linear model selection by cross-validation. J Am Stat Assoc, 88:486–494, 1993. [112]
[101] L. R. Smith, F. E. Harrell, and L. H. Muhlbaier. Problems and potentials in modeling survival. In M. L. Grady and H. A. Schwartz, editors, Medical Effectiveness Research Data Methods (Summary Report), AHCPR Pub. No. 92-0056, pages 151–159. US Dept. of Health and Human Services, Agency for Health Care Policy and Research, Rockville, MD, 1992. Available from http://biostat.mc.vanderbilt.edu/wiki/pub/Main/FrankHarrell/smi92pro.pdf. [69]
[102] I. Spence and R. F. Garrison. A remarkable scatterplot. Am Statistician, 47:12–19, 1993. [90]
[103] D. J. Spiegelhalter. Probabilistic prediction in patient management and clinical trials. Stat Med, 5:421–433, 1986. [71, 96, 115, 116]
[104] E. W. Steyerberg. Clinical Prediction Models. Springer, New York, 2009. [2, 192]
[105] E. W. Steyerberg, M. J. C. Eijkemans, F. E. Harrell, and J. D. F. Habbema. Prognostic modelling with logistic regression analysis: A comparison of selection and estimation methods in small data sets. Stat Med, 19:1059–1079, 2000. [46]
[106] C. J. Stone. Comment: Generalized additive models. Statistical Sci, 1:312–314, 1986. [38]
[107] C. J. Stone and C. Y. Koo. Additive splines in statistics. In Proceedings of the Statistical Computing Section ASA, pages 45–48, Washington, DC, 1985. [34, 39]
[108] S. Suissa and L. Blais. Binary regression with continuous outcomes. Stat Med, 14:247–255, 1995. [26]
[109] G. Sun, T. L. Shook, and G. L. Kay. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epi, 49:907–916, 1996. [70]
[110] R. Tibshirani. Regression shrinkage and selection via the lasso. J Roy Stat Soc B, 58:267–288, 1996. [46]
[111] J. C. van Houwelingen and S. le Cessie. Predictive value of statistical models. Stat Med, 9:1303–1325, 1990. [39, 72, 112, 116, 117]
[112] P. Verweij and H. C. van Houwelingen. Penalized likelihood in Cox regression. Stat Med, 13:2427–2436, 1994. [72]
[113] A. J. Vickers. Decision analysis for the evaluation of diagnostic tests, prediction models, and molecular markers. Am Statistician, 62(4):314–320, 2008. [8]
[114] E. Vittinghoff and C. E. McCulloch. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epi, 165:710–718, 2006. [69]
[115] H. Wainer. Finding what is not there through the unfortunate binning of results: The Mendel effect. Chance, 19(1):49–56, 2006. [26, 29]
[116] H. Wang and C. Leng. Unified LASSO estimation by least squares approximation. J Am Stat Assoc, 102:1039–1048, 2007. [46]
[117] S. Wang, B. Nan, N. Zhou, and J. Zhu. Hierarchically penalized Cox regression with grouped variables. Biometrika, 96(2):307–322, 2009. [46]
[118] Y. Wax. Collinearity diagnosis for a relative risk regression analysis: An application to assessment of diet-cancer relationship in epidemiological studies. Stat Med, 11:1273–1287, 1992. [74]
[119] J. Whitehead. Sample size calculations for ordered categorical data. Stat Med, 12:2257–2271, 1993. [69]
[120] R. E. Wiegand. Performance of using multiple stepwise algorithms for variable selection. Stat Med, 29:1647–1659, 2010. [68]
[121] D. M. Witten and R. Tibshirani. Testing significance of features by lassoed principal components. Ann Appl Stat, 2(3):986–1012, 2008. [47]
[122] S. N. Wood. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC, Boca Raton, FL, 2006. ISBN 9781584884743. [47]
[123] C. F. J. Wu. Jackknife, bootstrap and other resampling methods in regression analysis. Ann Stat, 14(4):1261–1350, 1986. [112]
[124] S. Xiong. Some notes on the nonnegative garrote. Technometrics, 2010. [47]
[125] J. Ye. On measuring and correcting the effects of data mining and model selection. J Am Stat Assoc, 93:120–131, 1998. [15]
[126] F. W. Young, Y. Takane, and J. de Leeuw. The principal components of mixed measurement level multivariate data: An alternating least squares method with optimal scaling features. Psychometrika, 43:279–281, 1978. [77]
[127] H. H. Zhang and W. Lu. Adaptive lasso for Cox's proportional hazards model. Biometrika, 94:691–703, 2007. [46]
[128] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. J Comp Graph Stat, 15:265–286, 2006. [47]
[129] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J Roy Stat Soc B, 67(2):301–320, 2005. [46]
R packages written by FE Harrell are freely available from CRAN.
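As a minimal sketch of obtaining the software (assuming a current R installation with internet access), the rms package can be installed and loaded in the usual way:

install.packages("rms")   # installs rms and its dependencies, including Hmisc
library(rms)              # loading rms also attaches Hmisc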
To obtain a 588-page book with detailed examples, case studies, and notes on the theory and applications of survival analysis, logistic regression, and linear models, order Regression Modeling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis by FE Harrell from Springer NY (2001). Steyerberg [104] and Dupont [38] are excellent accompanying texts for the book. To obtain a glossary of statistical terms and other handouts related to diagnostic and prognostic modeling, point your Web browser to biostat.mc.vanderbilt.edu/ClinStat.