michael friendly - yorku math and stats · spida 2004 2 michael friendly. review of linear models...
Review of Linear Models and Model Building Strategies

Lecture Outline

Introduction
• General Linear Models: Overview
• SAS macros for statistical graphics
Exploring and transforming data
• Transformations to symmetry
• Transformations to linearity - resistant lines, Box-Cox
• Dealing with heteroscedasticity
Fitting and understanding linear models
• Fitting linear models with SAS
• Model diagnosis: Leverage and Influence
• Visualizing influence: Partial regression plots
• Model selection

SPIDA 2004, Michael Friendly

Michael Friendly
York University, <friendly@yorku.ca>
SPIDA, June 2004

General Linear Models: Overview

Quantitative response: Linear models attempt to describe, predict, or explain a quantitative response variable ($y$) from one or more predictor/explanatory variables ($x$s).

One predictor: simple linear regression
• Q: How does $y$ change as $x$ changes?
• M: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

One-way ANOVA
• Q: How does the mean of $y$ change over levels of factor A?
• M: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$

[Figure: Prestige score vs. Education (years), with the point Medical_tech labeled]
[Figure: Prestige score by Job Type (Blue collar, White collar, Professional)]

Resources

SAS macro programs, from SAS System for Statistical Graphics (Friendly, 1991):
http://www.math.yorku.ca/SCS/sssg/
http://www.math.yorku.ca/SCS/sasmac/

SAS macro programs, from Visualizing Categorical Data (Friendly, 2000):
http://www.math.yorku.ca/SCS/vcd/

Workshop notes: http://www.math.yorku.ca/SCS/spida/lm/

General Linear Models: Overview

Homogeneity of regression
• Q: Is the relation of $y$ on $x$ the same for all levels of factor A?
• M: $y_{ijk} = \beta_0 + \alpha_i + \beta_j x_{ij} + (\alpha\beta)_{ij}(x_{ij}) + \varepsilon_{ijk}$

ANCOVA
• Q: How does the mean of $y$ change over levels of A, controlling (adjusting) for $y$ on $x$?
• M: $y_{ijk} = \beta_0 + \alpha_i + \beta x_{ij} + \varepsilon_{ijk}$

[Figure: two panels of Prestige score vs. Education (years), points and fitted lines by Job Type (Blue collar, White collar, Professional)]

General Linear Models: Overview

Two predictors: multiple linear regression
• Q: How does $y$ change as $x_1$ and $x_2$ change?
• M: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$

[Figure: Multiple regression response surface, Y = 2*X1 + -1*X2, shown from two viewpoints]

Two-way ANOVA
• Q: How does the mean of $y$ change over levels of factors A and B?
• M: $y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$

General Linear Models: Overview

In each case, we can represent the model in the same form:

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$

response = wtd. sum of predictors + residual
data = explained (partial summary) + unexplained

where the $x$s can be:
• Quantitative regressors: age, income, education
• Transformed regressors: $\sqrt{\text{age}}$, log(income)
• Polynomial regressors: age$^2$, age$^3$, ...
• Categorical predictors: treatment, sex, coded as "dummy" (0/1) variables
• Interaction regressors: treatment × age, sex × age
• Any combinations of the above ⇒ the General Linear Model

"Linear model" → linear in the parameters, $\beta_1, \beta_2, \beta_3, \ldots$, e.g.,

$y_i = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{age}^2 + \beta_3 \log(\text{income}) + \beta_4\,(\text{sex} = \text{'F'}) + \varepsilon_i$

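A small sketch may make "linear in the parameters" concrete: each observation becomes a row of regressors, and the prediction is a weighted sum of that row. The code is Python rather than SAS; the names `design_row` and `predict`, the use of natural logs, and the coefficient values are assumptions made only for this illustration.

```python
import math

def design_row(age, income, sex):
    """Regressors for: b0 + b1*age + b2*age^2 + b3*log(income) + b4*(sex == 'F')."""
    return [1.0, float(age), float(age) ** 2, math.log(income),
            1.0 if sex == "F" else 0.0]

def predict(beta, row):
    """Weighted sum of regressors: linear in beta, although not linear in age."""
    return sum(b * x for b, x in zip(beta, row))

beta = [10.0, 0.5, -0.005, 2.0, 1.5]     # hypothetical coefficients
row = design_row(age=40, income=math.e ** 2, sex="F")
y_hat = predict(beta, row)               # 10 + 20 - 8 + 4 + 1.5
```

The squared and logged terms are computed before fitting, so the fitting problem itself stays linear.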
General Linear Models: Overview

Response surface models
• Q: Is the relation of $y$ to $x_1$ and $x_2$ linear?
• M: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_{11} x_{i1}^2 + \beta_2 x_{i2} + \beta_{22} x_{i2}^2 + \varepsilon_i$

Models with interactions
• Q: Is the relation of $y$ to $x_1$ the same for all $x_2$?
• M: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12} x_{i1} x_{i2} + \varepsilon_i$

[Figure: Quadratic response surface, Y = 2*X1 + -1*X2 + 0.5*X1*X1 + -0.2*X2*X2]
[Figure: X1 * X2 interaction, Y = 2*X1 + -1*X2 + 0.20*X1*X2]

General Linear Models: Assumptions

In the GLM, for validity of inference, we must make some assumptions (the "Holy Trinity"):

Independence: all errors, $\varepsilon_i, \varepsilon_{i'}$, are statistically independent: $\mathrm{Cov}(\varepsilon_i, \varepsilon_{i'}) = 0$
• Key difference between ordinary GLMs vs. mixed models, HLMs, repeated/longitudinal models

Constant error variance: $\mathrm{Var}(\varepsilon_i) \equiv \mathrm{Var}(y_i \mid x_i) = \sigma^2$ = constant
• Key difference between ordinary GLMs vs. logistic regression, Poisson regression and the Generalized Linear Model

Normality of errors: The errors, $\varepsilon_i$, have a normal distribution
$\varepsilon_i \overset{\text{NID}}{\sim} N(0, \sigma^2)$, or equivalently $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$

In addition, we implicitly assume:
• Model specification: the correct $x$s have been included, each in the correct form
• Fixed $x$s: the predictor variables are measured w/o error (rarely true!)

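General Linear Models: Overview

All of these can be represented in matrix form,

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (1)$$

or,

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix} +
\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix} \qquad (2)$$

In all cases,
• Parameter estimates: $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
• Residuals = estimated errors = $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$
• Residual variance: $\mathrm{MSE} \equiv \widehat{\mathrm{Var}}(\varepsilon) = (\mathbf{e}^T\mathbf{e})/(n - p - 1)$
• Standard errors: $\widehat{\mathrm{Var}}(\hat{\boldsymbol{\beta}}) = \mathrm{MSE}\,(\mathbf{X}^T\mathbf{X})^{-1}$
• Parameter tests: $H_0: \beta_i = 0 \Rightarrow t = \hat{\beta}_i / \sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_i)} \sim t(n - p - 1)$
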
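The matrix formulas above can be traced step by step in a few lines of code. This is a minimal sketch in standard-library Python (not SAS), with hypothetical helper names `matmul`, `invert`, and `ols` and a tiny made-up data set; the Gauss-Jordan inverse is adequate only for small, well-conditioned problems.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def invert(m):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def ols(xs, y):
    """xs: rows of predictor values (constant column added here); y: responses."""
    X = [[1.0] + list(row) for row in xs]
    n, k = len(X), len(X[0])                      # k = p + 1
    Xt = [list(col) for col in zip(*X)]
    xtx_inv = invert(matmul(Xt, X))               # (X'X)^-1
    beta = [r[0] for r in matmul(xtx_inv, matmul(Xt, [[v] for v in y]))]
    resid = [yi - sum(b * v for b, v in zip(beta, row)) for yi, row in zip(y, X)]
    mse = sum(e * e for e in resid) / (n - k)     # e'e / (n - p - 1)
    se = [math.sqrt(mse * xtx_inv[j][j]) for j in range(k)]
    t = [b / s for b, s in zip(beta, se)]
    return beta, mse, se, t

beta, mse, se, t = ols([[0], [1], [2], [3], [4]], [1, 3, 2, 5, 4])
```

For these five points the fit is intercept 1.4 and slope 0.8, with MSE = 3.6/3 = 1.2.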
Sidebar: Using SAS macros

E.g., the SYMBOX macro is defined with the following arguments:

symbox.sas ···
    %macro symbox(
        data=_last_,            /* name of input data set                */
        var=,                   /* name(s) of the variable(s) to examine */
        id=,                    /* name of ID variable                   */
        out=symout,             /* name of output data set               */
        orient=V,               /* orientation of boxplots: H or V       */
        powers=-1 -0.5 0 .5 1,  /* list of powers to consider            */
        name=symbox             /* name for graph in graphics catalog    */
        );

Typical use:
    %symbox(data=baseball,
        var=Salary Runs,        /* analysis variables */
        id=name,                /* player ID variable */
        powers=-1 -.5 0 .5 1 2);

Sidebar: Using SAS macros

SAS macros are high-level, general programs consisting of a series of DATA steps and PROC steps.

Keyword arguments substitute your data names, variable names, and options for the named macro parameters.

Use as:
    %macname(data=dataset, var=variables, ...);
e.g.,
    %boxplot(data=nations, var=imr, class=region, id=nation);

Most arguments have default values (e.g., data=_last_).

All SSSG and VCD macros have internal and/or online documentation,
http://www.math.yorku.ca/SCS/sssg/
http://www.math.yorku.ca/SCS/sasmac/
http://www.math.yorku.ca/SCS/vcd/

Macros can be installed in directories automatically searched by SAS. Put the following options statement in your AUTOEXEC.SAS file:
    options sasautos=('c:\sasuser\macros' sasautos);

Multivariate displays
BIPLOT     Generalized biplot of observations and variables
CORRGRAM   Draw a correlogram
OUTLIER    Robust multivariate outlier detection
SCATMAT    Scatterplot matrix
STARS      Star plot for multivariate data

Linear and Generalized Linear Models
ADDVAR     Added variable plots for logistic regression
BOXCOX     Power transformations by Box-Cox method (PROC REG)
BOXGLM     Power transformations by Box-Cox method (PROC GLM)
BOXTID     Power transformations by Box-Tidwell method
CPPLOT     Plots of Mallows' C(p) and related statistics for model selection
DUMMY      Construct dummy variables for regression models
HALFNORM   Half-normal plots for generalized linear models
INFLGLIM   Influence plots for generalized linear models
INFLOGIS   Influence plots for logistic regression
INFLPLOT   Influence plot for regression models
INTERACT   Create interaction variables
MEANPLOT   Plot means for factorial designs
PARTIAL    Partial residual and partial regression plots
ROBUST     Robust fitting for linear models (REG, GLM, LOGISTIC) via IRLS
RSQDELTA   Compute R-square change and F-statistics in regression
SPRDPLOT   Spread-level plot to find transformation to equalize variances

SSSG Macros & programs

Macros available at http://www.math.yorku.ca/SCS/sssg/, .../SCS/sasmac/

Univariate displays
BOXPLOT    Box-and-whisker plots
DATACHK    Basic data screening for numeric variables
NQPLOT     Normal QQ plot
SYMBOX     Boxplots for transformations to symmetry
SYMPLOT    Diagnostic plots for transformations to symmetry

Bivariate displays
CONTOUR    Plot elliptical contours for X, Y data
LOWESS     Locally weighted scatterplot smoother
RESLINE    Resistant line for bivariate data
SUNPLOT    Sunflower plot for X-Y data

Transformations to symmetry

Transformations have several uses in data analysis, including:
• making a distribution more symmetric.
• equalizing variability (spreads) across groups.
• making the relationship between two variables linear.

These goals often coincide: a transformation that achieves one goal will often help for another (but not always).

Some tools (Friendly, 1991):
• Understanding the ladder of powers.
• SYMBOX macro - boxplots of data transformed to various powers.
• SYMPLOT macro - various plots designed to assess symmetry.
  POWER plot: line with slope $b$ ⇒ $y \to y^p$, where $p = 1 - b$ (rounded to 0.5).
• BOXCOX macro - for regression model, transform $y \to y^p$ to minimize MSE (or maximum likelihood); influence plot shows impact of observations on choice of power (Box and Cox, 1964).
• BOXGLM macro - for GLM (anova/regression), transform $y \to y^p$ to minimize MSE (or max. likelihood)
• BOXTID macro - for regression, transform $x_i \to x_i^p$ (Box and Tidwell, 1962).

Part 2: Exploring and transforming data

• Transformations to symmetry
• Transformations to linearity - resistant lines, Box-Cox
• Dealing with heteroscedasticity

For simplicity:
• usually use only simple integer and half-integer powers (sometimes, $p = 1/3 \to \sqrt[3]{x}$)
• scale the values to keep results simple.

Power   Transformation    Re-expression
  3     Cube              $x^3/100$
  2     Square            $x^2/10$
  1     NONE (Raw)        $x$
 1/2    Square root       $\sqrt{x}$
  0     Log               $\log_{10} x$
-1/2    Reciprocal root   $-10/\sqrt{x}$
 -1     Reciprocal        $-100/x$

Transformations: Ladder of Powers

Power transformations are of the form $x \to x^p$.

A useful family of transformations is the ladder of powers (Tukey, 1977), defined as $x \to t_p(x)$,

$$t_p(x) = \begin{cases} \dfrac{x^p - 1}{p}, & p \ne 0 \\ \log_{10} x, & p = 0 \end{cases} \qquad (3)$$

Key ideas:
• $\log(x)$ plays the role of $x^0$ in the family: halfway between $-1/2$ ($-1/\sqrt{x}$) and $1/2$ ($\sqrt{x}$)
• $1/p$ → keeps the order of $x$ the same for $p < 0$, e.g., $p = -1 \to -1/x$

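Definition (3) translates directly into a short function. The sketch below is in Python rather than SAS, and the name `ladder` and the sample values are assumptions for illustration. It also checks the order-preserving property that the division by $p$ is there to provide.

```python
import math

def ladder(x, p):
    """Ladder of powers: (x**p - 1)/p for p != 0, log10(x) for p == 0."""
    if x <= 0:
        raise ValueError("the ladder of powers requires x > 0")
    if p == 0:
        return math.log10(x)
    return (x ** p - 1) / p

values = [1.0, 4.0, 9.0, 100.0]
for p in (-1, -0.5, 0, 0.5, 1, 2):
    transformed = [ladder(v, p) for v in values]
    # Dividing by p keeps the transformed values in the original order,
    # even for negative powers.
    assert transformed == sorted(transformed)
```

For example, `ladder(4, 0.5)` gives (2 - 1)/0.5 = 2.0, and `ladder(2, -1)` gives 0.5, which is $-1/x$ shifted by a constant.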
Ladder of Powers: Example

SLID (Ontario subset) - Wages and Salaries

SYMBOX macro - transforms a variable to a list of powers, shows standardized scores using the BOXPLOT macro

    title 'SLID: wgsal42c, Wages and salaries, 1994';
    %symbox(data=slid.pontario,
        var=wgsal42c,               /* variable       */
        powers=-1 -0.5 0 0.5 1);    /* list of powers */

[Figure: SLID: wgsal42c, Wages and salaries, 1994. Boxplots of Wages and salaries - 1994 (Std.) by Power: -1/X, -1/Sqrt, Log, Sqrt, Raw]

Ladder of Powers: Properties

• Preserve the order of data values. Larger data values on the original scale will be larger on the transformed scale. (That's why negative powers have their sign reversed.)
• They change the spacing of the data values.
  Powers $p < 1$, such as $\sqrt{x}$ and $\log x$, compress values in the upper tail of the distribution relative to low values;
  Powers $p > 1$, such as $x^2$, have the opposite effect, expanding the spacing of values in the upper end relative to the lower end.
• Shape of the distribution changes systematically with $p$. If $\sqrt{x}$ pulls in the upper tail, $\log x$ will do so more strongly, and negative powers will be stronger still.
• Requires all $x > 0$. If some values are negative, add a constant first, i.e., $x \to t_p(x + c)$
• Has an effect only if the range of $x$ values is moderately large.

Ladder of Powers: Example

SLID (Ontario subset) - Hourly wage

SYMBOX macro - transforms a variable to a list of powers, shows standardized scores using the BOXPLOT macro

    title 'SLID: cmphw28c, Comp. hourly wage, 1994';
    %symbox(data=slid.pontario, var=cmphw28c,
        powers=-1 -0.5 0 0.5 1);

[Figure: SLID: cmphw28c, Comp. hourly wage, 1994. Boxplots of Comp. hrly wage all jobs - 1994 (Std.) by Power: -1/X, -1/Sqrt, Log, Sqrt, Raw]

[Figure: SLID: wgsal42c, Wages and salaries, 1994. Boxplots of Wages and salaries - 1994 (Std.) by Power: -1/X, -1/Sqrt, Log, Sqrt, Raw]

wgsal42c → log(wgsal42c) looks best.

Transformations to linearity

Brain weight and body weight of mammals:
• Marginal boxplots show that both variables are highly skewed
• Most points bunched up at origin
• Relation is strongly non-linear
• Log transform removes both problems

[Figure: Brain weight and body weight of mammals; Brain weight vs. Body weight]
[Figure: Brain weight and body weight of mammals; log10(Brain weight) vs. log10(Body weight)]

[Figure: SLID: cmphw28c, Comp. hourly wage, 1994. Boxplots of Comp. hrly wage all jobs - 1994 (Std.) by Power: -1/X, -1/Sqrt, Log, Sqrt, Raw]

cmphw28c → $\sqrt{\text{cmphw28c}}$ or log(cmphw28c) looks OK.

See http://www.math.yorku.ca/SCS/sasmac/symbox.html

Transformations to linearity

Tukey's arrow rule and the double ladder of powers:
• Draw an arrow in the direction of the "bulge".
• The arrow points in the direction to move along the ladder of powers for $x$ or $y$ (or both).

[Figure: four bulge patterns, panels (a) to (d), with ladders of powers marked on each axis: ... log, sqrt, raw, Y², Y³ ... and ... log, sqrt, X², X³ ...]

Transformations to linearity

If $y$ is a response ("dependent") and $x$ is a predictor, we often want to fit
$y = f(x) + \text{residual}$

Generally we prefer a "simple" $f(x)$, like a linear function, $y = a + bx + \text{residual}$.

If the relation between $y$ and $x$ is substantially non-linear, we have two choices:
• Bend the model: Try fitting a quadratic, cubic, or other polynomial (easy: linear in parameters), or else a non-linear model, e.g., $y = a \exp(bx)$ (harder).
• Unbend the data: Transform either $y \to y'$, or $x \to x'$ (or both), so that the relation is linear, $y' = a + bx' + \text{residual}$

Ladder of powers and Tukey's "arrow rule" indicate which direction to go.
A "ratio of slopes" table pinpoints good power transformations.

Ratio of slopes

The curvature of the data can be measured by the ratio of slopes

$$r = \frac{\text{upper slope}}{\text{lower slope}} = \frac{(y_H - y_M)/(x_H - x_M)}{(y_M - y_L)/(x_M - x_L)}$$

e.g.,
            X          Y      half-slope    ratio
High    4600.627   5157.515
                                 1.1058
Mid       10.000     80.996                 0.1391
                                 7.9465
Low        0.122      2.500

[Figure: three panels of Y vs. X (Low, Middle, High) illustrating r < 1, r = 1, and r > 1]

A linear relation ⇒ $r \approx 1$ (or $\log r \approx 0$)

Transformations to linearity

Resistant lines and the ratio of slopes table (Tukey, 1977):
• Least squares regression can give misleading results with highly skewed data or with outliers
• A resistant line often does better with ill-behaved data
• Summary values: medians of thirds, dividing by X-values (but neither end-third can cover more than 1/2 the range)

Summary Values
            X          Y       n
Low        0.122      2.500   21
Mid       10.000     80.996   39
High    4600.627   5157.515    2

For this data, values of $r \approx 1$ tend to run along the diagonal; log-log is the best combination.

----- Ratio of Slopes table ------
Rows are powers of X, columns are powers of Y

          -1.0     -0.5      log      sqrt       raw        2.0
-1.0     2.544   15.127   96.908   687.070  5247.745   329241.7
-0.5     0.265    1.575   10.089    71.527   546.314   34275.54
log      0.023    0.134    0.858     6.085    46.477   2915.947
sqrt     0.001    0.008    0.052     0.368     2.813    176.504
raw      0.000    0.000    0.003     0.018     0.139      8.731
2.0      0.000    0.000    0.000     0.000     0.000      0.019

------- 5 Best powers -------
Power of X   Power of Y   Slope Ratio   log Ratio
log          log            0.858        -0.066
-0.5         -0.5           1.575         0.197
-1.0         -1.0           2.544         0.405
sqrt         sqrt           0.368        -0.434
sqrt         raw            2.813         0.449

See http://www.math.yorku.ca/SCS/sasmac/resline.html

The effect of any transformation, $x \to x^p$, $y \to y^q$, can be judged by the effect it has on the ratio of slopes,

$$r(p, q) = \frac{(y_H^q - y_M^q)/(x_H^p - x_M^p)}{(y_M^q - y_L^q)/(x_M^p - x_L^p)}$$

The resline macro calculates the ratio of slopes for a set of powers of $x$ and of $y$

    %resline(data=brains,
        x=bodywt,
        y=brainwt,
        id=mammal);

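The ratio-of-slopes calculation is small enough to check by hand or in a few lines of code. The following is a sketch in Python, not the resline macro itself; the function names `power` and `slope_ratio` are hypothetical, and the three summary points are the Low/Mid/High values quoted in these slides for the mammals data.

```python
import math

def power(v, p):
    """Re-express v on the ladder of powers (p == 0 means log10)."""
    return math.log10(v) if p == 0 else v ** p

def slope_ratio(low, mid, high, p=1, q=1):
    """Ratio of upper to lower half-slope after x -> x^p, y -> y^q.
    Each of low, mid, high is an (x, y) pair of summary values."""
    (xl, yl), (xm, ym), (xh, yh) = [(power(x, p), power(y, q))
                                    for x, y in (low, mid, high)]
    upper = (yh - ym) / (xh - xm)
    lower = (ym - yl) / (xm - xl)
    return upper / lower

# Summary values (medians of thirds) for body weight (x) and brain weight (y)
low, mid, high = (0.122, 2.500), (10.000, 80.996), (4600.627, 5157.515)
raw_ratio = slope_ratio(low, mid, high)             # about 0.139: strongly curved
loglog_ratio = slope_ratio(low, mid, high, 0, 0)    # about 0.858: close to linear
```

Because the ratio is unchanged by shifting or rescaling either axis, plain powers $x^p$ can stand in for the scaled ladder form here.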
[Figure: IMR data: Residuals from log-log fit; Residual vs. log Income, with labeled points Taiwan, Zambia, Papua New Guinea, Lebanon, Saudi Arabia, Libya]

Transformations to linearity

Infant mortality rate and per-capita income
• Arrow points toward lower powers of $x$ and/or $y$
• Ratio of slopes suggests $\log x$, $\log y$

[Figure: IMR vs. Per Capita Income; Infant Mortality Rate vs. Per Capita Income]
[Figure: IMR data: log-log fit; log Infant Mortality vs. log Income]

Box-Cox Transformations

Baseball data: predicting Salary from Years, RBIc, HITSc.
• CI($\lambda$) includes $\lambda = 0$ → log(Salary)
• Effects plot shows t statistic for each regressor

The boxcox macro provides the RMSE, EFFECTS, and INFL plots:

basecox.sas
    title 'Box-Cox transformation for Baseball salary';
    %include data(baseball);
    %boxcox(data=baseball,
        id=name,                  /* player ID  */
        resp=Salary,              /* response   */
        model=Years HITSc RBIc,   /* predictors */
        gplot=RMSE EFFECT INFL);

Box-Cox Transformations

Another way to select an "optimal" transformation of $y$ in regression is to add a parameter for the power to the model,

$$y^{(\lambda)} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where $\lambda$ is the power of $y$ in (the 'ladder')

$$y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \lambda \ne 0 \\ \log y, & \lambda = 0 \end{cases}$$

• Box and Cox (1964) proposed a maximum likelihood procedure to estimate the power ($\lambda$) along with the regression coefficients ($\boldsymbol{\beta}$).
• This is equivalent to minimizing $\sqrt{\mathrm{MSE}}$ over choices of $\lambda$. ⇒ fit the model for a range of $\lambda$ (-2 to +2, say)
• The maximum likelihood method also provides a 95% confidence interval for $\lambda$.
• Plot of partial t or F for each regressor vs. $\lambda$ → sensitivity to power.

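The grid search over $\lambda$ can be sketched as follows, in Python rather than SAS and for a single predictor only. One detail is an assumption on my part: for RMSE values to be comparable across $\lambda$, the transformed response is rescaled by the geometric mean of $y$ (the usual normalized Box-Cox form); the slides do not say how the boxcox macro handles this. The function names and the artificial data are hypothetical.

```python
import math

def simple_rmse(x, z):
    """RMSE from the least-squares line of z on x (n - 2 degrees of freedom)."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b = sum((a - mx) * (c - mz) for a, c in zip(x, z)) / sxx
    sse = sum((c - mz - b * (a - mx)) ** 2 for a, c in zip(x, z))
    return math.sqrt(sse / (n - 2))

def boxcox_scaled(y, lam):
    """Box-Cox transform rescaled by the geometric mean of y, so that
    RMSE values are comparable across different lambda."""
    gm = math.exp(sum(math.log(v) for v in y) / len(y))
    if lam == 0:
        return [gm * math.log(v) for v in y]
    return [(v ** lam - 1) / (lam * gm ** (lam - 1)) for v in y]

def boxcox_profile(x, y, lambdas):
    """(lambda, RMSE) pairs over a grid of candidate powers."""
    return [(lam, simple_rmse(x, boxcox_scaled(y, lam))) for lam in lambdas]

# Artificial data in which log(y) is exactly linear in x
x = list(range(1, 11))
y = [math.exp(0.5 + 0.3 * v) for v in x]
lambdas = [-2 + 0.5 * i for i in range(9)]          # -2, -1.5, ..., 2
profile = boxcox_profile(x, y, lambdas)
best_lambda = min(profile, key=lambda pair: pair[1])[0]
```

Plotting the second element of each pair against the first would give an RMSE curve of the kind the macro draws.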
Box-Cox Transformations: GLMs

For grouping (CLASS) predictors, we can do the same analysis, using PROC GLM.
The BOXGLM macro handles models with CLASS variables

pontboxglm.sas
    title 'SLID: cmphw28c, Hourly wages, 1994';
    %boxglm(data=slid.pontario,
        resp=cmphw28c,
        model=YrSch18c eAge26c Sex21 MoTn2g15,
        class=sex21 motn2g15,
        lopower=-1.6,
        gplot=RMSE EFFECT);

Wages: cmphw28c → log(cmphw28c)

See http://www.math.yorku.ca/SCS/sasmac/boxglm.html

[Figure: Box-Cox Power Transform for Salary; Root Mean Squared Error vs. Box-Cox Power]
[Figure: t-values for Model Effects on Salary; t-value vs. Box-Cox Power, with curves for Years, HITSc, RBIc]

See http://www.math.yorku.ca/SCS/sasmac/boxcox.html

Box-Cox: Score test and influence plot

A score test is based on the slope of the log L function at $\lambda = 1$ (slope ≈ 0 ↔ at maximum)

For Box-Cox, this can be formulated as the t statistic for a constructed variable, $g$,

$$g_i = y_i \left( \log \frac{y_i}{\dot{y}} - 1 \right)$$

where $\dot{y}$ is the geometric mean of the $y_i$.
• Fit the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \phi\,\mathbf{g}$.
• Test $H_0: \phi = 0$ (↔ $\lambda = 1$).
• Another estimate of $\lambda$ is $1 - \hat{\phi}$.

A partial regression plot for $g$ shows the influence of individual observations on the choice of the transformation.

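The constructed-variable idea can be tried on a small artificial example. The sketch below is Python, not the boxcox macro, and handles one predictor only. It obtains $\hat{\phi}$ as the slope of the partial regression plot, that is, residuals of $y$ on $x$ against residuals of $g$ on $x$, which equals the coefficient of $g$ in the joint regression. All function names and the data are assumptions for illustration.

```python
import math

def residuals(x, z):
    """Residuals of z after a least-squares line on x (with intercept)."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    b = sum((a - mx) * (c - mz) for a, c in zip(x, z)) / sum((a - mx) ** 2 for a in x)
    return [c - mz - b * (a - mx) for a, c in zip(x, z)]

def constructed_variable(y):
    """g_i = y_i * (log(y_i / gm) - 1), with gm the geometric mean of y."""
    gm = math.exp(sum(math.log(v) for v in y) / len(y))
    return [v * (math.log(v / gm) - 1) for v in y]

def score_slope(x, y):
    """Coefficient of g in the regression of y on x and g, computed as the
    slope through the points of the partial regression plot."""
    ry = residuals(x, y)
    rg = residuals(x, constructed_variable(y))
    return sum(p * q for p, q in zip(ry, rg)) / sum(q * q for q in rg)

x = list(range(1, 11))
y = [math.exp(0.2 * v) for v in x]      # artificial data: log(y) is linear in x
phi_hat = score_slope(x, y)
lambda_hat = 1 - phi_hat                # one-step estimate of the power
```

For data like these, where the log is the right transformation, the one-step estimate moves below 1, in the direction of $\lambda = 0$.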
[Figure: Box-Cox Power Transform for cmphw28c; Root Mean Squared Error vs. Box-Cox Power]
[Figure: F-values for Model Effects on cmphw28c; F-value vs. Box-Cox Power, with curves for YrSch18c, eAge26c, Sex21, MoTn2g15]

Transformations of predictors

In any correlational analysis (e.g., regression, factor analysis) we can get a simple overview of the relations by
• Plotting all pairs of variables together (scatmat macro)
• Drawing a quadratic regression curve for each pair
    %scatmat(..., interp=rq).
• "Curves" will be straight when the relations are linear.
• (lowess fits are better, but more computationally intensive.)

Baseball data: predicting Salary from Years, RBIc, HITSc.
• The influence plot shows that a few players are strongly determining the choice of power, but they are not out of line with the rest.
• The slope ($\hat{\phi}$) again leads to the choice $\lambda = 0$ ⇒ $\log y$

Plot produced by the BOXCOX macro (with GPLOT=INFL):

[Figure: Partial Regression Influence plot for Box-Cox power; Partial Salary vs. Partial Constructed Variable; Slope: 0.939, Power: 0; labeled points Murray, Rice, Schmidt, Smith]

Box-Tidwell Transformations

Box and Tidwell (1962) suggested a model to determine transformations of the Xs,

$$y = \beta_0 + \beta_1 x_1^{\gamma_1} + \cdots + \beta_k x_k^{\gamma_k} + \varepsilon$$

Parameters of this model, $\beta_0, \beta_1, \ldots, \beta_k, \gamma_1, \ldots, \gamma_k$, can be estimated by:
1. Regress $y$ on $x_1, \ldots, x_k$ → $b_0, b_1, \ldots, b_k$.
2. Create constructed variables, $x_1 \log x_1, \ldots, x_k \log x_k$.
3. Regress $y$ on $x_1, \ldots, x_k, x_1 \log x_1, \ldots, x_k \log x_k$ → $b'_0, b'_1, \ldots, b'_k, g_1, \ldots, g_k$
4. Estimate of the power $\gamma_i$ is given by $\gamma_i = 1 + g_i / b_i$
5. Repeat steps 3, 4 until the $\gamma_i$ converge (gives MLE).

The constructed variables, $x_i \log x_i$, can be used to test the need for a transformation of $x_i$:
• Test $H_0: \gamma_i = 1$ from test of coefficient of $x_i \log x_i = 0$.
• Partial regression plots for the constructed variables help to assess the leverage and influence on the decision to transform an $x$ variable.

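Steps 1 to 4 can be sketched for a single predictor as follows. This is Python, not the BOXTID macro, with hypothetical function names and artificial data. The coefficient $g$ of the constructed variable is obtained by partial regression. Only one update step is shown; repeating it on the re-expressed predictor until the returned value is close to 1 corresponds to step 5.

```python
import math

def slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)

def residuals(x, z):
    """Residuals of z after a least-squares line on x."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    b = slope(x, z)
    return [c - mz - b * (a - mx) for a, c in zip(x, z)]

def box_tidwell_step(x, y):
    """One Box-Tidwell update for a single positive predictor x."""
    b = slope(x, y)                                   # step 1: regress y on x
    w = [v * math.log(v) for v in x]                  # step 2: constructed variable
    ry, rw = residuals(x, y), residuals(x, w)
    g = sum(p * q for p, q in zip(ry, rw)) / sum(q * q for q in rw)   # step 3
    return 1 + g / b                                  # step 4: estimated power

x = [float(v) for v in range(1, 11)]
y = [2 + 3 * math.sqrt(v) for v in x]                 # artificial: true power is 0.5
first_step = box_tidwell_step(x, y)                   # below 1: move down the ladder
at_solution = box_tidwell_step([math.sqrt(v) for v in x], y)   # close to 1: no further change
```

With the predictor already re-expressed as $\sqrt{x}$, the constructed variable has nothing left to explain, so the update factor is essentially 1.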
e.g., Canadian occupational prestige: % women, income, education

[Figure: scatterplot matrix of Prestige (14.8 to 87.2), Women (0 to 97.51), Educ (6.38 to 15.97), and Income (611 to 25879)]

→ Prestige non-linear w.r.t. Educ and Income

... and (score) tests for power transformations

Score tests for power transformations
            Power     StdErr    Score Z   Prob>|Z|
EDUC        2.2109    4.9114     2.4097    0.0160
INCOME     -0.0426    0.0000    -5.2625    0.0000

Powers are rounded to the nearest 0.5: Educ → Educ$^2$, Income → log Income.

Box-Tidwell transformations: Example

Canadian Occupational Prestige: find powers for Educ and Income

The BOXTID macro carries out this procedure:

    %boxtid(data=prestige,
        yvar=Prestige,
        id=job,
        xvar=Women Educ Income,   /* vars in model    */
        xtrans=Educ Income,       /* vars to xform    */
        round=.5,                 /* round powers     */
        out=boxtid);              /* output data set  */

Printed results show the iteration history ...

Iteration History: Transformation Powers
Iteration     EDUC      INCOME    Criterion
    1        2.2551    -0.9132     1.9132
    2        2.3790     0.8273     1.9059
    3        2.3593    -0.6834     1.8261
    4        2.3221     0.4444     1.6503
   ...
   13        2.2109    -0.0426     0.0005

Box-Tidwell transformations: Example

The BOXTID macro creates the transformed variables for you (e.g., t_income).

Plot with LOWESS macro, adding linear regression lines:

    %lowess(data=boxtid, x=t_educ, y=prestige, id=job,
        f=.667, interp=rl);
    %lowess(data=boxtid, x=t_income, y=prestige, id=job,
        f=.667, interp=rl);

Plots of Prestige vs. Educ$^2$ and log(Income) show that both variables are now approx. linearly related to Prestige.

Box-Tidwell transformations: Example

Partial regression plots for the transformed variables show that several observations are influential for the choice of power for Income.

[Figure: Partial Prestige vs. Partial Constructed Variable (Educ); BT power: 2; labeled points General_managers, Vocational_couns, Ministers, Physicians, Veterinarians, Farmers]
[Figure: Partial Prestige vs. Partial Constructed Variable (Income); BT power: 0; labeled points General_managers, Vocational_couns, Ministers, Physicians, Veterinarians, Farmers]

Dealing with heteroscedasticity

Classical linear models (ANOVA, regression) assume constant residual variance

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, $\quad \mathrm{Var}(\varepsilon) = \mathrm{Var}(y \mid X) = \sigma^2$ = constant

Diagnosis:
ANOVA: examine variability (IQR, std. dev.) of residuals by groups
• Plot means ± 1 std. error (meanplot macro)
• Boxplots of residuals vs. predicted (boxplot macro)
• Spread vs. level plots: Plot log(IQR) vs. log(Med) (sprdplot macro)
Regression: examine variability (IQR, std. dev.) of residuals by $x$ or $\hat{y}$
• Divide $x$ or $\hat{y}$ into groups (e.g., deciles), then plots as for ANOVA
• Spread vs. level plots: Plot $\log(|e_i|)$ vs. $\log(x)$

Treating the disease:
Fix the data: Variance stabilizing transformation, $y \to y^p$
Fix the model:
• WLS estimation (weights, $w_i \sim 1/\sigma_i^2$)
• Use a generalized linear model

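As a rough illustration of the WLS option, here is a weighted least-squares line with weights $w_i = 1/\sigma_i^2$. It is a Python sketch for one predictor, not SAS; the function name `wls_line`, the data, and the assumed standard deviations are all hypothetical.

```python
def wls_line(x, y, sigma):
    """Weighted least-squares line with weights w_i = 1 / sigma_i^2.
    Returns (intercept, slope)."""
    w = [1.0 / s ** 2 for s in sigma]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw      # weighted mean of x
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw      # weighted mean of y
    b1 = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    return my - b1 * mx, b1

x = [0, 1, 2, 3, 4]
y = [1, 3, 2, 5, 4]
ols_fit = wls_line(x, y, sigma=[1, 1, 1, 1, 1])     # equal weights: ordinary least squares
wls_fit = wls_line(x, y, sigma=[1, 1, 1, 1, 10])    # last point treated as very noisy
```

With equal weights the result coincides with ordinary least squares; a large $\sigma_i$ reduces the pull of that observation on the fitted line.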
[Figure: Prestige score vs. Squared Educ]
[Figure: Prestige score vs. Log Income]

The lowest two occupations on log(Income) should be looked at more closely.

[Figure: Survival times of animals: Means; Survival time (hrs) by Treatment (A, B, C, D), one curve per POISON level (1, 2, 3)]
[Figure: Survival times of animals: Residuals vs. Pred; RESID vs. Predict]

Both plots show greater variance associated with longer survival time.
Why should we not be surprised?

Dealing with heteroscedasticity: ANOVA

Survival time of animals: Exposed to poison, then given treatment (Box and Cox, 1964)
• Plot means ± 1 std. error (meanplot macro)
• Boxplots of residuals vs. predicted (boxplot macro)
  Trick: values of $\hat{y}$ are constant within cells

animals.sas
    %meanplot(data=animals,
        class=poison treatmt,     /* factors  */
        response=time);           /* response */

    *-- Fit full 2-way model, get output data set;
    proc glm data=animals;
        class poison treatmt;
        model time = poison | treatmt;
        output out=results p=predict r=resid;

    *-- Boxplot of residuals vs. predicted;
    %boxplot(data=results, class=Predict, var=resid);

[Figure: Spread - Level plot; log Spread vs. log Median time; Slope: 2.00, Power: -1.0; cells labeled A1 to D3]
[Figure: Meanplot of 1/Time; 1/TIME by Treatment (A, B, C, D), one curve per POISON level (1, 2, 3)]

The plot suggests transforming Time → 1/Time.
1/Time also reduces apparent interaction of Poison * Treatment

Dealing with heteroscedasticity: ANOVA

Spread vs. level plots (the sprdplot macro)
• Plot log(spread) vs. log(level), e.g., log(IQR) vs. log(Median)
• If a linear relation exists, with slope $b$, transform $y \to y^p$, with $p = 1 - b$.

··· animals.sas
    %sprdplot(data=animals,
        class=poison treatmt,
        var=time);                /* creates t_time */

    *-- Plot means of transformed variables;
    %meanplot(data=animals,
        class=poison treatmt,
        response=t_time);

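The spread-level rule $p = 1 - b$ can be sketched as follows. This is a Python illustration, not the sprdplot macro; the function name, the artificial groups, and the quartile convention (the default of `statistics.quantiles`) are assumptions.

```python
import math
import statistics

def spread_level(groups):
    """Slope of log10(IQR) on log10(median) across groups, and p = 1 - slope."""
    level, spread = [], []
    for values in groups:
        q1, _, q3 = statistics.quantiles(values, n=4)
        level.append(math.log10(statistics.median(values)))
        spread.append(math.log10(q3 - q1))
    n = len(level)
    ml, ms = sum(level) / n, sum(spread) / n
    b = (sum((l - ml) * (s - ms) for l, s in zip(level, spread))
         / sum((l - ml) ** 2 for l in level))
    return b, 1 - b

# Artificial groups whose spread is proportional to their level
base = [0.5, 0.75, 1.0, 1.25, 1.5]
groups = [[m * v for v in base] for m in (1, 10, 100)]
slope, power = spread_level(groups)
rounded_power = round(power * 2) / 2      # nearest half-integer power
```

Here the spread grows in proportion to the level, so the slope is 1 and the suggested power is 0, i.e., analyze log(y).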
Use Spread vs. level plot on grouped x

··· baseba.sas
    %sprdplot(data=grouped,
        class=decile, var=Salary);
    %boxplot(data=grouped,
        class=decile, var=logsal, id=name);

[Figure: spread-level plot; log Spread vs. log Median Salary; Slope: 0.97, Power: 0.0; points labeled by decile median batting average (222 to 300)]
[Figure: boxplots of log Salary by Batting Average Decile (0 to 9); labeled points Kennedy, Schmidt, Smith, Sierra, Pasqua, Sax, Clark]

log Salary is again indicated

Dealing with heteroscedasticity: Regression

Divide an x variable into ordered groups (e.g., deciles)

baseba.sas ···
    proc rank data=baseball out=grouped groups=10;
        var batavgc;
        ranks decile;

[Figure: Salary (in 1000$) vs. Career Batting Average; labeled points Schofield, Kennedy, Schmidt, Virgil, Sundberg, Brunansky, Smith, Murphy, Winfield]
[Figure: boxplots of Salary (1000$) vs. Median Batting Average, by decile]

Dealing with heteroscedasticity: Complex models

• Fit model, get fitted values ($\hat{y}$) in output data set
• Divide into ordered groups based on fitted value
• Spread-level plot of log(IQR) vs. log(Median)

e.g., SLID, predicting Total Wages and Salaries

pontwages.sas
    proc glm data=pontario;
        class sex21;
        model ttwgs28c = sex21
              eage26c|eage26c        /* Age, Age^2               */
              yrsch18c|yrsch18c      /* Years of schooling & ^2  */
              vismn15;               /* Visible minority?        */
        output out=stats
               r=residual p=fitted;
    run;
    proc rank data=stats out=grouped groups=10;
        var fitted;
        ranks decile;

    %sprdplot(data=grouped, var=ttwgs28c, class=decile);

Dealing with heteroscedasticity: Regression

Spread vs. level plots:
• Plot $\log(|e_i| / \hat{\sigma})$ vs. $\log(x)$
• If linear, with slope $b$, transform $y \to y^p$, with $p = 1 - b$.

[Figure: log |RSTUDENT| vs. log(X); Slope: 1.19, Power: 0]

Artificial data, generated so that $\sigma \sim x$: Power = 0 → analyze log(y)

Part 3: Fitting and understanding linear models

• Fitting linear models with SAS
• Model diagnosis: Leverage and Influence
• Visualizing influence: Partial regression plots
• Model selection

MichaelFriendly
Review
ofLinearM
odelsand
ModelB
uildingS
trategiesSlope: 0.56
Pow
er: 0.5
.
0
1
2
34
5
67
8
9
SLID
: Model for T
otal wages and salaries
log IQR, ttwgs28c
3.7
3.8
3.9
4.0
4.1
4.2
4.3
4.4
4.5
log Median, ttw
gs28c
3.53.6
3.73.8
3.94.0
4.14.2
4.34.4
4.54.6
4.7
→A
nalysisof √
wages
SP
IDA
200456
MichaelFriendly
Review
ofLinearM
odelsand
ModelB
uildingS
trategies
Fitting linear models in SAS: PROC GLM

CLASS statement for discrete predictors → dummy (0/1) variables

    proc glm data=...;
      class A B C;
      model y = A;              /* one-way ANOVA */
      model y = A B C;          /* 3-way, main effects only */
      model y = A|B|C @2;       /* 3-way, all two-way terms */
      model y = A|B|C;          /* full 3-way ANOVA */

Nested effects: "B within A" → B(A)

    proc glm data=...;
      class prov district school;
      model reading = prov district(prov) school(district prov);

"Mixed" effects: discrete and continuous predictors

    proc glm data=...;
      class A B C;
      model y = A X;            /* one-way ANCOVA */
      model y = A|B X;          /* two-way ANCOVA */
      model y = A X(A);         /* separate slopes model */
      model y = A|X;            /* test equal slopes, A*X */
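The MODEL statements above are alternatives: PROC GLM accepts only one MODEL statement per step, so each variant is run as its own step. A minimal sketch of the three ANCOVA variants, using the SASHELP.CLASS sample dataset (an assumption made for illustration; it is not used in the slides), with `sex` as the CLASS factor and `height` as the covariate:

```sas
/* One-way ANCOVA: common slope for height, separate intercepts by sex */
proc glm data=sashelp.class;
  class sex;
  model weight = sex height / solution;
run; quit;

/* Test of equal slopes: sex|height expands to sex height sex*height */
proc glm data=sashelp.class;
  class sex;
  model weight = sex | height / solution;
run; quit;

/* Separate-slopes model: one height slope within each level of sex */
proc glm data=sashelp.class;
  class sex;
  model weight = sex height(sex) / solution;
run; quit;
```

The SOLUTION option prints the parameter estimates, which makes the dummy-variable coding produced by the CLASS statement visible.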
Fitting linear models in SAS: PROC GLM

PROC GLM
• One (or more) quantitative response variable(s)
• Multiple response variables → multivariate analyses or repeated measures

GLM model syntax: regression effects (covariates)

    proc glm data=...;
      model y = X1;                         /* simple linear regression */
      model y = X1 X2 X3;                   /* multiple linear regression */
      model y = X1-X5;                      /* multiple linear regression */
      model y = wages--education;           /* multiple linear regression */
      model y = X1 X1*X1 X1*X1*X1;          /* polynomial regression */
      model y = X1 X2 X1*X2;                /* interaction model */
      model y = X1 X2 X1*X1 X2*X2 X1*X2;    /* response surface */

Bar notation: A | B | C → A B C A*B A*C B*C A*B*C

    proc glm data=...;
      *-- same, using '|' notation;
      model y = X1|X1|X1;                   /* polynomial regression */
      model y = X1|X2;                      /* interaction model */
      model y = X1|X1|X2|X2 @2;             /* response surface */
Fitting linear models in SAS: PROC REG

PROC REG
• One (or more) quantitative response variable(s)
+ Extensive facilities for regression diagnostics
+ Model selection methods: stepwise, forward, backward
+ PLOT statement → plots of any data or computed variables

    proc reg data=...;
      model y = X1 X2 X3 /        /* MRA, influence stats */
            influence partial;
      plot nqq. * r.;             /* Normal plot */
      model y = X1-X5 /           /* MRA, model selection */
            selection=stepwise sle=0.10;
Fitting linear models in SAS: PROC GLM

REPEATED statement for repeated measures analysis: univariate and multivariate tests

    proc glm data=...;
      class A B;
      model t1-t4 = A|B / nouni;          /* 2 Between, 1 Within-S factor */
      repeated trials 4 polynomial;       /* classical univar. analysis */
      manova h=A|B;                       /* MANOVA tests */

Mixed and random effects models

    proc glm data=...;
      class person age sex;
      random person;                      /* person random, nested w/in sex */
      model y = age|sex age|person(sex);
      test h=sex e=person(sex);           /* specify error terms */
      test h=age age*sex e=age*person(sex);    /* specify error terms */

Handled better in PROC MIXED
Linear models in SAS: Other procedures

PROC RSREG: Response surface models
• Automatically generates all squared terms, $x_1^2, x_2^2, \ldots$, and interaction effects, $x_1 x_2, x_1 x_3, \ldots$
  → simple way to test for (quadratic) non-linearity and interactions.

PROC SURVEYREG: Regression for sample survey data
• Handles complex survey designs: stratification, clustering, unequal weighting

PROC LIFEREG: Linear models for failure-time data
• Response can be left, right or interval censored
• More general error distributions (extreme value, exponential, ...)

PROC TRANSREG: Linear models with variable transformations
• Quantitative variables: splines, response surface, powers, ranks, ...
• Discrete variables: dummy (CLASS), optimal category scores, ...
Fitting linear models in SAS: PROC REG

− no CLASS statement: must create dummy variables (DUMMY macro)
− no | notation: must create interaction terms (INTERACT macro)

    data test;
      input x y group $ sex $ @@;
    cards;
    5 10 A M    8 12 A F    9 13 A M
    10 18 B M   16 19 B M   10 16 B F
    15 21 C M   13 19 C F   15 20 C M
    ;
    *-- Dummy variables for Sex and Group;
    %dummy(data=test, var=sex group, prefix=Sex_ Gp_);
    *-- Interaction of X*Sex;
    %interact(data=test, v1=x, v2=Sex_F, names=XSex);
    proc print noobs; run;

Produces:

     x    y   group  sex  SEX_F  GP_A  GP_B  XSex
     5   10     A     M     0     1     0      0
     8   12     A     F     1     1     0      8
     9   13     A     M     0     1     0      0
    10   18     B     M     0     0     1      0
    16   19     B     M     0     0     1      0
    10   16     B     F     1     0     1     10
    15   21     C     M     0     0     0      0
    13   19     C     F     1     0     0     13
    15   20     C     M     0     0     0      0
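If the DUMMY and INTERACT macros are not at hand, the same columns can be built in a DATA step with logical expressions, which evaluate to 0 or 1 in SAS. This is a sketch of that alternative, not the macros' own implementation; the dataset name `test2` is hypothetical.

```sas
data test;
  input x y group $ sex $ @@;
  cards;
5 10 A M    8 12 A F    9 13 A M
10 18 B M   16 19 B M   10 16 B F
15 21 C M   13 19 C F   15 20 C M
;
run;

data test2;
  set test;
  sex_f = (sex = 'F');       /* 1 for females, 0 for males */
  gp_a  = (group = 'A');     /* group C is the reference level */
  gp_b  = (group = 'B');
  xsex  = x * sex_f;         /* X*Sex interaction term */
run;

proc print data=test2 noobs; run;

/* The constructed columns can then be used directly in PROC REG */
proc reg data=test2;
  model y = x sex_f gp_a gp_b xsex;
run; quit;
```

With only nine observations the fit itself is of no substantive interest; the point is the coding of the predictors.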
Unusual data: Leverage and Influence

"Unusual" observations can have dramatic effects on least squares estimates in linear models

Three archetypal cases:
• Typical X (low leverage), bad fit
• Unusual X (high leverage), good fit
• Unusual X (high leverage), bad fit

Influential observations: unusual in both X and Y.

Heuristic formula:

    Influence = X leverage × Y residual
Linear models in SAS: Other procedures

PROC LOGISTIC: Logistic Regression
• Logit and probit models for binary response data
• Models for ordinal discrete responses

PROC GENMOD: Generalized linear models
• Classical linear models, logistic and probit models (binary data), log-linear models, ...
• Analysis of correlated data via Generalized Estimating Equations (GEE)

PROC MIXED: Mixed models
• Generalizes standard GLM to provide for correlated errors and nonconstant variance
• Provides for modelling both response means (fixed effects) and variance-covariance parameters (random effects)
• Common scenarios: clustered/hierarchical data, and repeated/longitudinal data
Dramatic example: Davis' data on reported and measured weight of men and women

[Figure: Self-Reports of Height and Weight. Reported weight in Kg plotted against measured weight in Kg, with points marked by sex of subject (F, M)]
[Figure: four scatterplots of y vs. x. Panels: Original data; O-: Low leverage, Outlier; -L: High leverage, good fit; OL: High leverage, Outlier]
Detecting outliers: Studentized Residuals

Ordinary residuals: $e_i = y_i - \hat{y}_i$, not useful because:
• Even if errors, $\varepsilon_i$, have constant variance (as assumed), residuals do not: the variance of $e_i$ varies inversely with leverage,

    $\mathrm{Var}(e_i) = \sigma^2 (1 - h_i)$

• Outliers on Y pull the regression line (surface) toward them

Studentized residuals: Standardized residual (RSTUDENT) calculated for $y_i$ deleting observation $i$. Using subscript $(-i)$ to mean deleting $i$,

    $\mathrm{RSTUDENT} \equiv e^\star_i = \frac{e_i}{s_{(-i)} \sqrt{1 - h_i}}$

Gives a test for the "mean-shift" outlier model, $E(y_i \mid X) \neq E(y_{(-i)} \mid X)$; when there is no shift, $e^\star_i \sim t(n - p - 2)$
• → $|e^\star_i| > t_{1-\alpha/2}(n - p - 2)$ significant a priori
• → $|e^\star_i| > t_{1-\alpha/(2n)}(n - p - 2)$ significant a posteriori (Bonferroni)
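The two cutoffs can be computed directly with the TINV function. A minimal sketch, assuming the SASHELP.CLASS sample dataset and a two-predictor model chosen only for illustration (neither is from the slides):

```sas
/* Save the studentized (deleted) residuals from a two-predictor model */
proc reg data=sashelp.class noprint;
  model weight = height age;
  output out=diag rstudent=rstudent;
run; quit;

data outliers;
  set diag nobs=n;                        /* n = number of observations */
  p     = 2;                              /* number of predictors in the model */
  alpha = 0.05;
  df    = n - p - 2;
  t_prior = tinv(1 - alpha/2, df);        /* cutoff for a case chosen in advance */
  t_bonf  = tinv(1 - alpha/(2*n), df);    /* Bonferroni cutoff for the largest of n */
  flag_prior = (abs(rstudent) > t_prior);
  flag_bonf  = (abs(rstudent) > t_bonf);
run;

proc print data=outliers noobs;
  var name weight rstudent t_prior t_bonf flag_prior flag_bonf;
run;
```

The Bonferroni cutoff is always the larger of the two, so a case flagged a posteriori is also flagged a priori.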
Measuring Leverage

Leverage: measured by "Hat values," $h_i$.
• so-called because fitted values can be expressed as $\hat{y} = H y$
• For simple linear regression, $h_i \sim (x_i - \bar{x})^2$
• For $p$ predictors, $h_i \sim$ squared distance of $x_i$ from the centroid, $\bar{x}$ (Mahalanobis squared distance)
• All hat values range from $1/n$ to 1, and the average is $\bar{h} = (p+1)/n$.

→ observations with $h_i > 2\bar{h}$ (or $h_i > 3\bar{h}$ in small samples) are typically considered "high leverage" points
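A short sketch of these rules of thumb, again assuming the SASHELP.CLASS sample dataset and an illustrative two-predictor model (neither is from the slides). The PROC MEANS step is a check that the mean hat value equals $(p+1)/n$.

```sas
/* Save hat values from a two-predictor model */
proc reg data=sashelp.class noprint;
  model weight = height age;
  output out=lev h=hat;
run; quit;

data lev;
  set lev nobs=n;
  p     = 2;                     /* number of predictors */
  hbar  = (p + 1) / n;           /* average hat value */
  high2 = (hat > 2*hbar);        /* usual cutoff */
  high3 = (hat > 3*hbar);        /* cutoff suggested for small samples */
run;

/* Check: the mean of the hat values should equal hbar */
proc means data=lev mean min max;
  var hat hbar;
run;

proc print data=lev noobs;
  where high2;
  var name height age hat hbar high3;
run;
```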
Influence diagnostics with SAS

PROC REG
• influence option on model statement gives printed values

inflplot macro
• Fits model using PROC REG, influence statistics → output dataset
• Plots RSTUDENT vs. Hat value, bubble size ∼ Cook's D or DFFITS
• Labels "noteworthy" observations: large RSTUDENT and/or Hat value
• Shows nominal cutoffs for "unusual" values

Similar macros
• inflogis macro: logistic regression (PROC LOGISTIC)
• inflglim macro: generalized linear models (PROC GENMOD)

See: http://www.math.yorku.ca/SCS/sssg/inflplot.html
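A rough stand-in for this kind of display can be assembled from the PROC REG OUTPUT statement and the BUBBLE statement of PROC SGPLOT. This is a sketch under stated assumptions, not the inflplot macro: it uses the SASHELP.CLASS sample dataset, labels every point rather than only noteworthy ones, and needs a SAS release that provides PROC SGPLOT with the BUBBLE statement.

```sas
/* Influence statistics to an output dataset */
proc reg data=sashelp.class noprint;
  model weight = height age;
  output out=infl h=hat rstudent=rstudent cookd=cookd;
run; quit;

/* Studentized residual vs. hat value, bubble size proportional to Cook's D */
proc sgplot data=infl;
  bubble x=hat y=rstudent size=cookd / datalabel=name;
  refline -2 2 / axis=y lineattrs=(pattern=dash);   /* nominal residual cutoffs */
  xaxis label="Leverage (Hat Value)";
  yaxis label="Studentized Residual";
run;
```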
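Influence = Leverage × Residual

Cook's D: Scale-invariant (squared) measure of "distance" between $\hat{\beta}$ (all) and $\hat{\beta}_{(-i)}$ (deleting obs. $i$)

    $\mathrm{COOKD}_i \equiv D_i = \left( \frac{e_i^2}{(p+1)\, s^2} \right) \times \frac{h_i}{(1 - h_i)^2}$

"Large" values: $D_i > 4/n$ [or $D_i > 4/(n - p - 1)$]

DFFITS: Scaled measure of (signed) change in predicted value for $y_i$, deleting obs. $i$

    $\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{(-i)}}{s_{(-i)} \sqrt{h_i}} = \left( \frac{e_i}{s_{(-i)}} \right) \times \frac{\sqrt{h_i}}{1 - h_i}$

"Large" values: $|\mathrm{DFFITS}_i| > 2 \sqrt{(p+1)/n}$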
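The heading "Influence = Leverage × Residual" can be made explicit by factoring both measures. The sketch below relies on the standard least-squares deletion identity $\hat{y}_i - \hat{y}_{(-i)} = h_i e_i / (1 - h_i)$, which is not stated on the slide, and writes $r_i$ for the internally studentized residual (notation introduced here).

```latex
\begin{align*}
\hat{y}_i - \hat{y}_{(-i)} &= \frac{h_i\, e_i}{1 - h_i} \\[4pt]
\mathrm{DFFITS}_i
  &= \frac{\hat{y}_i - \hat{y}_{(-i)}}{s_{(-i)}\sqrt{h_i}}
   = \frac{e_i}{s_{(-i)}} \cdot \frac{\sqrt{h_i}}{1 - h_i}
   = \underbrace{\frac{e_i}{s_{(-i)}\sqrt{1 - h_i}}}_{\text{residual: } e^\star_i}
     \times
     \underbrace{\sqrt{\frac{h_i}{1 - h_i}}}_{\text{leverage}} \\[4pt]
D_i &= \frac{e_i^2}{(p+1)\, s^2} \cdot \frac{h_i}{(1 - h_i)^2}
     = \frac{1}{p+1}\,
       \underbrace{\frac{e_i^2}{s^2\,(1 - h_i)}}_{\text{residual: } r_i^2}
       \times
       \underbrace{\frac{h_i}{1 - h_i}}_{\text{leverage}}
\end{align*}
```

Both measures vanish when either factor is zero: a case with a large residual but negligible leverage, or high leverage but a residual near zero, has little influence.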
[Figure: Duncan data: Influence Plot. Studentized Residual vs. Leverage (Hat Value); bubble size: Cook's Distance. Labeled points: Minister, Reporter, RR Conductor, Contractor, RR Engineer]
Example: Duncan's Occupational Prestige Data

PROC REG step, with influence option

duncinfl2.sas ···

    %include data(duncan);
    proc reg data=duncan;
      model prestige = Income Educ / influence;
      id job;
    run;

inflplot macro:

··· duncinfl2.sas

    title 'Duncan data: Influence Plot';
    title2 "Bubble size: Cook's Distance";
    %inflplot(data=duncan,
      y=Prestige,          /* response */
      x=Income Educ,       /* predictors */
      id=job,              /* ID variable */
      bubble=cookd         /* bubble ~ Cook's D */
      );
Partial regression plots

Problems
• Correlated predictors: Ordinary scatterplots cannot show the unique effects of one predictor, controlling for others
• Joint influence: Single deletion diagnostics cannot show whether sets of observations are jointly influential, or offset each other

Solution: Partial regression (added-variable) plots
• For $x_k$, plot $y \mid \text{others}$ vs. $x_k \mid \text{others}$. (others $\equiv X_{[-k]}$)

    $y \mid \text{others} \equiv y^\star_k = y - \hat{y}_{X[-k]}$
    $x_k \mid \text{others} \equiv x^\star_k = x_k - \hat{x}_{X[-k]}$

• $y^\star_k$ = residuals from regression of $y$ on $X_{[-k]}$
• $x^\star_k$ = residuals from regression of $x_k$ on $X_{[-k]}$

→ unique relation of $y$ to $x_k$, controlling/adjusting for all other $x$s.
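The construction can be carried out by hand with two regressions on the other predictors. A minimal sketch, assuming the SASHELP.CLASS sample dataset with `height` playing the role of $x_k$ and `age` the role of the other predictors (an illustration, not the partial macro); the names `y_star` and `x_star` are hypothetical.

```sas
/* Full model: keep the coefficient of height for comparison */
proc reg data=sashelp.class outest=full noprint;
  model weight = height age;
run; quit;

/* Residuals of y and of x_k, each regressed on the other predictor(s) */
proc reg data=sashelp.class noprint;
  model weight height = age;
  output out=part r=y_star x_star;     /* one residual name per response */
run; quit;

/* Regress y* on x*: the slope is the partial regression coefficient */
proc reg data=part outest=partial noprint;
  model y_star = x_star;
run; quit;

proc print data=full noobs;    var height; run;   /* b_k in the full model */
proc print data=partial noobs; var x_star; run;   /* slope in the partial plot */
```

The two printed coefficients should agree, and a scatterplot of `y_star` against `x_star` is the partial regression plot for height.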
Example: Duncan's Occupational Prestige Data

Influence on coefficients is substantial:

All n = 45 cases

                          Parameter Estimates
                                Parameter   Standard
    Variable    Label      DF    Estimate      Error   t Value   Pr > |t|
    Intercept   Intercept   1    -6.06466    4.27194     -1.42     0.1631
    income      Income      1     0.59873    0.11967      5.00     <.0001
    educ        Education   1     0.54583    0.09825      5.56     <.0001

Deleting Minister, RR Conductor, RR Engineer

                          Parameter Estimates
                                Parameter   Standard
    Variable    Label      DF    Estimate      Error   t Value   Pr > |t|
    Intercept   Intercept   1    -6.31736    3.67962     -1.72     0.0939
    income      Income      1     0.93066    0.15375      6.05     <.0001
    educ        Education   1     0.28464    0.12136      2.35     0.0242
Partial regression plots: Example

PROC REG step, with partial option → printer plots

duncan4.sas ···

    %include data(duncan);
    proc reg data=duncan;
      model prestige = Income Educ / partial;
      id job;
    run;

partial macro: high-res plots

··· duncan4.sas

    %partial(data=duncan,
      yvar=Prestige,        /* response */
      xvar=Income Educ,     /* predictors */
      id=job,               /* ID variable */
      label=INFL            /* label influential pts */
      );
Partial regression plots: Properties

• slope of $y^\star_k$ on $x^\star_k$ = $b_k$, the estimate of the (partial) regression coefficient, $\beta_k$, in the full model.
• residuals from the regression line in this plot ≡ residuals for $y$ in the full model, i.e., $y^\star_k = b_k x^\star_k + e$
• simple correlation between $y^\star_k$ and $x^\star_k$ = partial correlation between $y$ and $x_k$ with the other $x$ variables partialled out or controlled.
• plot shows partial leverage ($\sim x^{\star 2}_{ik}$) and influence
[Figure: Partial regression plot, Partial prestige vs. Partial income. Labeled points: Minister, Reporter, RR Conductor, Contractor, RR Engineer]

Minister and RR Conductor are jointly influential: decrease slope for Income
[Figure: Partial regression plot, Partial prestige vs. Partial educ. Labeled points: Minister, Reporter, RR Conductor, Contractor, RR Engineer]

Minister and RR Conductor are jointly influential: increase slope for Education
3D view: Deleting 3 influential cases

[Figure: fitted response surface of Prestige over Income and Education, after deleting 3 influential cases]
3D view: Observations and Fitted Response Surface

[Figure: Prestige over Income and Education, with observations and the fitted response surface. Labeled points: Minister, Reporter, RR Conductor, Contractor, RR Engineer]
Model selection: Example

Response: Fuel consumption (per capita) in US states

Predictors:

    TAX        state tax rate on motor fuel
    DRIVERS    proportion of licensed drivers
    ROAD       length of federal highways
    INC        per capita personal income
    POP        population

fuelcp.sas

    %include data(fuel);
    %cpplot(data=fuel,
      yvar=fuel,                         /* response */
      xvar=tax drivers road inc pop,     /* predictors */
      gplot=CP F AIC,                    /* what to plot */
      plotchar=T D R I P                 /* var labels */
      );
Model selection for Linear Models

Opposing goals
• Want $R^2$ large or MSE small → include many variables
• Want parsimony, simplicity of interpretation, small cost of data collection → include few variables

Selection criteria

$R^2 = SS_{model} / SS_{total}$
• Cannot decrease as $p$ increases

Mallows' $C_p = \frac{SSE_P}{MSE_{full}} - (n - 2P)$
• Measures total squared error (random error + bias) using $P = p + 1$ parameters out of $m$ available in the full model
• $C_p$ directly related to incremental $F_p$ for testing predictors omitted from model, $H_0: \beta_{p+1} = \cdots = \beta_m = 0$
• "Good" models (no bias): $C_p \approx P$ or $F_p \approx 1$
• Model comparisons: plot $C_p$ or $F_p$ vs. $P$ (CPPLOT macro)
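The link between $C_p$ and the incremental $F_p$ can be written out in a few lines. In this sketch $M = m + 1$ denotes the number of parameters in the full model; that symbol is introduced here and is not on the slide.

```latex
\begin{align*}
F_p &= \frac{(SSE_P - SSE_{full}) / (M - P)}{MSE_{full}},
\qquad MSE_{full} = \frac{SSE_{full}}{n - M} \\[4pt]
\frac{SSE_P}{MSE_{full}} &= (M - P)\, F_p + (n - M) \\[4pt]
C_p &= \frac{SSE_P}{MSE_{full}} - (n - 2P)
     = P + (M - P)\,(F_p - 1)
\end{align*}
```

So $F_p \approx 1$ corresponds exactly to $C_p \approx P$. For the full model itself, $SSE_P / MSE_{full} = n - M$, which gives $C_p = M$; this is why the five-predictor model in the fuel example has $C(p) = 6.0000$.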
CPPLOT macro: $F_p$ vs. $p$ plot

[Figure: Fuel Consumption across the US. F for Omitted Variables vs. Number of parameters in model (2 to 6); each model is labeled by its predictor initials (T, D, R, I, P)]
CPPLOT macro: $C_p$ vs. $p$ plot

[Figure: Fuel Consumption across the US. Mallows C(p) vs. Number of parameters in model (2 to 6); each model is labeled by its predictor initials (T, D, R, I, P)]
CPPLOT macro: AIC vs. $p$ plot

[Figure: Fuel Consumption across the US. Akaike's information criterion vs. Number of parameters in model (2 to 6); each model is labeled by its predictor initials (T, D, R, I, P)]
Model selection for Linear Models

Parsimony measures

Combine badness of fit (SSE) with penalty for more parameters ($p$)
• AIC (Akaike's Information Criterion): $AIC = n \ln(SSE/n) + 2p$
• BIC (Sawa's Bayesian Information Criterion): larger penalty
• Smaller is better
• No p-values: they are model comparison statistics, rather than test statistics

Candidate models

All criteria and model selection methods should be considered as nominating a few candidate models, to be explored more fully

Things to worry about:
• Influential observations: a few cases can determine choice of "best" model
• Model misspecification: nonlinear effects, omitted predictors
Model selection procedures

Automated selection

Forward selection
• Start with none
• At each step, add variable with largest incremental F-value, if Pr(F) < SLENTRY
• Until: no remaining variable has Pr(F) < SLENTRY

Backward elimination
• Start with all
• At each step, remove variable with smallest incremental F-value, if Pr(F) > SLSTAY
• Until: all remaining variables have Pr(F) < SLSTAY

Stepwise selection
• Start with two forward steps
• Alternate: forward step, backward step if Pr(F) > SLSTAY
• Until: None can be added or removed

Sound good, but all are dangerous!
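One way to see the danger is to run stepwise selection on data where no predictor is related to the response at all. The sketch below is an illustration on simulated data; the dataset `noise`, the seed, and the sample size are hypothetical. Any variable the procedure retains is spurious by construction.

```sas
/* Response and 20 candidate predictors, all independent random noise */
data noise;
  call streaminit(89);
  array x{20};                  /* creates x1-x20 */
  do i = 1 to 50;
    y = rand('normal');
    do j = 1 to 20;
      x{j} = rand('normal');
    end;
    output;
  end;
  drop i j;
run;

/* Stepwise selection with the PROC REG default-style 0.15 thresholds */
proc reg data=noise;
  model y = x1-x20 / selection=stepwise slentry=0.15 slstay=0.15;
run; quit;
```

With 20 candidates screened at the 0.15 level, it would not be surprising for some noise variables to enter, and the reported F tests for those variables do not account for the search that selected them.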
Model selection procedures

All possible regressions
• $2^p - 1$ of them, e.g., $p = 10 \Rightarrow 1023$ models
• Rank "best" candidate models via $R^2$, $C_p$, AIC

    proc reg data=fuel;
      model fuel = tax drivers road inc pop /
            selection=rsquare best=3 cp aic;

R-Square Selection Method

    Number in
      Model     R-Square      C(p)        AIC    Variables in Model
        1        0.4886     27.2658     423.68   drivers
        1        0.2141     65.5021     444.30   pop
        1        0.2037     66.9641     444.93   tax
    ----------------------------------------------------------------
        2        0.6175     11.2968     411.73   drivers inc
        2        0.5567     19.7727     418.82   tax drivers
        2        0.5382     22.3532     420.78   drivers pop
    ----------------------------------------------------------------
        3        0.6749      5.3057     405.93   tax drivers inc
        3        0.6522      8.4600     409.17   tax drivers pop
        3        0.6249     12.2636     412.79   drivers road inc
    ----------------------------------------------------------------
        4        0.6956      4.4172     404.77   tax drivers inc pop
        4        0.6787      6.7723     407.37   tax drivers road inc
        4        0.6687      8.1598     408.83   tax drivers road pop
    ----------------------------------------------------------------
        5        0.6986      6.0000     406.30   tax drivers road inc pop
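The AIC column can be checked against the R-Square column using the definition $AIC = n \ln(SSE/n) + 2p$ and $SSE = (1 - R^2)\, SS_{total}$. The worked example below assumes $n = 48$ observations; the slides do not state $n$, so this value is an assumption.

```latex
\begin{align*}
AIC_a - AIC_b
  &= n \ln\frac{SSE_a}{SSE_b} + 2\,(p_a - p_b)
   = n \ln\frac{1 - R^2_a}{1 - R^2_b} + 2\,(p_a - p_b) \\[4pt]
\text{(drivers) vs. (drivers, inc):}\quad
  & 48 \ln\frac{1 - 0.4886}{1 - 0.6175} + 2\,(1 - 2)
    = 48 \ln\frac{0.5114}{0.3825} - 2
    \approx 48\,(0.2904) - 2 \approx 11.94
\end{align*}
```

The tabulated difference is $423.68 - 411.73 = 11.95$, which agrees up to rounding of $R^2$. Adding inc lowers AIC by far more than the penalty of 2 for the extra parameter.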
But, influence plot shows several influential cases:

[Figure: Fuel Consumption: Influence Plot. Studentized Residual vs. Leverage (Hat Value). Labeled points: CA, IL, NV, NY, RI, SD, TX, WY]
Model selection procedures

    proc reg data=fuel;
      model fuel = tax drivers road inc pop /
            selection=stepwise;

Summary Output:

    All variables left in the model are significant at the 0.1500 level.
    No other variable met the 0.1500 significance level for entry.

                     Summary of Stepwise Selection

            Variable    Partial     Model
    Step    Entered     R-Square    R-Square     C(p)     F Value    Pr > F
      1     drivers      0.4886      0.4886     27.265     43.94     <.0001
      2     inc          0.1290      0.6175     11.296     15.17     0.0003
      3     tax          0.0573      0.6749      5.305      7.76     0.0078
      4     pop          0.0207      0.6956      4.417      2.93     0.0942

All criteria and procedures agree on this as best or 2nd best

What's not to like?
Using log(Population density) = log(pop/area) alone: $R^2 = .73$!

[Figure: Fuel data: Considering population density. Fuel consumption (/person) vs. log Population Density. Labeled points: CA, NV, NY, RI, SD, TX, WY]

Adding drivers: $R^2 = .79$, a simple, sensible, and interpretable model
Partial residual plots suggest model misspecification:
• fuel was measured per capita: why should pop matter?
→ California accounts for most of effect of pop

[Figure: two partial plots, Partial fuel vs. Partial drivers and Partial fuel vs. Partial pop. Labeled points: CA, IL, NV, NY, RI, TX, WY]

• Why is fuel under-predicted for Wyoming, over-predicted for Rhode Island?
→ suggests to look at population density
References

Box, G. E. P. and Cox, D. R. An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B, 26:211–252, 1964.

Box, G. E. P. and Tidwell, P. W. Transformation of the independent variables. Technometrics, 4:531–550, 1962.

Emerson, J. D. and Stoto, M. A. Exploratory methods for choosing power transformations. Journal of the American Statistical Association, 77:103–108, 1982.

Friendly, M. SAS System for Statistical Graphics. SAS Institute, Cary, NC, 1st edition, 1991.

Friendly, M. Visualizing Categorical Data. SAS Institute, Cary, NC, 2000.

Tukey, J. W. Exploratory Data Analysis. Addison Wesley, Reading, MA, 1977.
Data analysis = Summarization + Exposure + Tools + Understanding

Effective data analysis requires Summarization and Exposure:

Summarization:
• Parameter estimates ($\beta_1, \beta_2, \cdots$)
• Hypothesis tests ($H_0: \beta_1 = 0$)
• Comparisons & differences (with standard errors!)

Exposure:
• Visualize data and predicted values
• Understand patterns and trends
• Detect anomalies

Effective data analysis requires Tools and Understanding

Tools:
• Statistical models and methods for complex problems
• Computational and graphical methods must be available and easy to use

Understanding:
• What models can do, and what they can't
• What to change when things go wrong