1
Re-expressing Data
Chapter 6 – Normal Model
– What if data do not follow a Normal model?
Chapters 8 & 9 – Linear Model
– What if a relationship between two variables is not linear?
2
Re-expressing Data
Re-expression is another name for changing the scale of (transforming) the data.
Usually we re-express the response variable, Y.
3
Goals of Re-expression
Goal 1 – Make the distribution of the re-expressed data more symmetric.
Goal 2 – Make the spread of the re-expressed data more similar across groups.
4
Goals of Re-expression
Goal 3 – Make the form of a scatter plot more linear.
Goal 4 – Make the scatter in the scatter plot more even across all values of the explanatory variable.
5
Ladder of Powers
Power: 2
Re-expression: $y^2$
Comment: Use on left skewed data.
6
Ladder of Powers
Power: 1
Re-expression: $y$
Comment: No re-expression. Do not re-express the data if they are already well behaved.
7
Ladder of Powers
Power: ½
Re-expression: $\sqrt{y}$
Comment: Use on count data or when scatter in a scatter plot tends to increase as the explanatory variable increases.
8
Ladder of Powers
Power: “0”
Re-expression: $\log y$
Comments: Not really the “0” power. Use on right skewed data. Measurements cannot be negative or zero.
9
Ladder of Powers
Power: –½, –1
Re-expression: $1/\sqrt{y}$, $1/y$
Comments: Use on right skewed data. Measurements cannot be negative or zero. Use on ratios.
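The rungs of the ladder can be sketched as a single function. This is a minimal illustration, not code from the slides: the name `re_express` is hypothetical, the log is taken to be the natural log, and the two negative powers are written as plain reciprocals, which is one reading of the slide.

```python
import math

# Rungs covered by this sketch (the powers listed on the Ladder of Powers slides).
RUNGS = (2, 1, 0.5, 0, -0.5, -1)


def re_express(y, power):
    """Re-express one measurement y using a rung of the Ladder of Powers.

    Hypothetical helper for illustration. Power 0 stands for the log,
    and the negative powers are written as reciprocals.
    """
    if power not in RUNGS:
        raise ValueError(f"power {power} is not a rung handled by this sketch")
    if power <= 0 and y <= 0:
        # The log and reciprocal rungs need strictly positive measurements.
        raise ValueError("measurements cannot be negative or zero for this power")
    if power == 0.5 and y < 0:
        raise ValueError("square root needs a non-negative measurement")

    if power == 2:
        return y ** 2
    if power == 1:
        return y  # no re-expression
    if power == 0.5:
        return math.sqrt(y)
    if power == 0:
        return math.log(y)  # natural log (an assumption of this sketch)
    if power == -0.5:
        return 1 / math.sqrt(y)
    return 1 / y  # power == -1


for p in RUNGS:
    print(p, re_express(4.0, p))
```

Moving down the ladder pulls in large values more and more strongly, which is why the lower rungs are suggested for right skewed data.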
10
Goal 1 - Symmetry
Data are obtained on the time between nerve pulses along a nerve fiber.
Time is rounded to the nearest half unit, where a unit is 1/50th of a second.
– 30.5 represents 30.5/50 = 0.61 sec.
11
[Figure: histogram (Count) and Normal Quantile Plot of Time (1/50th sec)]
12
Time – Nerve Pulses
Distribution is skewed right.
Sample mean (12.305) is much larger than the sample median (7.5).
Many potential outliers.
Data not from a Normal model.
13
[Figure: histogram (Count) and Normal Quantile Plot of Sqrt(Time)]
14
[Figure: histogram (Count) and Normal Quantile Plot of Log(Time)]
15
Summary
Time – Highly skewed to the right.
Sqrt(Time) – Still skewed right.
Log(Time) – Fairly symmetric and mounded in the middle.
– Could have come from a Normal model.
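The comparison in this summary can be mimicked numerically with a skewness coefficient. The sketch below does not use the nerve pulse data, which are not reproduced in the slides. It draws a synthetic right skewed sample (an assumption made only for illustration) and checks how the skewness changes under the square root and log re-expressions. The helper name `skewness` is hypothetical.

```python
import math
import random


def skewness(values):
    """Moment-based skewness: third central moment over (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic, strictly positive, right skewed sample.
# This is NOT the nerve pulse data; it is a stand-in for illustration only.
random.seed(1)
time = [random.lognormvariate(2.0, 1.0) for _ in range(500)]

skew_time = skewness(time)
skew_sqrt = skewness([math.sqrt(t) for t in time])
skew_log = skewness([math.log(t) for t in time])

print("Time      ", round(skew_time, 2))
print("Sqrt(Time)", round(skew_sqrt, 2))
print("Log(Time) ", round(skew_log, 2))
```

A skewness near zero is consistent with a symmetric distribution, but it does not replace looking at the histogram and the Normal quantile plot.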
16
Goal 3 – Straighten Up
What is the relationship between the temperature of coffee and the time since it was poured?
– Y, temperature (°F)
– X, time (minutes)
17
[Figure: Bivariate Fit of Temp By Time (min)]
18
Cooling Coffee
There is a general negative association – as time since the coffee was poured increases, the temperature of the coffee decreases.
19
Linear Model
[Figure: Temp (F) versus Time (min)]
20
Linear Model
Fit Summary
– Predicted Temp = 176.7 – 1.56*Time
– On average, temperature decreases 1.56 °F per minute.
– R² = 0.99; 99% of the variation in temperature is explained by the linear relationship with time.
21
Plot of Residuals
[Figure: Residual versus Time (min)]
22
Curved Pattern
There is a clear pattern in the plot of residuals versus time.
– Under predict, over predict, under predict.
The linear fit is very good, but we can do better.
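The under, over, under pattern can be reproduced with a short sketch. The original coffee measurements are not listed in the slides, so the temperatures here are synthetic: they are generated, as an assumption, from a smooth exponential cooling curve. A straight line is then fitted by least squares and the signs of the residuals are printed. The helper `fit_line` is hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Synthetic stand-in for the coffee data (NOT the original measurements):
# a smooth exponential cooling curve sampled every 5 minutes for an hour.
time = [float(t) for t in range(0, 61, 5)]
temp = [180.3 * math.exp(-0.0114 * t) for t in time]

slope, intercept = fit_line(time, temp)

# Residual = observed - predicted; a positive residual means the line under predicts.
residuals = [y - (intercept + slope * t) for t, y in zip(time, temp)]

mean_temp = sum(temp) / len(temp)
r_squared = 1 - sum(r ** 2 for r in residuals) / sum((y - mean_temp) ** 2 for y in temp)

signs = "".join("+" if r > 0 else "-" for r in residuals)
print("residual signs by time:", signs)
print("R squared:", round(r_squared, 3))
```

Even with a high R squared, the residuals are positive at both ends and negative in the middle, which is the curved pattern that signals a re-expression may help.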
23
[Figure: Bivariate Fit of Log(Temp) By Time (min), with Linear Fit]
24
Log(Temp) by Time
Summary
– Predicted Log(Temp) = 5.1946 – 0.0114*Time
– On average, log temperature decreases 0.0114 log(°F) per minute.
25
Plot of Residuals
[Figure: Residual versus Time (min)]
26
Interpretation
There is a random scatter of points around the zero line.
The linear model relating Log(Temp) to Time is the best we can do.
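A sketch of the same workflow in code: take the log of the response, fit a line, and inspect the residuals. The data are again synthetic (an assumption, since the slides do not list the measurements): exponential cooling with a small amount of random noise on the log scale. The noise level, the seed, and the helper `fit_line` are all illustrative choices.

```python
import math
import random


def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Synthetic stand-in for the coffee data (NOT the original measurements):
# exponential cooling with small multiplicative noise.
random.seed(0)
time = [float(t) for t in range(0, 61, 5)]
temp = [180.3 * math.exp(-0.0114 * t + random.gauss(0, 0.005)) for t in time]

# Re-express the response with the natural log, then fit a straight line.
log_temp = [math.log(y) for y in temp]
slope, intercept = fit_line(time, log_temp)

residuals = [ly - (intercept + slope * t) for t, ly in zip(time, log_temp)]

print("Predicted Log(Temp) =", round(intercept, 4), "+", round(slope, 4), "* Time")
for t, r in zip(time, residuals):
    print(int(t), round(r, 4))
```

With data of this shape the residuals on the log scale show no systematic curve, so the straight line describes Log(Temp) well.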
27
Original Scale?
Predicted Log(Temp) = 5.1946 – 0.0114*Time
Predicted Temp = $180.3\,e^{-0.0114 \cdot \text{Time}}$
– Predicted temp at time = 0: 180.3 °F
– The predicted temp in one more minute is the predicted temp now multiplied by $e^{-0.0114} = 0.98866$
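The back-transformation is a matter of exponentiating both sides of the fitted equation. The short sketch below checks the arithmetic using the intercept and slope reported in the slides, and assumes Log is the natural log, which is what makes the numbers agree. The function name `predicted_temp` is hypothetical.

```python
import math

# Coefficients of the fitted line on the log scale, as reported in the slides.
intercept = 5.1946
slope = -0.0114

# exp(intercept) is the predicted temperature at Time = 0 (about 180.3).
start_temp = math.exp(intercept)

# exp(slope) is the factor applied for each additional minute (about 0.98866).
per_minute_factor = math.exp(slope)


def predicted_temp(minutes):
    """Predicted temperature (degrees F) on the original scale."""
    return start_temp * math.exp(slope * minutes)


print(round(start_temp, 1))
print(round(per_minute_factor, 5))
print(round(predicted_temp(10), 1))
```

On the original scale the model is multiplicative: each minute the predicted temperature falls by roughly 1.1% of its current value rather than by a fixed number of degrees.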
28
JMP Method 1
– Create a new column in JMP, Log(Temp): Cols – Formula – Transcendental – Log.
29
JMP Method 1 (continued)
– Fit Y by X
  Y – Log(Temp)
  X – Time
– Fit Linear
30
JMP Method 2
– Fit Y by X
  Y – Temp
  X – Time
– Fit Special
  Transform Y – Log