Day 7
Model Evaluation
Elements of Model Evaluation

- Goodness of fit
- Prediction error
- Bias
- Outliers and patterns in residuals
Assessing Goodness of Fit for Continuous Data
Visual methods - Don't underestimate the power of your eyes, but eyes can deceive, too...

Quantification - A variety of traditional measures, all with some limitations...

A good review: C. D. Schunn and D. Wallach. Evaluating Goodness-of-Fit in Comparison of Models to Data. Source: http://www.lrdc.pitt.edu/schunn/gof/GOF.doc
Traditional inferential tests masquerading as GOF measures
The χ2 "goodness of fit" statistic - For categorical data only, this can be used as a test statistic: "What is the probability that the 'model' is true, given the observed results?"

$$\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}$$

- The test can only be used to reject a model. If the model is accepted, the statistic contains no information on how good the fit is...
- Thus, this is really a badness-of-fit statistic.
- Other limitations as a measure of goodness of fit:
  » Rewards sloppy research if you are actually trying to "test" (as a null hypothesis) a real model, because small sample size and noisy data will limit power to reject the null hypothesis.
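The χ2 statistic above is simple to compute directly. A minimal sketch with made-up category counts (the `chi_square` helper name is mine, not from the lecture):

```python
# Sketch: the chi-square "badness-of-fit" statistic, sum of (o - e)^2 / e
# over categories. Counts below are hypothetical.

def chi_square(observed, expected):
    """Return the chi-square statistic for paired category counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 55, 27]
expected = [20, 50, 30]   # counts predicted by the model
stat = chi_square(observed, expected)
```

Note that a small `stat` with few, noisy observations says little about fit quality; it may simply reflect low power to reject.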
Visual evaluation for continuous data
Graphing observed vs. predicted...
obs = 0.62 + 0.95(pred)
[Figure: observed vs. predicted scatter, with the fitted line and the 1:1 line for reference]
Source: Canham, C. D., P. T. LePage, and K. D. Coates. 2004. A neighborhood analysis of canopy tree competition: effects of shading versus crowding. Canadian Journal of Forest Research.
Examples

Goodness of fit of neighborhood models of canopy tree growth for 2 species at Date Creek, BC
Red Cedar: y = 1.0022x, R2 = 0.5963
Hemlock: y = 1.0001x, R2 = 0.3402

[Figure: observed vs. predicted growth for each species]
Goodness of Fit vs. Bias

Four observed vs. predicted panels, each with the 1:1 line for reference:

- Good fit, biased underestimate: y = 1.5146x, R2 = 0.9755
- Good fit, no bias: y = 1.0013x, R2 = 0.937
- Poor fit, no bias: y = 1.0167x, R2 = 0.6672
- Poor fit, biased overestimate: y = 0.4508x + 5.2182, R2 = 0.4787
R2 as a measure of goodness of fit
R2 = proportion of variance* explained by the model (relative to that explained by the simple mean of the data):

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{N}(\mathrm{obs}_i - \mathrm{exp}_i)^2}{\sum_{i=1}^{N}(\mathrm{obs}_i - \overline{\mathrm{obs}})^2}$$

Where exp_i is the expected value of observation i given the model, and obs-bar is the overall mean of the observations. (Note: R2 is NOT bounded between 0 and 1.)
* this interpretation of R2 is technically only valid for data where SSE is an appropriate estimate of variance (e.g. normal data)
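A minimal sketch of the definition above, with made-up data, showing that R2 goes negative whenever the model does worse than the mean (the `r_squared` helper name is mine):

```python
# Sketch: R^2 = 1 - SSE/SST, computed directly from the definition.

def r_squared(obs, exp):
    """1 - SSE/SST; negative when the model is worse than the mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - e) ** 2 for o, e in zip(obs, exp))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - sse / sst

obs = [2.0, 4.0, 6.0, 8.0]
good = r_squared(obs, [2.1, 3.9, 6.2, 7.8])   # close predictions
bad = r_squared(obs, [8.0, 6.0, 4.0, 2.0])    # worse than using the mean
```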
R2 – when is the mean the mean?
Clark et al. (1998) Ecological Monographs 68:220
$$R^2 = 1 - \frac{\sum_{j=1}^{S}\sum_{i=1}^{N}(\mathrm{obs}_{ij} - \mathrm{exp}_{ij})^2}{\sum_{j=1}^{S}\sum_{i=1}^{N}(\mathrm{obs}_{ij} - \overline{\mathrm{obs}}_j)^2}$$

For i = 1..N observations in j = 1..S sites - uses the SITE means, rather than the overall mean, to calculate R2.
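A sketch of the site-mean variant, assuming a mapping from site to (obs, exp) pairs (the data and `site_r_squared` name are mine, for illustration):

```python
# Sketch: Clark et al.-style R^2 where SST uses each site's own mean
# rather than the grand mean of all observations.

def site_r_squared(sites):
    """sites: dict mapping site id -> list of (obs, exp) pairs."""
    sse = sst = 0.0
    for pairs in sites.values():
        obs = [o for o, _ in pairs]
        site_mean = sum(obs) / len(obs)
        sse += sum((o - e) ** 2 for o, e in pairs)
        sst += sum((o - site_mean) ** 2 for o in obs)
    return 1 - sse / sst

sites = {
    "A": [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1)],
    "B": [(10.0, 9.8), (11.0, 11.3), (12.0, 12.1)],
}
r2_site = site_r_squared(sites)
```

When sites differ strongly in their means, the grand-mean SST is inflated, so this version asks how well the model explains variation *within* sites.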
r2 as a measure of goodness of fit
$$r^2 = \frac{\left(\sum xy - \frac{\sum x \sum y}{n}\right)^2}{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}$$
r2 = squared correlation (r) between observed (x) and predicted (y)
[Figure: two observed vs. predicted panels, each with the 1:1 line. Left: R2 = -59.8, r2 = 0.699. Right: R2 = 0.905, r2 = 0.912.]

NOTE: r2 is bounded between 0 and 1 (r between -1 and 1), unlike R2.
R2 vs r2
[Figure: observed vs. predicted with fitted line y = 1.5003x + 5.3709 and the 1:1 line. r2 = 0.81, R2 = -0.39.]

Is this a good fit (r2 = 0.81) or a really lousy fit (R2 = -0.39)? (It's undoubtedly biased...)
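The divergence between the two measures can be demonstrated with a few lines. A minimal sketch (made-up data following obs = 1.5·pred + 5, mimicking a biased fit; the helper names are mine):

```python
# Sketch: a biased model can have a perfect r^2 (tight linear relation)
# while R^2 is strongly negative (predictions worse than the mean).

def r_squared(obs, exp):
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - e) ** 2 for o, e in zip(obs, exp))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - sse / sst

def corr_squared(x, y):
    """Squared Pearson correlation, using the sum formula above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = (sxy - sx * sy / n) ** 2
    den = (sxx - sx * sx / n) * (syy - sy * sy / n)
    return num / den

pred = [1.0, 2.0, 3.0, 4.0, 5.0]
obs = [6.5, 8.0, 9.5, 11.0, 12.5]   # exactly 1.5*pred + 5
```

Here `corr_squared(pred, obs)` is 1.0 while `r_squared(obs, pred)` is negative: r2 rewards the linear relationship, R2 punishes the bias.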
A note about notation...
Check the documentation when a package reports "R2" or "r2". Don't assume they will be used as I have used them... Sample Excel output using the "trendline" option for a chart:

[Figure: Excel chart of observed vs. predicted with trendline y = 0.9603x + 5.3438, R2 = 0.8867]

The "R2" value of 0.89 reported by Excel is actually r2 (while R2 is actually 0.21).

(If you specify no intercept, Excel reports true R2...)
R2 vs. r2 for goodness of fit
When there is no bias, the two measures will be almost identical (but I prefer R2, in principle).
When there is bias, R2 will be low to negative, but r2 will indicate how good the fit could be after taking the bias into account...
Sensitivity of R2 and r2 to data range
Three observed vs. predicted panels, fit to different subsets of the same data:

- Entire range: y = 1.0035x, R2 = 0.8854
- Range = 0-10: y = 1.4397x, R2 = 0.448
- Range = 30-40: y = 1.01x, R2 = -0.125
The Tyranny of R2 (and r2)
Limitations of R2 (and r2) as a measure of goodness of fit...

- Not an absolute measure (as frequently assumed),
- particularly when the variance of the appropriate PDF is NOT independent of the mean (expected) value,
- i.e. lognormal, gamma, Poisson, ...

[Figure: Poisson PDF, Prob(x) vs. x, for m = 2.5, 5, and 10]
Gamma Distributed Data...
$$f(y_i \mid \alpha, \lambda_i) = \frac{1}{\Gamma(\alpha)\,\lambda_i^{\alpha}}\; y_i^{\alpha-1}\, e^{-y_i/\lambda_i}$$

$$E(y_i) = \alpha\lambda_i = \mu_i \ (\text{expected value of } y_i), \qquad Var(y_i) = \alpha\lambda_i^2 = \mu_i^2/\alpha$$

The variance of the gamma increases as the square of the mean!...

[Figure: Gamma PDF (α = 1), p(x) vs. x, for μ = 1, 5, and 10]
So, how good is good?
Our assessment is ALWAYS subjective, because of
- Complexity of the process being studied
- Sources of noise in the data

From a likelihood perspective, should you ever expect R2 = 1?
Other Goodness of Fit Issues...
In complex models, a good fit may be due to the overwhelming effect of one variable...
The best-fitting model may not be the most “general”
- i.e. the fit can be improved by adding terms that account for unique variability in a specific dataset, but that limit applicability to other datasets. (The curse of ad hoc multiple regression models...)
How good is good: deviance
Comparison of your model to a "full" model, given the probability model.

For i = 1..n observations, a vector X of observed data (x_i), and a vector of j = 1..m parameters (θ_j), define a "full" model with n parameters θ_i = x_i (full). Then:

$$\text{Deviance (D)} = 2\,[\,ll(\theta_{full} \mid X) - ll(\theta \mid X)\,]$$

$$\text{Log-likelihood } (ll) = \ln L(\theta \mid X) = \sum_{i=1}^{n} \ln g(x_i \mid \theta)$$

Nelder and Wedderburn (1972)
Deviance for normally-distributed data
$$ll(\theta_{full} \mid X) = -\frac{n}{2}\log(2\pi\sigma^2)$$

Log-likelihood of the full model is a function of both sample size (n) and variance (σ2).

Therefore - deviance is NOT an absolute measure of goodness of fit...

But, it does establish a standard of comparison (the full model), given your sample size and your estimate of the underlying variance...

[Figure: normal PDF, Prob(x) vs. x]
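For normal data the deviance reduces to SSE/σ2, since the full model's residuals are zero. A minimal sketch with made-up numbers (helper names are mine):

```python
# Sketch: deviance of a normal model relative to the "full" model.
# ll_full = -(n/2) * log(2*pi*sigma^2), because the full model predicts
# each observation exactly, so its residual term vanishes.
import math

def normal_ll(obs, exp, sigma2):
    n = len(obs)
    sse = sum((o - e) ** 2 for o, e in zip(obs, exp))
    return -n / 2 * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)

def deviance(obs, exp, sigma2):
    n = len(obs)
    ll_full = -n / 2 * math.log(2 * math.pi * sigma2)
    return 2 * (ll_full - normal_ll(obs, exp, sigma2))

obs = [1.0, 2.0, 3.0]
exp = [1.1, 1.8, 3.2]
d = deviance(obs, exp, sigma2=0.25)   # = SSE / sigma^2 for normal data
```

Note how the common ll_full term cancels, leaving SSE/σ2: the deviance depends on your variance estimate, which is why it is a relative, not absolute, measure.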
Forms of Bias
$$\text{slope} = \frac{\sum_{i=1}^{N} \mathrm{obs}_i \cdot \mathrm{exp}_i}{\sum_{i=1}^{N} (\mathrm{exp}_i)^2}$$
Two observed vs. predicted panels, each with the 1:1 line:

- Systematic bias (intercept not = 0): y = 1.0508x + 8.8405
- Proportional bias (slope not = 1): y = 1.48x + 0.9515
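Both forms of bias can be diagnosed by regressing observed on predicted and inspecting the slope and intercept. A minimal sketch (made-up data; `fit_line` is my helper name):

```python
# Sketch: ordinary least-squares fit of observed on predicted.
# slope != 1 suggests proportional bias; intercept != 0, systematic bias.

def fit_line(pred, obs):
    """Return (slope, intercept) of the least-squares line obs ~ pred."""
    n = len(pred)
    mx = sum(pred) / n
    my = sum(obs) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(pred, obs))
    sxx = sum((x - mx) ** 2 for x in pred)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

pred = [2.0, 4.0, 6.0, 8.0]
obs = [3.0, 6.0, 9.0, 12.0]   # obs = 1.5 * pred: pure proportional bias
slope, intercept = fit_line(pred, obs)
```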
“Learn from your mistakes”(Examine your residuals...)
Residual = observed – predicted
Basic questions to ask of your residuals:
- Do they fit the PDF?
- Are they correlated with factors that aren’t in the model (but maybe should be?)
- Do some subsets of your data fit better than others?
Using Residuals to Calculate Prediction Error
RMSE (root mean squared error), i.e. the standard deviation of the residuals:

$$RMSE = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(\mathrm{obs}_i - \mathrm{exp}_i)^2}$$
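A one-function sketch of the formula, with made-up observations (`rmse` is my helper name):

```python
# Sketch: RMSE from residuals, dividing by n - 1 as in the formula above.
import math

def rmse(obs, exp):
    """Standard deviation of the residuals obs - exp."""
    n = len(obs)
    sse = sum((o - e) ** 2 for o, e in zip(obs, exp))
    return math.sqrt(sse / (n - 1))

residual_scale = rmse([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
```

Unlike R2, RMSE is in the units of the data, so it answers "how far off are my predictions, typically?" directly.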
Predicting lake chemistry from spatially-explicit watershed data
At steady state:

$$\text{concentration} = \frac{\text{input}}{\text{lake volume} \times (\text{flushing rate} + \text{in-lake decay})}$$

Where concentration, lake volume, and flushing rate are observed, and input and in-lake decay are estimated:

$$\text{input}_i = \sum_{j=1}^{P} E_{C_j}\, e^{-\alpha\, dist_{ij}}, \qquad \text{in-lake decay} = k$$
Predicting iron concentrations in Adirondack lakes
Adirondack Lake Iron Concentrations: y = 1.0042x, R2 = 0.563

[Figures: observed vs. predicted iron concentration; normal probability plot of the residuals; residuals (obs - pred) vs. predicted]

Results from a spatially-explicit, mass-balance model of the effects of watershed composition on lake chemistry
Source: Maranger et al. (2006)
Should we incorporate lake depth?
[Figure: residuals (obs - pred) vs. lake depth (m)]

- Shallow lakes are more unpredictable than deeper lakes
- The model consistently underestimates Fe concentrations in deeper lakes
Adding lake depth improves the model...
Model with depth term included: y = 1.0082x, R2 = 0.6533

[Figures: observed vs. predicted; normal probability plot of the residuals]

R2 went from 56% to 65%
It is just as important that it made sense to add depth...
But shallow lakes are still a problem...
Model with depth added:

[Figure: residuals (obs - pred) vs. predicted]
Summary – Model Evaluation
There are no silver bullets...
The issues are even muddier for categorical data...
An increase in goodness of fit does not necessarily result in an increase in knowledge...

- Increasing goodness of fit reduces uncertainty in the predictions of the models, but this costs money (more and better data). How much are you willing to spend?
- The "signal to noise" issue: if you can see the signal through the noise, how far are you willing to go to reduce the noise?