Day 7
Model Evaluation
Elements of Model Evaluation

- Goodness of fit
- Prediction error
- Bias
- Outliers and patterns in residuals
Assessing Goodness of Fit for Continuous Data
Visual methods - Don't underestimate the power of your eyes, but eyes can deceive, too...

Quantification - A variety of traditional measures, all with some limitations...

A good review: C. D. Schunn and D. Wallach. Evaluating Goodness-of-Fit in Comparison of Models to Data. Source: http://www.lrdc.pitt.edu/schunn/gof/GOF.doc
Traditional inferential tests masquerading as GOF measures
The χ2 "goodness of fit" statistic - For categorical data only, this can be used as a test statistic: "What is the probability that the 'model' is true, given the observed results?"

$$\chi^2 = \sum_i \frac{(o_i - e_i)^2}{e_i}$$

- The test can only be used to reject a model. If the model is accepted, the statistic contains no information on how good the fit is...
- Thus, this is really a badness-of-fit statistic.
- Other limitations as a measure of goodness of fit:
  » Rewards sloppy research if you are actually trying to "test" (as a null hypothesis) a real model, because small sample size and noisy data will limit power to reject the null hypothesis.
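The χ2 statistic above is simple to compute directly. A minimal sketch with made-up category counts (the `chi_square` helper name is mine, not from the lecture):

```python
# Sketch: the chi-square "badness-of-fit" statistic, sum of (o - e)^2 / e
# over categories. Counts below are hypothetical.

def chi_square(observed, expected):
    """Return the chi-square statistic for paired category counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 55, 27]
expected = [20, 50, 30]   # counts predicted by the model
stat = chi_square(observed, expected)
```

Note that a small `stat` with few, noisy observations says little about fit quality; it may simply reflect low power to reject.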
Visual evaluation for continuous data
Graphing observed vs. predicted...
obs = 0.62 + 0.95(pred)
[Figure: observed vs. predicted scatter, with the fitted line and the 1:1 line for reference]
Source: Canham, C. D., P. T. LePage, and K. D. Coates. 2004. A neighborhood analysis of canopy tree competition: effects of shading versus crowding. Canadian Journal of Forest Research.
Examples

Goodness of fit of neighborhood models of canopy tree growth for 2 species at Date Creek, BC
Red Cedar: y = 1.0022x, R2 = 0.5963
Hemlock: y = 1.0001x, R2 = 0.3402

[Figure: observed vs. predicted growth for each species]
Goodness of Fit vs. Bias

Four observed vs. predicted panels, each with the 1:1 line for reference:

- Good fit, biased underestimate: y = 1.5146x, R2 = 0.9755
- Good fit, no bias: y = 1.0013x, R2 = 0.937
- Poor fit, no bias: y = 1.0167x, R2 = 0.6672
- Poor fit, biased overestimate: y = 0.4508x + 5.2182, R2 = 0.4787
R2 as a measure of goodness of fit
R2 = proportion of variance* explained by the model (relative to that explained by the simple mean of the data):

$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{N}(\mathrm{obs}_i - \mathrm{exp}_i)^2}{\sum_{i=1}^{N}(\mathrm{obs}_i - \overline{\mathrm{obs}})^2}$$

Where exp_i is the expected value of observation i given the model, and obs-bar is the overall mean of the observations. (Note: R2 is NOT bounded between 0 and 1.)
* this interpretation of R2 is technically only valid for data where SSE is an appropriate estimate of variance (e.g. normal data)
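A minimal sketch of the definition above, with made-up data, showing that R2 goes negative whenever the model does worse than the mean (the `r_squared` helper name is mine):

```python
# Sketch: R^2 = 1 - SSE/SST, computed directly from the definition.

def r_squared(obs, exp):
    """1 - SSE/SST; negative when the model is worse than the mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - e) ** 2 for o, e in zip(obs, exp))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - sse / sst

obs = [2.0, 4.0, 6.0, 8.0]
good = r_squared(obs, [2.1, 3.9, 6.2, 7.8])   # close predictions
bad = r_squared(obs, [8.0, 6.0, 4.0, 2.0])    # worse than using the mean
```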
R2 – when is the mean the mean?
Clark et al. (1998) Ecological Monographs 68:220
$$R^2 = 1 - \frac{\sum_{j=1}^{S}\sum_{i=1}^{N}(\mathrm{obs}_{ij} - \mathrm{exp}_{ij})^2}{\sum_{j=1}^{S}\sum_{i=1}^{N}(\mathrm{obs}_{ij} - \overline{\mathrm{obs}}_j)^2}$$

For i = 1..N observations in j = 1..S sites - uses the SITE means, rather than the overall mean, to calculate R2.
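A sketch of the site-mean variant, assuming a mapping from site to (obs, exp) pairs (the data and `site_r_squared` name are mine, for illustration):

```python
# Sketch: Clark et al.-style R^2 where SST uses each site's own mean
# rather than the grand mean of all observations.

def site_r_squared(sites):
    """sites: dict mapping site id -> list of (obs, exp) pairs."""
    sse = sst = 0.0
    for pairs in sites.values():
        obs = [o for o, _ in pairs]
        site_mean = sum(obs) / len(obs)
        sse += sum((o - e) ** 2 for o, e in pairs)
        sst += sum((o - site_mean) ** 2 for o in obs)
    return 1 - sse / sst

sites = {
    "A": [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1)],
    "B": [(10.0, 9.8), (11.0, 11.3), (12.0, 12.1)],
}
r2_site = site_r_squared(sites)
```

When sites differ strongly in their means, the grand-mean SST is inflated, so this version asks how well the model explains variation *within* sites.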
r2 as a measure of goodness of fit
$$r^2 = \frac{\left(\sum xy - \frac{\sum x \sum y}{n}\right)^2}{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}$$
r2 = squared correlation (r) between observed (x) and predicted (y)
[Figure: two observed vs. predicted panels, each with the 1:1 line. Left: R2 = -59.8, r2 = 0.699. Right: R2 = 0.905, r2 = 0.912.]

NOTE: r2 is bounded between 0 and 1 (r between -1 and 1), unlike R2.
R2 vs r2
[Figure: observed vs. predicted with fitted line y = 1.5003x + 5.3709 and the 1:1 line. r2 = 0.81, R2 = -0.39.]

Is this a good fit (r2 = 0.81) or a really lousy fit (R2 = -0.39)? (It's undoubtedly biased...)
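The divergence between the two measures can be demonstrated with a few lines. A minimal sketch (made-up data following obs = 1.5·pred + 5, mimicking a biased fit; the helper names are mine):

```python
# Sketch: a biased model can have a perfect r^2 (tight linear relation)
# while R^2 is strongly negative (predictions worse than the mean).

def r_squared(obs, exp):
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - e) ** 2 for o, e in zip(obs, exp))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - sse / sst

def corr_squared(x, y):
    """Squared Pearson correlation, using the sum formula above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = (sxy - sx * sy / n) ** 2
    den = (sxx - sx * sx / n) * (syy - sy * sy / n)
    return num / den

pred = [1.0, 2.0, 3.0, 4.0, 5.0]
obs = [6.5, 8.0, 9.5, 11.0, 12.5]   # exactly 1.5*pred + 5
```

Here `corr_squared(pred, obs)` is 1.0 while `r_squared(obs, pred)` is negative: r2 rewards the linear relationship, R2 punishes the bias.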
A note about notation...
Check the documentation when a package reports "R2" or "r2". Don't assume they will be used as I have used them... Sample Excel output using the "trendline" option for a chart:

[Figure: Excel chart of observed vs. predicted with trendline y = 0.9603x + 5.3438, R2 = 0.8867]

The "R2" value of 0.89 reported by Excel is actually r2 (while R2 is actually 0.21).

(If you specify no intercept, Excel reports true R2...)
R2 vs. r2 for goodness of fit
When there is no bias, the two measures will be almost identical (but I prefer R2, in principle).
When there is bias, R2 will be low to negative, but r2 will indicate how good the fit could be after taking the bias into account...
Sensitivity of R2 and r2 to data range
Three observed vs. predicted panels, fit to different subsets of the same data:

- Entire range: y = 1.0035x, R2 = 0.8854
- Range = 0-10: y = 1.4397x, R2 = 0.448
- Range = 30-40: y = 1.01x, R2 = -0.125
The Tyranny of R2 (and r2)
Limitations of R2 (and r2) as a measure of goodness of fit...

- Not an absolute measure (as frequently assumed),
- particularly when the variance of the appropriate PDF is NOT independent of the mean (expected) value,
- i.e. lognormal, gamma, Poisson, ...

[Figure: Poisson PDF, Prob(x) vs. x, for m = 2.5, 5, and 10]
Gamma Distributed Data...
$$f(y_i \mid \alpha, \lambda_i) = \frac{1}{\Gamma(\alpha)\,\lambda_i^{\alpha}}\; y_i^{\alpha-1}\, e^{-y_i/\lambda_i}$$

$$E(y_i) = \alpha\lambda_i = \mu_i \ (\text{expected value of } y_i), \qquad Var(y_i) = \alpha\lambda_i^2 = \mu_i^2/\alpha$$

The variance of the gamma increases as the square of the mean!...

[Figure: Gamma PDF (α = 1), p(x) vs. x, for μ = 1, 5, and 10]
So, how good is good?
Our assessment is ALWAYS subjective, because of
- Complexity of the process being studied
- Sources of noise in the data

From a likelihood perspective, should you ever expect R2 = 1?
Other Goodness of Fit Issues...
In complex models, a good fit may be due to the overwhelming effect of one variable...
The best-fitting model may not be the most “general”
- i.e. the fit can be improved by adding terms that account for unique variability in a specific dataset, but that limit applicability to other datasets. (The curse of ad hoc multiple regression models...)
How good is good: deviance
Comparison of your model to a "full" model, given the probability model.

For i = 1..n observations, a vector X of observed data (x_i), and a vector of j = 1..m parameters (θ_j), define a "full" model with n parameters θ_i = x_i (full). Then:

$$\text{Deviance (D)} = 2\,[\,ll(\theta_{full} \mid X) - ll(\theta \mid X)\,]$$

$$\text{Log-likelihood } (ll) = \ln L(\theta \mid X) = \sum_{i=1}^{n} \ln g(x_i \mid \theta)$$

Nelder and Wedderburn (1972)
Deviance for normally-distributed data
$$ll(\theta_{full} \mid X) = -\frac{n}{2}\log(2\pi\sigma^2)$$

Log-likelihood of the full model is a function of both sample size (n) and variance (σ2).

Therefore - deviance is NOT an absolute measure of goodness of fit...

But, it does establish a standard of comparison (the full model), given your sample size and your estimate of the underlying variance...

[Figure: normal PDF, Prob(x) vs. x]
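For normal data the deviance reduces to SSE/σ2, since the full model's residuals are zero. A minimal sketch with made-up numbers (helper names are mine):

```python
# Sketch: deviance of a normal model relative to the "full" model.
# ll_full = -(n/2) * log(2*pi*sigma^2), because the full model predicts
# each observation exactly, so its residual term vanishes.
import math

def normal_ll(obs, exp, sigma2):
    n = len(obs)
    sse = sum((o - e) ** 2 for o, e in zip(obs, exp))
    return -n / 2 * math.log(2 * math.pi * sigma2) - sse / (2 * sigma2)

def deviance(obs, exp, sigma2):
    n = len(obs)
    ll_full = -n / 2 * math.log(2 * math.pi * sigma2)
    return 2 * (ll_full - normal_ll(obs, exp, sigma2))

obs = [1.0, 2.0, 3.0]
exp = [1.1, 1.8, 3.2]
d = deviance(obs, exp, sigma2=0.25)   # = SSE / sigma^2 for normal data
```

Note how the common ll_full term cancels, leaving SSE/σ2: the deviance depends on your variance estimate, which is why it is a relative, not absolute, measure.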
Forms of Bias
$$\text{slope} = \frac{\sum_{i=1}^{N} \mathrm{obs}_i \cdot \mathrm{exp}_i}{\sum_{i=1}^{N} (\mathrm{exp}_i)^2}$$
Two observed vs. predicted panels, each with the 1:1 line:

- Systematic bias (intercept not = 0): y = 1.0508x + 8.8405
- Proportional bias (slope not = 1): y = 1.48x + 0.9515
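Both forms of bias can be diagnosed by regressing observed on predicted and inspecting the slope and intercept. A minimal sketch (made-up data; `fit_line` is my helper name):

```python
# Sketch: ordinary least-squares fit of observed on predicted.
# slope != 1 suggests proportional bias; intercept != 0, systematic bias.

def fit_line(pred, obs):
    """Return (slope, intercept) of the least-squares line obs ~ pred."""
    n = len(pred)
    mx = sum(pred) / n
    my = sum(obs) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(pred, obs))
    sxx = sum((x - mx) ** 2 for x in pred)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

pred = [2.0, 4.0, 6.0, 8.0]
obs = [3.0, 6.0, 9.0, 12.0]   # obs = 1.5 * pred: pure proportional bias
slope, intercept = fit_line(pred, obs)
```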
“Learn from your mistakes”(Examine your residuals...)
Residual = observed – predicted
Basic questions to ask of your residuals:
- Do they fit the PDF?
- Are they correlated with factors that aren’t in the model (but maybe should be?)
- Do some subsets of your data fit better than others?
Using Residuals to Calculate Prediction Error
RMSE (root mean squared error), i.e. the standard deviation of the residuals:

$$RMSE = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(\mathrm{obs}_i - \mathrm{exp}_i)^2}$$
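A one-function sketch of the formula, with made-up observations (`rmse` is my helper name):

```python
# Sketch: RMSE from residuals, dividing by n - 1 as in the formula above.
import math

def rmse(obs, exp):
    """Standard deviation of the residuals obs - exp."""
    n = len(obs)
    sse = sum((o - e) ** 2 for o, e in zip(obs, exp))
    return math.sqrt(sse / (n - 1))

residual_scale = rmse([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
```

Unlike R2, RMSE is in the units of the data, so it answers "how far off are my predictions, typically?" directly.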
Predicting lake chemistry from spatially-explicit watershed data
At steady state:

$$\text{concentration} = \frac{\text{input}}{\text{lake volume} \times (\text{flushing rate} + \text{in-lake decay})}$$

Where concentration, lake volume, and flushing rate are observed, and input and in-lake decay are estimated:

$$\text{input}_i = \sum_{j=1}^{P} E_{C_j}\, e^{-\alpha\, dist_{ij}}, \qquad \text{in-lake decay} = k$$
Predicting iron concentrations in Adirondack lakes
Adirondack Lake Iron Concentrations: y = 1.0042x, R2 = 0.563

[Figures: observed vs. predicted iron concentration; normal probability plot of the residuals; residuals (obs - pred) vs. predicted]

Results from a spatially-explicit, mass-balance model of the effects of watershed composition on lake chemistry
Source: Maranger et al. (2006)
Should we incorporate lake depth?
[Figure: residuals (obs - pred) vs. lake depth (m)]

- Shallow lakes are more unpredictable than deeper lakes
- The model consistently underestimates Fe concentrations in deeper lakes
Adding lake depth improves the model...
Model with depth term included: y = 1.0082x, R2 = 0.6533

[Figures: observed vs. predicted; normal probability plot of the residuals]

R2 went from 56% to 65%
It is just as important that it made sense to add depth...
But shallow lakes are still a problem...
Model with depth added:

[Figure: residuals (obs - pred) vs. predicted]
Summary – Model Evaluation
There are no silver bullets...
The issues are even muddier for categorical data...
An increase in goodness of fit does not necessarily result in an increase in knowledge...

- Increasing goodness of fit reduces uncertainty in the predictions of the models, but this costs money (more and better data). How much are you willing to spend?
- The "signal to noise" issue: if you can see the signal through the noise, how far are you willing to go to reduce the noise?