Outliers and influential data points

No outliers?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

An outlier? Influential?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]


[Figure: scatterplot of y versus x (x axis 0 to 14, y axis 0 to 70) with two fitted lines: y = 1.73 + 5.12 x and y = 2.96 + 5.04 x]


[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]


[Figure: scatterplot of y versus x (x axis 0 to 14, y axis 0 to 70) with two fitted lines: y = 1.73 + 5.12 x and y = 2.47 + 4.93 x]


[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]


[Figure: scatterplot of y versus x (x axis 0 to 14, y axis 0 to 70) with two fitted lines: y = 1.73 + 5.12 x and y = 8.51 + 3.32 x]

Impact on regression analyses

• Not every outlier strongly influences the estimated regression function.

• Always determine whether the estimated regression function is unduly influenced by one or a few cases.

• Simple plots for simple linear regression.

• Summary measures for multiple linear regression.

The hat matrix H

The regression model: $Y = X\beta + \varepsilon$, with $E(Y) = X\beta$

Least squares estimates: $b = (X'X)^{-1}X'y$

Fitted values: $\hat{y} = Xb = X(X'X)^{-1}X'y = Hy$, where $H = X(X'X)^{-1}X'$

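Example with n = 4 cases and two predictors:

$$y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} 8 \\ 15 \\ 10 \\ 7 \end{pmatrix}, \qquad X = \begin{pmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \end{pmatrix} = \begin{pmatrix} 1 & 4.2 & 4 \\ 1 & 6.5 & 6.5 \\ 1 & 3 & 3.5 \\ 1 & 3 & 2.8 \end{pmatrix}$$

$$H = X(X'X)^{-1}X' = \begin{pmatrix} 0.411 & 0.202 & -0.058 & 0.444 \\ 0.202 & 0.931 & 0.020 & -0.152 \\ -0.058 & 0.020 & 0.994 & 0.044 \\ 0.444 & -0.152 & 0.044 & 0.664 \end{pmatrix}$$

$$\hat{y} = Hy = \begin{pmatrix} 0.411 & 0.202 & -0.058 & 0.444 \\ 0.202 & 0.931 & 0.020 & -0.152 \\ -0.058 & 0.020 & 0.994 & 0.044 \\ 0.444 & -0.152 & 0.044 & 0.664 \end{pmatrix} \begin{pmatrix} 8 \\ 15 \\ 10 \\ 7 \end{pmatrix} = \begin{pmatrix} 8.85 \\ 14.71 \\ 10.08 \\ 6.36 \end{pmatrix}$$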
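The hat matrix calculation for the four-case example can be sketched in a few lines of NumPy. This is only an illustration: it assumes the design matrix has an intercept column followed by the two predictors, and the helper name `hat_matrix` is made up here.

```python
import numpy as np

# Design matrix and response as read from the four-case example
# (assumed layout: a column of ones for the intercept, then the two predictors).
X = np.array([
    [1.0, 4.2, 4.0],
    [1.0, 6.5, 6.5],
    [1.0, 3.0, 3.5],
    [1.0, 3.0, 2.8],
])
y = np.array([8.0, 15.0, 10.0, 7.0])


def hat_matrix(X):
    """Return H = X (X'X)^{-1} X' for a design matrix X with full column rank."""
    # Solve (X'X) Z = X' instead of forming the inverse explicitly.
    return X @ np.linalg.solve(X.T @ X, X.T)


H = hat_matrix(X)
b = np.linalg.solve(X.T @ X, X.T @ y)  # least squares estimates b = (X'X)^{-1} X'y
y_hat = H @ y                          # fitted values, y-hat = H y (same as X b)

print(np.round(H, 3))
print(np.round(y_hat, 2))
```

Each fitted value is a weighted sum of all four observed responses, and the weights are the entries in the corresponding row of H.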

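$$H = \begin{pmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{21} & h_{22} & h_{23} & h_{24} \\ h_{31} & h_{32} & h_{33} & h_{34} \\ h_{41} & h_{42} & h_{43} & h_{44} \end{pmatrix}$$

$$\hat{y} = Hy = \begin{pmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{21} & h_{22} & h_{23} & h_{24} \\ h_{31} & h_{32} & h_{33} & h_{34} \\ h_{41} & h_{42} & h_{43} & h_{44} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} h_{11}y_1 + h_{12}y_2 + h_{13}y_3 + h_{14}y_4 \\ h_{21}y_1 + h_{22}y_2 + h_{23}y_3 + h_{24}y_4 \\ h_{31}y_1 + h_{32}y_2 + h_{33}y_3 + h_{34}y_4 \\ h_{41}y_1 + h_{42}y_2 + h_{43}y_3 + h_{44}y_4 \end{pmatrix}$$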

Identifying outlying Y values

• Residuals

• Standardized residuals
  – also called internally studentized residuals

• Deleted residuals

• Deleted t residuals
  – also called studentized deleted residuals
  – also called externally studentized residuals

Residuals

Ordinary residuals defined for each observation, i = 1, …, n:

$$e_i = y_i - \hat{y}_i$$

Using matrix notation:

$$e = y - \hat{y} = y - X(X'X)^{-1}X'y = y - Hy = (I - H)y$$

Variance of the residuals

Residual vector: $e = y - Hy = (I - H)y$

Variance matrix: $\mathrm{Var}(e) = \sigma^2 (I - H)$

Variance of the ith residual: $\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$

Estimated variance of the ith residual: $s^2(e_i) = MSE\,(1 - h_{ii})$

Standardized residuals

Standardized residuals defined for each observation, i = 1, …, n:

$$e_i^* = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$$

Standardized residuals quantify how large the residuals are in standard deviation units.

Standardized residuals larger than 2 or smaller than -2 suggest that the y values are unusual.

An outlying y value?

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

      x         y     FITS1       HI1     s(e)     RESI1     SRES1
0.10000   -0.0716    3.4614  0.176297  4.27561   -3.5330  -0.82635
0.45401    4.1673    5.2446  0.157454  4.32424   -1.0774  -0.24916
1.09765    6.5703    8.4869  0.127014  4.40166   -1.9166  -0.43544
1.27936   13.8150    9.4022  0.119313  4.42103    4.4128   0.99818
2.20611   11.4501   14.0706  0.086145  4.50352   -2.6205  -0.58191
...
8.70156   46.5475   46.7904  0.140453  4.36765   -0.2429  -0.05561
9.16463   45.7762   49.1230  0.163492  4.30872   -3.3468  -0.77679
4.00000   40.0000   23.1070  0.050974  4.58936   16.8930   3.68110

S = 4.711

Unusual Observations

Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  4.00  40.00  23.11    1.06     16.89     3.68R

R denotes an observation with a large standardized residual.
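As a rough check on the flagged case, the standardized residual for observation 21 can be recomputed from the reported residual, leverage and S. The sketch below assumes MSE = S² and uses a made-up helper name, `standardized_residual`. Because S is rounded to three decimals, the result should agree with the s(e) and SRES1 columns only to about three significant figures.

```python
import math


def standardized_residual(e_i, h_ii, mse):
    """Standardized residual e_i* = e_i / sqrt(MSE * (1 - h_ii))."""
    return e_i / math.sqrt(mse * (1.0 - h_ii))


# Values for observation 21 (x = 4, y = 40) taken from the output above.
S = 4.711        # reported root mean squared error; MSE is assumed to be S**2
e_21 = 16.8930   # ordinary residual (RESI1)
h_21 = 0.050974  # leverage (HI1)

s_e21 = math.sqrt(S ** 2 * (1.0 - h_21))  # estimated standard deviation of e_21
sres_21 = standardized_residual(e_21, h_21, S ** 2)

print(f"s(e) = {s_e21:.3f}, standardized residual = {sres_21:.2f}")
print("flagged as unusual:", abs(sres_21) > 2)
```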

Deleted residuals

If the observed yi is extreme, it may “pull” the fitted equation toward itself, thereby yielding a small ordinary residual.

Delete the ith case, estimate the regression function using the remaining n-1 cases, and use the ith case's x values to predict its response.

Deleted residual: $d_i = y_i - \hat{y}_{i(i)}$, where $\hat{y}_{i(i)}$ is the predicted response for the ith case from the fit that excludes it.

Deleted t residuals

A deleted t residual is just a standardized deleted residual:

$$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}$$

The deleted t residuals follow a t distribution with (n-1)-p degrees of freedom.

[Figure: scatterplot of y versus x (x axis 0 to 10, y axis 0 to 15) with two fitted lines: y = 0.6 + 1.55 x and y = 3.82 - 0.13 x]

 x    y   RESI1     TRES1
 1  2.1   -1.59   -1.7431
 2  3.8    0.24    0.1217
 3  5.2    1.77    1.6361
10  2.1   -0.42  -19.7990
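The delete-and-refit procedure can be sketched directly for this four-case data set. The code below is illustrative only. It assumes simple linear regression (p = 2), refits the line n times rather than using a shortcut formula, and relies on the standard identity s²(d_i) = MSE_(i) / (1 - h_ii). The helper names `fit_line` and `deleted_residuals` are made up here.

```python
import math


def fit_line(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def deleted_residuals(x, y):
    """Return (d, t): deleted residuals and deleted t residuals, by refitting."""
    n, p = len(x), 2  # p = 2 parameters: intercept and slope
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    d, t = [], []
    for i in range(n):
        # Drop case i and refit on the remaining n - 1 cases.
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        b0, b1 = fit_line(x_rest, y_rest)
        d_i = y[i] - (b0 + b1 * x[i])  # deleted residual
        sse_i = sum((yj - (b0 + b1 * xj)) ** 2 for xj, yj in zip(x_rest, y_rest))
        mse_i = sse_i / ((n - 1) - p)  # MSE with case i removed
        h_ii = 1.0 / n + (x[i] - x_bar) ** 2 / sxx  # leverage in the full data
        t_i = d_i / math.sqrt(mse_i / (1.0 - h_ii))  # t_i = d_i / s(d_i)
        d.append(d_i)
        t.append(t_i)
    return d, t


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
d, t = deleted_residuals(x, y)
for xi, di, ti in zip(x, d, t):
    print(f"x = {xi:4.1f}  deleted residual = {di:8.3f}  deleted t = {ti:9.4f}")
```

The fourth case has a small ordinary residual (-0.42) because it pulls the fitted line toward itself. Its deleted residual and deleted t residual are much larger in magnitude.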

[Figure: scatterplot of y versus x (x axis 0 to 14, y axis 0 to 70) with two fitted lines: y = 1.73 + 5.12 x and y = 2.96 + 5.04 x]

Row        x         y     RESI1     SRES1     TRES1
  1  0.10000   -0.0716   -3.5330  -0.82635  -0.81916
  2  0.45401    4.1673   -1.0774  -0.24916  -0.24291
  3  1.09765    6.5703   -1.9166  -0.43544  -0.42596
...
 19  8.70156   46.5475   -0.2429  -0.05561  -0.05413
 20  9.16463   45.7762   -3.3468  -0.77679  -0.76837
 21  4.00000   40.0000   16.8930   3.68110   6.69012

Identifying outlying X values

• Use the diagonal elements, hii, of the hat matrix H to identify outlying X values.

• The hii are called leverages.

Properties of the leverages (hii)

• The hii is a measure of the distance between the X values for the ith case and the means of the X values for all n cases.

• The hii is a number between 0 and 1, inclusive.

• The sum of the hii equals p, the number of parameters.
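These three properties are easy to see in simple linear regression, where the leverage has the closed form h_ii = 1/n + (x_i - x̄)² / Σ(x_j - x̄)². The sketch below uses the x values from the four-case deleted t residual example. The closed form is a standard result rather than something stated on the slides, and the function name `leverages` is made up here.

```python
def leverages(x):
    """Leverages for simple linear regression: h_ii = 1/n + (x_i - x_bar)^2 / Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# x values from the four-case example: three clustered cases and one far away.
x = [1.0, 2.0, 3.0, 10.0]
h = leverages(x)

for xi, hi in zip(x, h):
    print(f"x = {xi:4.1f}  h_ii = {hi:.2f}")

# With an intercept and a slope, p = 2, so the leverages should sum to 2.
print("sum of leverages:", round(sum(h), 6))
```

The case at x = 10 lies farthest from the mean of x and so has by far the largest leverage.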

[Figure: dotplot for x, axis 0 to 9; sample mean = 4.751; annotated leverages h(1,1) = 0.176, h(11,11) = 0.048, h(20,20) = 0.163]

HI1
0.176297  0.157454  0.127014  0.119313  0.086145  0.077744  0.065028
0.061276  0.048147  0.049628  0.049313  0.051829  0.055760  0.069311
0.072580  0.109616  0.127489  0.141136  0.140453  0.163492  0.050974

Sum of HI1 = 2.0000

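$$\hat{y} = Hy = \begin{pmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{21} & h_{22} & h_{23} & h_{24} \\ h_{31} & h_{32} & h_{33} & h_{34} \\ h_{41} & h_{42} & h_{43} & h_{44} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} h_{11}y_1 + h_{12}y_2 + h_{13}y_3 + h_{14}y_4 \\ h_{21}y_1 + h_{22}y_2 + h_{23}y_3 + h_{24}y_4 \\ h_{31}y_1 + h_{32}y_2 + h_{33}y_3 + h_{34}y_4 \\ h_{41}y_1 + h_{42}y_2 + h_{43}y_3 + h_{44}y_4 \end{pmatrix}$$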

Properties of the leverages (hii)

If the ith case is outlying in terms of its X values, it has a large leverage value hii, and therefore exercises substantial leverage in determining the fitted value.

Using leverages to identify outlying X values

Minitab flags any observation whose leverage value, hii, is more than 3 times the mean leverage value …

$$\bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p}{n}$$

… or if it’s greater than 0.99.
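The flagging rule amounts to a one-line comparison. The sketch below applies it to three leverage values reported in this deck for data sets with n = 21 and p = 2. The function name `is_high_leverage` and its `multiple` and `cap` parameters are illustrative names, not Minitab's.

```python
def is_high_leverage(h_ii, p, n, multiple=3.0, cap=0.99):
    """True if h_ii exceeds multiple * (p / n) or exceeds the cap."""
    cutoff = multiple * p / n
    return h_ii > cutoff or h_ii > cap


n, p = 21, 2
print("cutoff 3p/n =", round(3 * p / n, 3))

# Leverages reported for the three added cases: (4, 40), (14, 68) and (13, 15).
for label, h_ii in [("(4, 40)", 0.050974), ("(14, 68)", 0.357535), ("(13, 15)", 0.311532)]:
    print(label, "flagged:", is_high_leverage(h_ii, p, n))
```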

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286$$

Unusual Observations

Obs     x      y     Fit  SE Fit  Residual  St Resid
 21  14.0  68.00  71.449   1.620    -3.449   -1.59 X

X denotes an observation whose X value gives it large influence.

    x      y       HI1
14.00  68.00  0.357535

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286$$

    x      y       HI2
13.00  15.00  0.311532

Unusual Observations

Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  13.0  15.00  51.66    5.83    -36.66   -4.23RX

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Identifying influential cases

Influence

• A case is influential if its exclusion causes major changes in the estimated regression function.

Identifying influential cases

• Difference in fits, DFITS

• Cook’s distance measure

DFITS

The difference in fits …

$$DFITS_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{MSE_{(i)}\,h_{ii}}} = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}$$

… represents the number of standard deviations that the fitted value increases or decreases when the ith case is included.

DFITS

A case is influential if the absolute value of its DFITS value is …

… greater than 1 for small to medium data sets

… greater than $2\sqrt{p/n}$ for large data sets
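Because DFITS can be written as t_i √(h_ii / (1 - h_ii)), it can be computed from a deleted t residual and a leverage without refitting. The sketch below does this for observation 21 (x = 4, y = 40), using the TRES1 and HI1 values reported earlier, and compares the result with both cutoffs. The function names are made up for illustration.

```python
import math


def dfits_from_t(t_i, h_ii):
    """DFITS_i = t_i * sqrt(h_ii / (1 - h_ii))."""
    return t_i * math.sqrt(h_ii / (1.0 - h_ii))


def dfits_cutoffs(p, n):
    """Return the two guideline cutoffs: 1 and 2 * sqrt(p / n)."""
    return 1.0, 2.0 * math.sqrt(p / n)


n, p = 21, 2
t_21 = 6.69012   # deleted t residual (TRES1) for observation 21
h_21 = 0.050974  # leverage (HI1) for observation 21

dfits_21 = dfits_from_t(t_21, h_21)
cutoff_small, cutoff_large = dfits_cutoffs(p, n)

print(f"DFITS = {dfits_21:.3f}")
print(f"cutoffs: {cutoff_small:.2f} and {cutoff_large:.2f}")
print("exceeds both cutoffs:", abs(dfits_21) > max(cutoff_small, cutoff_large))
```

The example shows that a case with low leverage can still have a sizable DFITS when its deleted t residual is very large.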

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$2\sqrt{\frac{p}{n}} = 2\sqrt{\frac{2}{21}} = 0.62$$

    x      y     DFIT1
14.00  68.00  -1.23841

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$2\sqrt{\frac{p}{n}} = 2\sqrt{\frac{2}{21}} = 0.62$$

    x      y     DFIT2
13.00  15.00  -11.4670

Cook’s distance

Cook’s distance measure …

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,MSE}$$

… considers the influence of the ith case on all n fitted values.

Cook’s distance

• Relate Di to the F(p, n-p) distribution.

• If Di is greater than the 50th percentile, F(0.50, p, n-p), then the ith case has lots of influence.
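Cook's distance can be sketched straight from its definition by refitting without each case and summing the squared changes in all n fitted values. The code below does this for the small four-case data set used in the deleted t residual example, then applies the 50th-percentile rule to the two Cook's distances reported in this deck. It assumes simple linear regression (p = 2) and takes F(0.50, 2, 19) = 0.7191 from the slides rather than computing it. The helper names are made up.

```python
def fit_line(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def cooks_distances(x, y):
    """Cook's distance for each case, computed by deleting it and refitting."""
    n, p = len(x), 2  # p = 2 parameters: intercept and slope
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    distances = []
    for i in range(n):
        c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        fitted_without_i = [c0 + c1 * xj for xj in x]  # all n fits, case i excluded
        change = sum((fj - gj) ** 2 for fj, gj in zip(fitted, fitted_without_i))
        distances.append(change / (p * mse))
    return distances


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
D = cooks_distances(x, y)
print([round(d_i, 3) for d_i in D])

# The 50th-percentile rule applied to the two Cook's distances reported in this deck.
F_MEDIAN_2_19 = 0.7191  # F(0.50, 2, 19), value as given on the slides
for label, d_i in [("(14, 68)", 0.701960), ("(13, 15)", 4.04801)]:
    print(label, "exceeds F(0.50, 2, 19):", d_i > F_MEDIAN_2_19)
```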

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$F(0.50, 2, 19) = 0.7191$$

    x      y     COOK1
14.00  68.00  0.701960

[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$F(0.50, 2, 19) = 0.7191$$

    x      y    COOK2
13.00  15.00  4.04801
