Seminar on Robust Regression Methods
TRANSCRIPT
ROBUST REGRESSION METHOD
By Sumon Jose
A Seminar Presentation
Under the Guidance of Dr. Jessy John
February 24, 2015
SUMON JOSE (NIT CALICUT) ROBUST REGRESSION METHOD February 24, 2015 1 / 69
CONTENTS
1 INTRODUCTION
2 REVIEW
3 ROBUSTNESS & RESISTANCE
4 APPROACH
5 STRENGTHS & WEAKNESSES
6 M-ESTIMATORS
7 DELIVERY TIME PROBLEM
8 ANALYSIS
9 PROPERTIES
10 SURVEY OF OTHER ROBUST REGRESSION ESTIMATORS
11 REFERENCE
INTRODUCTION
Performance Evaluation - Geethu Anna Jose
REVIEW
The classical linear regression model relates the dependent or response variable yi to the independent explanatory variables xi1, xi2, ..., xip for i = 1, ..., n, such that

$$y_i = x_i^T \beta + \varepsilon_i, \qquad i = 1, ..., n \qquad (1)$$

where xiT = (xi1, xi2, ..., xip), the εi denote the error terms, and β = (β1, β2, ..., βp)T.
REVIEW
The expected value of yi, called the fitted value, is

$$\hat{y}_i = x_i^T \hat\beta \qquad (2)$$

and one can use this to calculate the residual for the i-th case,

$$r_i = y_i - \hat{y}_i \qquad (3)$$

In the case of the simple linear regression model we may calculate the values of β0 and β1 using the following formulae:
REVIEW
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} y_i x_i - \frac{1}{n}\sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2} \qquad (4)$$

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \qquad (5)$$

The vector of fitted values ŷi corresponding to the observed values yi may be expressed as follows:

$$\hat{y} = X\hat\beta \qquad (6)$$
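The closed-form formulas (4) and (5) can be checked numerically. A minimal sketch (the data here are illustrative, not from the seminar):

```python
import numpy as np

# Fit y = b0 + b1*x using the closed-form least-squares formulas (4), (5).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

# (4): b1 = (sum(y*x) - sum(y)*sum(x)/n) / (sum(x^2) - (sum(x))^2/n)
b1 = (np.sum(y * x) - np.sum(y) * np.sum(x) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
# (5): b0 = mean(y) - b1 * mean(x)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)   # (3), elementwise
```

The result agrees with any standard least-squares routine, e.g. `np.polyfit(x, y, 1)`.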
REVIEW
Limitations of Least Square Estimator
Extremely sensitive to deviations from the model assumptions (a normal distribution is assumed for the errors).
Drastically changed by the effect of outliers.
REVIEW
What About Deleting Outliers Before the Analysis?
Not all outliers are erroneous data; they could be exceptional occurrences.
Some such outliers could be the result of factors not considered in the current study.
So, in general, unusual observations are not always bad observations. Moreover, in large data sets it is often very difficult to spot the outlying data.
ROBUSTNESS AND RESISTANCE
Resistant Regression Estimators
Definition
Resistant regression estimators are primarily concerned with robustness of validity, meaning that their main concern is to prevent unusual observations from affecting the estimates produced.
ROBUSTNESS AND RESISTANCE
Robust Regression Estimators
Definition
They are concerned with both robustness of efficiency and robustness of validity, meaning that they should also maintain a small sampling variance even when the data do not follow the assumed distribution.
ROBUSTNESS AND RESISTANCE
⇒ In general, robust regression estimators aim to fit a model that describes the majority of the sample.
⇒ Their robustness is achieved by giving the data different weights,
⇒ whereas in least squares approximation all data are treated equally.
APPROACH
Robust estimation methods are powerful tools for detecting outliers in complicated data sets.
But unless the data are very well behaved, different estimators will give different estimates.
On their own, they do not provide a final model.
A healthy approach is to employ both robust regression methods and the least squares method, and to compare the results.
STRENGTHS & WEAKNESSES
Finite Sample Breakdown Point
Definition
The breakdown point (BDP) is a measure of the resistance of an estimator. The BDP of a regression estimator is the smallest fraction of contamination that can cause the estimator to 'break down' and no longer represent the trend of the data.
STRENGTHS & WEAKNESSES
When an estimator breaks down, the estimate it produces from the contaminated data can be arbitrarily far from the estimate it would give were the data uncontaminated.
STRENGTHS & WEAKNESSES
In order to describe the BDP mathematically, define T as a regression estimator, Z as a sample of n data points and T(Z) = β̂. Let Z′ be the corrupted sample in which m of the original data points are replaced with arbitrary values. The maximum effect that could be caused by such contamination is

$$\text{effect}(m; T, Z) = \sup_{Z'} \, |T(Z') - T(Z)| \qquad (7)$$
STRENGTHS & WEAKNESSES
When (7) is infinite, an outlier can have an arbitrarily large effect on T. The BDP of T at the sample Z is therefore defined as:

$$\mathrm{BDP}(T, Z) = \min\left\{\frac{m}{n} : \text{effect}(m; T, Z) \text{ is infinite}\right\} \qquad (8)$$
STRENGTHS & WEAKNESSES
The least squares estimator, for example, has a breakdown point of 1/n, because just one leverage point can cause it to break down. As the number of data points increases, this breakdown point tends to 0, and so the least squares estimator is said to have a BDP of 0%.
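This 1/n breakdown can be seen in a small simulation (illustrative data, not from the seminar): replacing a single response with an arbitrarily large value drags the least-squares slope arbitrarily far from the clean fit.

```python
import numpy as np

# Clean data on a line of slope 2, plus one corrupted observation.
rng = np.random.default_rng(0)
x = np.arange(10.0)
y = 2.0 * x + rng.normal(0.0, 0.1, size=10)

slope_clean = np.polyfit(x, y, 1)[0]   # close to the true slope 2

y_bad = y.copy()
y_bad[-1] = 1e6                        # a single arbitrarily bad point
slope_bad = np.polyfit(x, y_bad, 1)[0] # the slope is dragged to a huge value
```

Making `y_bad[-1]` larger makes `slope_bad` larger without bound, which is exactly what (8) captures for m = 1.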
STRENGTHS & WEAKNESSES
Remark
The highest breakdown point one can hope for is 50%, since if more than half the data are contaminated, one cannot differentiate between 'good' and 'bad' data.
STRENGTHS & WEAKNESSES
Relative Efficiency of an Estimator
Definition
The efficiency of an estimator for a particular parameter is defined as the ratio of its minimum possible variance to its actual variance. Strictly, an estimator is considered 'efficient' only when this ratio is one.
STRENGTHS & WEAKNESSES
High efficiency is crucial for an estimator if the intention is to use an estimate from sample data to make inferences about the larger population from which the sample was drawn.
STRENGTHS & WEAKNESSES
Relative Efficiency
Relative efficiency compares the efficiency of an estimator to that of a well-known method.
In the context of regression, estimators are compared to the least squares estimator, which under the classical assumptions is the most efficient estimator known.
STRENGTHS & WEAKNESSES
Given two estimators T1 and T2 for a population parameter β, where T1 is the most efficient estimator possible and T2 is less efficient, the relative efficiency of T2 is the ratio of the mean squared error of T1 to that of T2:

$$\text{Efficiency}(T_1, T_2) = \frac{E[(T_1 - \beta)^2]}{E[(T_2 - \beta)^2]} \qquad (9)$$
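Formula (9) can be estimated by simulation. A sketch (illustrative, not from the seminar) comparing the sample mean (the efficient estimator at the normal) with the sample median, whose theoretical relative efficiency for normal data is 2/π ≈ 0.64:

```python
import numpy as np

# Monte-Carlo estimate of (9): MSE of the mean over MSE of the median,
# both estimating the location 0 of N(0, 1) from samples of size 50.
rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=1.0, size=(20000, 50))

mse_mean = np.mean(samples.mean(axis=1) ** 2)      # E[(T1 - beta)^2]
mse_median = np.mean(np.median(samples, axis=1) ** 2)  # E[(T2 - beta)^2]

rel_eff = mse_mean / mse_median   # roughly 2/pi ~ 0.64
```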
M-ESTIMATORS
Introduction
1 M-estimators were first proposed by Huber (1973).
2 But the early ones were weak in terms of one or more of the desired properties.
3 From them the modern methods developed.
M-ESTIMATORS
Maximum Likelihood Type Estimators
M-estimation is based on the idea that while we still want a maximum likelihood estimator, the errors might be better represented by a different, heavier-tailed distribution.
M-ESTIMATORS
If the probability density function of the errors is f(εi), then the maximum likelihood estimator for β is the one which maximizes the likelihood function

$$\prod_{i=1}^{n} f(\varepsilon_i) = \prod_{i=1}^{n} f(y_i - x_i^T \beta) \qquad (10)$$
M-ESTIMATORS
This means it also maximizes the log-likelihood function

$$\sum_{i=1}^{n} \ln f(\varepsilon_i) = \sum_{i=1}^{n} \ln f(y_i - x_i^T \beta) \qquad (11)$$

When the errors are normally distributed, it has been shown that this leads to minimizing the sum of squared residuals, which is the ordinary least squares method.
M-ESTIMATORS
Assuming the errors are differently distributed leads the maximum likelihood estimator to minimize a different function. Using this idea, an M-estimator β̂ minimizes

$$\sum_{i=1}^{n} \rho(\varepsilon_i) = \sum_{i=1}^{n} \rho(y_i - x_i^T \beta) \qquad (12)$$

where ρ(u) is a continuous, symmetric function called the objective function, with a unique minimum at 0.
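One common choice of ρ is Huber's function, sketched below with tuning constant t = 2 (the value used later in the delivery time example). It is quadratic near zero and linear in the tails, so large residuals are penalized less severely than under ρ(u) = u².

```python
import numpy as np

def huber_rho(u, t=2.0):
    # Huber's objective: u^2/2 for |u| <= t, and t*|u| - t^2/2 beyond,
    # matching the criterion table later in the seminar.
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= t, u**2 / 2.0, t * np.abs(u) - t**2 / 2.0)
```

For example, `huber_rho(0.5)` gives 0.125 (the quadratic branch), while `huber_rho(10.0)` gives 18.0 rather than the 50.0 that u²/2 would give.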
M-ESTIMATORS
1 Knowing the appropriate ρ(u) to use requires knowledge of how the errors are really distributed.
2 Functions are usually chosen by considering how the resulting estimator down-weights the larger residuals.
3 A robust M-estimator achieves this by minimizing the sum of a less rapidly increasing objective function than the ρ(u) = u² of least squares.
M-ESTIMATORS
Constructing a Scale Equivariant Estimator
The M-estimator is not necessarily scale equivariant, i.e. if the errors yi − xiTβ were multiplied by a constant, the new solution to the above equation might not be the scaled version of the old one.
M-ESTIMATORS
To obtain a scale-equivariant version of this estimator we usually solve

$$\sum_{i=1}^{n} \rho\left(\frac{\varepsilon_i}{s}\right) = \sum_{i=1}^{n} \rho\left(\frac{y_i - x_i^T \beta}{s}\right) \qquad (13)$$

where s is a robust estimate of scale.
M-ESTIMATORS
A popular choice for s is the rescaled median absolute deviation

$$s = 1.4826 \times \mathrm{MAD} \qquad (14)$$

where MAD is the median absolute deviation

$$\mathrm{MAD} = \mathrm{Median}\,|y_i - x_i^T \hat\beta| = \mathrm{Median}\,|\varepsilon_i| \qquad (15)$$
M-ESTIMATORS
s is highly resistant to outlying observations, with a BDP of 50%, as it is based on the median rather than the mean. The estimator rescales the MAD by the factor 1.4826 so that when the sample is large and the εi are really distributed as N(0, σ²), s estimates the standard deviation.
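A quick simulation sketch of (14) and (15) (simulated errors, not the seminar's data): for a large sample of N(0, σ²) errors, s = 1.4826 × MAD should come out close to σ.

```python
import numpy as np

# Draw a large sample of normal "errors" with true sigma = 3.
rng = np.random.default_rng(1)
eps = rng.normal(0.0, 3.0, size=100_000)

# MAD of the sample (15), then the rescaled scale estimate (14).
mad = np.median(np.abs(eps - np.median(eps)))
s = 1.4826 * mad   # should be close to 3
```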
M-ESTIMATORS
With a large sample and εi ∼ N(0, σ²):

P(|εi| < MAD) ≈ 0.5
⇒ P(|(εi − 0)/σ| < MAD/σ) ≈ 0.5
⇒ P(|Z| < MAD/σ) ≈ 0.5
⇒ MAD/σ ≈ Φ⁻¹(0.75)
M-ESTIMATORS
⇒ MAD/Φ⁻¹(0.75) ≈ σ
⇒ 1.4826 × MAD ≈ σ

Thus the tuning constant 1.4826 makes s an approximately unbiased estimator of σ if n is large and the error distribution is normal.
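The constant itself can be checked directly from the standard normal quantile function (a sketch using Python's standard library):

```python
from statistics import NormalDist

# Phi^{-1}(0.75) ~ 0.6745, and its reciprocal is the 1.4826 tuning
# constant used in the rescaled MAD.
q = NormalDist().inv_cdf(0.75)
print(round(1.0 / q, 4))  # 1.4826
```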
M-ESTIMATORS
Finding an M-Estimator
To obtain an M-estimate we solve

$$\min_{\beta} \sum_{i=1}^{n} \rho\left(\frac{\varepsilon_i}{s}\right) = \min_{\beta} \sum_{i=1}^{n} \rho\left(\frac{y_i - x_i'\beta}{s}\right) \qquad (16)$$

For that we equate the first partial derivatives of ρ with respect to each βj (j = 0, 1, 2, ..., k) to zero, yielding a necessary condition for a minimum.
M-ESTIMATORS
This gives a system of p = k + 1 equations

$$\sum_{i=1}^{n} x_{ij}\, \psi\!\left(\frac{y_i - x_i'\beta}{s}\right) = 0, \qquad j = 0, 1, 2, ..., k \qquad (17)$$

where ψ = ρ′, x_{ij} is the i-th observation on the j-th regressor, and x_{i0} = 1. In general ψ is a non-linear function, so equation (17) must be solved iteratively. The most widely used method is iteratively reweighted least squares (IRLS).
M-ESTIMATORS
To use iteratively reweighted least squares, suppose that an initial estimate β̂⁰ is available and that s is an estimate of scale. Then we write the p = k + 1 equations as

$$\sum_{i=1}^{n} x_{ij}\, \psi\!\left(\frac{y_i - x_i'\beta}{s}\right) = \sum_{i=1}^{n} x_{ij} \left\{\frac{\psi[(y_i - x_i'\beta)/s]}{(y_i - x_i'\beta)/s}\right\} \frac{y_i - x_i'\beta}{s} = 0 \qquad (18)$$
M-ESTIMATORS
This may be written as

$$\sum_{i=1}^{n} x_{ij} W_i^0 (y_i - x_i'\beta) = 0, \qquad j = 0, 1, 2, ..., k \qquad (19)$$

where

$$W_i^0 = \begin{cases} \dfrac{\psi[(y_i - x_i'\hat\beta^0)/s]}{(y_i - x_i'\hat\beta^0)/s} & \text{if } y_i \neq x_i'\hat\beta^0 \\[2ex] 1 & \text{if } y_i = x_i'\hat\beta^0 \end{cases} \qquad (20)$$
M-ESTIMATORS
We may write the above equations in matrix form as follows:

$$X' W^0 X \beta = X' W^0 y \qquad (21)$$

where W⁰ is an n × n diagonal matrix of weights whose diagonal elements are given by

$$W_i^0 = \begin{cases} \dfrac{\psi[(y_i - x_i'\hat\beta^0)/s]}{(y_i - x_i'\hat\beta^0)/s} & \text{if } y_i \neq x_i'\hat\beta^0 \\[2ex] 1 & \text{if } y_i = x_i'\hat\beta^0 \end{cases} \qquad (22)$$
M-ESTIMATORS
From the matrix form we see that this expression is the same as the usual weighted least squares normal equations. Consequently the one-step estimator is

$$\hat\beta^1 = (X' W^0 X)^{-1} X' W^0 y \qquad (23)$$

At the next step we recompute the weights from the equation for W, but using β̂¹ rather than β̂⁰.
M-ESTIMATORS
NOTE:
Usually only a few iterations are required to obtain convergence.
It can easily be implemented in a computer program.
M-ESTIMATORS
Re-Descending Estimators
Re- descending M estimators are those which have
influence functions that are non decreasing near the origin
but decreasing towards zero far from the origin.
Their ψ can be chosen to redescend smoothly to zero, so
that they usually satisfy ψ(x) = 0 for all |x | > r where r
is referred to as the minimum rejection point.
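A concrete redescending ψ is Andrew's wave function, sketched here with the tuning constant a = 1.48 used later in the example (the rejection point is r = aπ):

```python
import numpy as np

def andrews_psi(z, a=1.48):
    # Andrew's wave: psi(z) = sin(z/a) for |z| <= a*pi, and exactly 0
    # beyond the rejection point a*pi (observations there get weight 0).
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= a * np.pi, np.sin(z / a), 0.0)
```

For instance, `andrews_psi(10.0)` is exactly 0 because 10 exceeds 1.48π ≈ 4.65, so such an observation is fully rejected.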
M-ESTIMATORS
[Slides 43-45 contain figures only; the images are not reproduced in this transcript.]
M-ESTIMATORS
Robust Criterion Functions

| Criterion | ρ(z) | ψ(z) | w(z) | Range |
|---|---|---|---|---|
| Least squares | z²/2 | z | 1.0 | \|z\| < ∞ |
| Huber's t-function (t = 2) | z²/2 | z | 1.0 | \|z\| ≤ t |
| | t\|z\| − t²/2 | t·sign(z) | t/\|z\| | \|z\| > t |
| Andrew's wave function | a(1 − cos(z/a)) | sin(z/a) | sin(z/a)/(z/a) | \|z\| ≤ aπ |
DELIVERY TIME PROBLEM
Problem
A soft drink bottler is analyzing the vending machine service routes in his distribution system. He is interested in predicting the amount of time required by the route driver to service the vending machines in an outlet. This service activity includes stocking the machine with beverage products and minor maintenance or housekeeping. The industrial engineer responsible for the study has suggested that the two most important variables affecting the delivery time (y) are the number of cases of product stocked (x1) and the distance walked by the route driver (x2). The engineer has collected 25 observations on delivery time, which are shown in the following table. Fit a regression model to it.
DELIVERY TIME PROBLEM
Table of Data

| Observation i | Delivery time y (min) | Number of cases x1 | Distance x2 (ft) |
|---|---|---|---|
| 1 | 16.68 | 7 | 560 |
| 2 | 11.50 | 3 | 320 |
| 3 | 12.03 | 3 | 340 |
| 4 | 14.88 | 4 | 80 |
| 5 | 13.75 | 6 | 150 |
| 6 | 18.11 | 7 | 330 |
| 7 | 8.00 | 2 | 110 |
| 8 | 17.83 | 7 | 210 |
| 9 | 79.24 | 30 | 1460 |
| 10 | 21.50 | 5 | 605 |
| 11 | 40.33 | 16 | 688 |
| 12 | 21.00 | 10 | 215 |
| 13 | 13.50 | 4 | 255 |
DELIVERY TIME PROBLEM
| Observation i | Delivery time y (min) | Number of cases x1 | Distance x2 (ft) |
|---|---|---|---|
| 14 | 19.75 | 6 | 462 |
| 15 | 24.00 | 9 | 448 |
| 16 | 29.00 | 10 | 776 |
| 17 | 15.35 | 6 | 200 |
| 18 | 19.00 | 7 | 132 |
| 19 | 9.50 | 3 | 36 |
| 20 | 35.10 | 17 | 770 |
| 21 | 17.90 | 10 | 140 |
| 22 | 52.32 | 26 | 810 |
| 23 | 18.75 | 9 | 450 |
| 24 | 19.83 | 8 | 635 |
| 25 | 10.75 | 4 | 150 |
DELIVERY TIME PROBLEM
Least Squares Fit of the Delivery Time Data

| Obs. | yi | ŷi | ei | Weight |
|---|---|---|---|---|
| 1 | 16.68 | 21.7081 | -5.0281 | 1.0 |
| 2 | 11.50 | 10.3536 | 1.1464 | 1.0 |
| 3 | 12.03 | 12.0798 | -0.0498 | 1.0 |
| 4 | 14.88 | 9.9557 | 4.9244 | 1.0 |
| 5 | 13.75 | 14.1944 | -0.4444 | 1.0 |
| 6 | 18.11 | 18.3996 | -0.2896 | 1.0 |
| 7 | 8.00 | 7.1554 | 0.8446 | 1.0 |
| 8 | 17.83 | 16.6734 | 1.1566 | 1.0 |
| 9 | 79.24 | 71.8203 | 7.4197 | 1.0 |
| 10 | 21.50 | 19.1236 | 2.3764 | 1.0 |
| 11 | 40.33 | 38.0925 | 2.2375 | 1.0 |
| 12 | 21.00 | 21.5930 | -0.5930 | 1.0 |
| 13 | 13.50 | 12.4730 | 1.0270 | 1.0 |
DELIVERY TIME PROBLEM
| Obs. | yi | ŷi | ei | Weight |
|---|---|---|---|---|
| 14 | 19.75 | 18.6825 | 1.0675 | 1.0 |
| 15 | 24.00 | 23.3288 | 0.6712 | 1.0 |
| 16 | 29.00 | 29.6629 | -0.6629 | 1.0 |
| 17 | 15.35 | 14.9136 | 0.4364 | 1.0 |
| 18 | 19.00 | 15.5514 | 3.4486 | 1.0 |
| 19 | 9.50 | 7.7068 | 1.7932 | 1.0 |
| 20 | 35.10 | 40.8880 | -5.7880 | 1.0 |
| 21 | 17.90 | 20.5142 | -2.6142 | 1.0 |
| 22 | 52.32 | 56.0065 | -3.6865 | 1.0 |
| 23 | 18.75 | 23.3576 | -4.6076 | 1.0 |
| 24 | 19.83 | 24.4029 | -4.5729 | 1.0 |
| 25 | 10.75 | 10.9626 | -0.2126 | 1.0 |
DELIVERY TIME PROBLEM
Accordingly, we have the following values for the parameters:
β0 = 2.3412
β1 = 1.6159
β2 = 0.014385

Thus we have the fitted regression equation:

ŷ = 2.3412 + 1.6159 x1 + 0.014385 x2   (24)
DELIVERY TIME PROBLEM
Huber's t-Function, t = 2

| Obs. | yi | ŷi | ei | Weight |
|---|---|---|---|---|
| 1 | 16.68 | 21.7651 | -5.0851 | 0.6397 |
| 2 | 11.50 | 10.9809 | 0.5191 | 1.0 |
| 3 | 12.03 | 12.6296 | -0.5996 | 1.0 |
| 4 | 14.88 | 10.5856 | 4.2944 | 0.7572 |
| 5 | 13.75 | 14.6038 | -0.8538 | 1.0 |
| 6 | 18.11 | 18.6051 | -0.4951 | 1.0 |
| 7 | 8.00 | 7.9414 | 0.0587 | 1.0 |
| 8 | 17.83 | 16.9564 | 0.8736 | 1.0 |
| 9 | 79.24 | 69.2795 | 9.9605 | 0.3270 |
| 10 | 21.50 | 19.3269 | 2.1731 | 1.0 |
| 11 | 40.33 | 37.2777 | 3.0523 | 1.0 |
| 12 | 21.00 | 21.6097 | -0.6097 | 1.0 |
| 13 | 13.50 | 12.9900 | 0.5100 | 1.0 |
DELIVERY TIME PROBLEM
| Obs. | yi | ŷi | ei | Weight |
|---|---|---|---|---|
| 14 | 19.75 | 18.8904 | 0.8596 | 1.0 |
| 15 | 24.00 | 23.2828 | 0.7172 | 1.0 |
| 16 | 29.00 | 29.3174 | -0.3174 | 1.0 |
| 17 | 15.35 | 15.2908 | 0.0592 | 1.0 |
| 18 | 19.00 | 15.8847 | 3.1153 | 1.0 |
| 19 | 9.50 | 8.4529 | 1.0471 | 1.0 |
| 20 | 35.10 | 39.9326 | -4.8326 | 0.6728 |
| 21 | 17.90 | 20.5793 | -2.6793 | 1.0 |
| 22 | 52.32 | 54.2361 | -1.9161 | 1.0 |
| 23 | 18.75 | 23.3102 | -4.5602 | 0.7135 |
| 24 | 19.83 | 24.3238 | -4.4938 | 0.7238 |
| 25 | 10.75 | 11.5474 | -0.7974 | 1.0 |
DELIVERY TIME PROBLEM
Accordingly, we get the values of the parameters as follows:
β0 = 3.3736
β1 = 1.5282
β2 = 0.013739

Thus we get the fitted regression equation:

ŷ = 3.3736 + 1.5282 x1 + 0.013739 x2   (25)
DELIVERY TIME PROBLEM
Andrew's Wave Function with a = 1.48

| Obs. | yi | ŷi | ei | Weight |
|---|---|---|---|---|
| 1 | 16.68 | 21.6430 | -4.9630 | 0.4276 |
| 2 | 11.50 | 11.6923 | -0.1923 | 0.9989 |
| 3 | 12.03 | 13.1457 | -1.1157 | 0.9646 |
| 4 | 14.88 | 11.4549 | 3.4251 | 0.6949 |
| 5 | 13.75 | 15.2191 | -1.4691 | 0.9393 |
| 6 | 18.11 | 18.8574 | -0.7474 | 0.9840 |
| 7 | 8.00 | 8.9019 | -0.9019 | 0.9769 |
| 8 | 17.83 | 17.4040 | 0.4260 | 0.9947 |
| 9 | 79.24 | 66.0818 | 13.1582 | 0.0 |
| 10 | 21.50 | 19.2716 | 2.2284 | 0.8636 |
| 11 | 40.33 | 36.3170 | 4.0130 | 0.5975 |
| 12 | 21.00 | 21.8392 | -0.8392 | 0.9800 |
| 13 | 13.50 | 13.5744 | -0.0744 | 0.9998 |
DELIVERY TIME PROBLEM
| Obs. | yi | ŷi | ei | Weight |
|---|---|---|---|---|
| 14 | 19.75 | 18.9979 | 0.7521 | 0.9839 |
| 15 | 24.00 | 23.2029 | 0.7971 | 0.9819 |
| 16 | 29.00 | 28.6336 | 0.3664 | 0.9962 |
| 17 | 15.35 | 15.8247 | -0.4747 | 0.9936 |
| 18 | 19.00 | 16.4593 | 2.5407 | 0.8241 |
| 19 | 9.50 | 9.4638 | 0.0362 | 0.9999 |
| 20 | 35.10 | 38.7684 | -3.6684 | 0.6553 |
| 21 | 17.90 | 20.9308 | -3.0308 | 0.7566 |
| 22 | 52.32 | 52.3766 | -0.0566 | 0.9999 |
| 23 | 18.75 | 23.2271 | -4.4771 | 0.5155 |
| 24 | 19.83 | 24.0095 | -4.1796 | 0.5678 |
| 25 | 10.75 | 12.3027 | -1.5527 | 0.9323 |
DELIVERY TIME PROBLEM
Thus we have the estimates as follows:
β0 = 4.6532
β1 = 1.4582
β2 = 0.012111

Thus we get the fitted regression equation:

ŷ = 4.6532 + 1.4582 x1 + 0.012111 x2   (26)
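The three fitted equations (24)-(26) can be cross-checked against the tables above by evaluating them at observation 9 (x1 = 30 cases, x2 = 1460 ft), the most outlying point; the robust fits pull the prediction progressively further from the outlier.

```python
# Evaluate each fitted equation at a given (x1, x2).
def fitted(b0, b1, b2, x1, x2):
    return b0 + b1 * x1 + b2 * x2

yhat_ls = fitted(2.3412, 1.6159, 0.014385, 30, 1460)       # least squares (24)
yhat_huber = fitted(3.3736, 1.5282, 0.013739, 30, 1460)    # Huber, t = 2 (25)
yhat_andrews = fitted(4.6532, 1.4582, 0.012111, 30, 1460)  # Andrew's wave (26)
```

These reproduce the ŷ values for observation 9 listed in the three tables (about 71.82, 69.28 and 66.08 respectively).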
ANALYSIS
Computing M-Estimators
Robust regression methods are not an option in most statistical software today.
SAS PROC NLIN, among others, can be used to implement the iteratively reweighted least squares procedure.
There are also robust procedures available in S-PLUS.
ANALYSIS
Robust Regression Methods...
Robust regression methods have much to offer a data analyst.
They are extremely helpful in locating outliers and highly influential observations.
Whenever a least squares analysis is performed, it would be useful to perform a robust fit as well.
ANALYSIS
If the two fits are in substantial agreement, the
least-squares procedure offers a good estimate of the
parameters.
If the two procedures do not agree, the reason for the
difference should be identified and corrected.
Special attention needs to be given to observations
that are down-weighted in the robust fit.
PROPERTIES
Breakdown Point The finite-sample breakdown point is
the smallest fraction of anomalous data that can render the
estimator useless. The smallest possible breakdown
point is 1/n, i.e. a single observation can distort the estimator
so badly that it is of no practical use to the regression
model builder. The breakdown point of OLS is 1/n.
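The 1/n breakdown of OLS is easy to demonstrate numerically. In this hypothetical sketch, corrupting a single response out of ten swings the fitted slope far away from the true value:

```python
import numpy as np

# Ten points lying exactly on the line y = 2 + 3x
x = np.arange(10, dtype=float)
y = 2 + 3 * x

X = np.column_stack([np.ones(10), x])
clean = np.linalg.lstsq(X, y, rcond=None)[0]   # slope is exactly 3

y_bad = y.copy()
y_bad[9] += 1000.0                             # corrupt one single response
dirty = np.linalg.lstsq(X, y_bad, rcond=None)[0]

# One outlier pulls the OLS slope from 3 to about 57
print(clean[1], dirty[1])
```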
PROPERTIES
M-estimators can be affected by x-space outliers in an
identical manner to OLS.
Consequently, the breakdown point of the class of
M-estimators is 1/n as well.
We would generally want the breakdown point of an
estimator to exceed 10%.
This has led to the development of high-breakdown-
point estimators.
PROPERTIES
Efficiency
M-estimators have high asymptotic efficiency: with the
usual tuning constants they achieve close to the efficiency
of least squares under normal errors, and they remain
efficient when the error distribution is heavy-tailed.
SURVEY OF OTHER ROBUST REGRESSION ESTIMATORS
High-Breakdown-Point Estimators Because both
OLS and M-estimators suffer from a low breakdown point of
1/n, considerable effort has been devoted to finding
estimators that perform better with respect to this
property. A breakdown point of up to 50% is often desirable.
SURVEY OF OTHER ROBUST REGRESSION ESTIMATORS
There are various other estimation procedures, such as:
Least Median of Squares
Least Trimmed Sum of Squares
S-Estimators
R- and L-Estimators
Robust Ridge Regression
MM-Estimation
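To make one of these concrete, least median of squares minimizes the median of the squared residuals instead of their sum; it can be sketched via the classical random elemental-subset algorithm. The function name `lms_line` and the trial count are illustrative assumptions:

```python
import numpy as np

def lms_line(x, y, n_trials=500, seed=0):
    """Least-median-of-squares line fit via random elemental subsets (a sketch)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    best, best_med = None, np.inf
    for _ in range(n_trials):
        # draw two distinct points and fit the line through them exactly
        i, j = rng.choice(n, size=2, replace=False)
        if x[i] == x[j]:
            continue
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        # score the candidate line by the median of its squared residuals
        med = np.median((y - b0 - b1 * x) ** 2)
        if med < best_med:
            best_med, best = med, (b0, b1)
    return best
```

Because the median ignores the largest residuals, the chosen line tracks the majority of the data even when a large minority of responses is grossly contaminated.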
ABSTRACT & CONCLUSION
Review ⇒ Robustness and Resistance ⇒ Our Approach ⇒ Strengths and Weaknesses ⇒ M-Estimators ⇒ Delivery Time Problem ⇒ Analysis ⇒ Properties ⇒ Survey of Other Robust Regression Estimators
REFERENCE
1 Draper, Norman R. & Smith, Harry. "Applied Regression
Analysis", 3rd edn., John Wiley and Sons, New York, 1998.
2 Montgomery, Douglas C., Peck, Elizabeth A. & Vining, G.
Geoffrey. "Introduction to Linear Regression Analysis", 3rd edn., Wiley
India, 2003.
3 Brook, Richard J. "Applied Regression Analysis and
Experimental Design", Chapman & Hall, London, 1985.
4 Rawlings, John O. "Applied Regression Analysis: A Research
Tool", Springer, New York, 1989.
5 Pedhazur, Elazar J. "Multiple Regression in Behavioral Research:
Explanation and Prediction", Wadsworth, Australia, 1997.
THANK YOU