Robust, Smoothly Heterogeneous Variance Regression
Author(s): Michael Cohen, Siddhartha R. Dalal and John W. Tukey
Source: Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 42, No. 2 (1993), pp. 339-353
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2986237
Accessed: 17/12/2014 10:52

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR is a not-for-profit service that helps scholars, researchers and students discover, use and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. Wiley and the Royal Statistical Society are collaborating with JSTOR to digitize, preserve and extend access to the Journal of the Royal Statistical Society. Series C (Applied Statistics). http://www.jstor.org

This content downloaded from 128.235.251.160 on Wed, 17 Dec 2014 10:52:42 AM. All use subject to JSTOR Terms and Conditions.




Appl. Statist. (1993) 42, No. 2, pp. 339-353

Robust, Smoothly Heterogeneous Variance Regression

By MICHAEL COHEN†

University of Maryland, College Park, USA

SIDDHARTHA R. DALAL

Bellcore, Morristown, USA

and JOHN W. TUKEY

Princeton University, USA

[Received October 1989. Final revision December 1991]

SUMMARY

Under the simple linear regression model, we consider two violations of the standard assumptions, namely heterogeneous variances and long-tailed error distributions, in an integrated manner. A new method for estimation is proposed which assumes only that the heterogeneity is a locally smooth function of the regressor variable, except for outliers. The procedure is based on smoothing the non-outlying residuals from a robust regression to provide weights for a weighted regression. Monte Carlo results, some theory and a real data example are given. It is shown that the method is substantially more efficient than the usual robust regression methods in the presence of heterogeneity and only slightly worse when the variances are exactly equal.

Keywords: Biweight; Heteroscedasticity; Monte Carlo methods; Robust regression

1. Introduction

Consider the following simple linear regression problem. We have 79 observations, presented in Table 1, providing the number of hours, y, needed to splice x pairs of wires for a particular type of telephone cable ('random'). (These data are presented in Chambers et al. (1983), pages 244 and 262-267.) As implied in Table 1, we assume that the observations (y_i, x_i) have been sorted by x_i. We would like to find an estimate of the slope of the regression line ŷ_i = a + bx_i that is efficient, even if the error variances turn out not to be homogeneous.

A scatterplot for this data set is given in Fig. 1. (We note that there are several instances of multiple responses for various values of x.) We see that the assumption of linearity is supported. This is not surprising since the time needed to splice cables should be well approximated by a constant times the number of pairs of cables to be spliced, with an additional additive constant to account for some fixed time costs. However, the assumption of homogeneous error variances is not clearly justified; the

†Address for correspondence: School of Public Affairs, University of Maryland, College Park, MD 20742, USA.

© 1993 Royal Statistical Society 0035-9254/93/42339 $2.00


340 COHEN, DALAL AND TUKEY

TABLE 1. Data for the random cable†

Observation    x      y     Observation    x      y
 1     50   0.27     41   2210   4.25
 2     50   0.72     42   2214   8.23
 3     52   0.58     43   2214   8.47
 4    100   3.92     44   2218   6.95
 5    150   1.48     45   2358   7.02
 6    200   1.15     46   2400  12.83
 7    200   1.98     47   2750  10.05
 8    200   2.62     48   2800   8.03
 9    300   1.15     49   2800   8.28
10    300   4.33     50   2800   9.40
11    400   3.58     51   2800   9.98
12    400   6.30     52   3000   4.73
13    406   1.47     53   3000   7.53
14    450   2.02     54   3000   8.85
15    500   3.35     55   3016  15.67
16    600   5.00     56   3100  10.43
17    700   1.98     57   3116  17.85
18    802   3.42     58   3200   6.02
19    900   7.57     59   3400   9.68
20    916   2.52     60   3600   8.05
21   1000   5.13     61   3600  12.12
22   1200   2.48     62   3600  17.17
23   1200   2.70     63   3608   9.53
24   1200   6.05     64   3616  12.33
25   1200   6.38     65   3616  14.75
26   1204   3.02     66   3644   7.02
27   1204   3.25     67   4220  12.45
28   1300   4.73     68   4222  10.20
29   1300   5.02     69   4222  16.50
30   1350   5.97     70   4400  13.33
31   1624   4.08     71   4800  15.53
32   1800   5.10     72   4814   9.55
33   1800   5.93     73   4816   8.73
34   1800   6.27     74   4822   8.73
35   1800   9.37     75   5422   7.93
36   1808   5.58     76   5422  13.67
37   1814   3.10     77   5424   8.17
38   1827   3.62     78   5428   8.98
39   1950   6.10     79   6300  14.57
40   2116   5.92

†x is the number of wire pairs joined; y is the splicing time in hours; values for y have been rounded to two decimal places.

variability of y given x seems to be greater for larger values of x. Although the appearance of variance heterogeneity might be the result of the inclusion of some possible outliers rather than a dependence of the variance heterogeneity on x, it is clear that the visual support for homogeneous error variances is not strong.

Use of the least squares regression line in this situation (implicitly assigning equal weights to all observations) may result in reduced efficiency over methods that allow for variance heterogeneity. For example, we could use a robust estimate of the regression line, e.g. the biweight (discussed later), which would be effective in


VARIANCE REGRESSION 341

[Figure: scatterplot of the splicing data with the fitted 'auto' and 'biweight' lines; x axis, number of wire pairs (0-7000)]

Fig. 1. Scatterplot of the random splicing data

providing weights for outlying points, but might not be effective in situations with few or no outliers but substantial variance heterogeneity dependent on x. Or we could assume a parametric relationship between the variance of the errors and x, and estimate these parameters (possibly robustly) by using regression of some function of the residuals (from a preliminary fit) on x. However, it is not clear that a parametric fit can always be discovered, especially if there are outliers. Also, if a robust parametric fit is used to provide weights to accommodate the heterogeneity dependent on x, and outliers are not downweighted through use of some additional weighting scheme, the parametric model may wrongly assign a substantial weight to the outlying observations.

The two possible divergences from homogeneous error variance discussed, i.e. outliers and smooth heterogeneity of error variance, can and do occur simultaneously. Therefore we need a procedure that provides weights in this combined situation. We have developed a method for determining weights for observations in regression which accommodates both types of divergence from homogeneity.

In Fig. 1 two linear fits are displayed. The biweight line has estimated parameters a_biwt = 2.3 and b_biwt = 0.200 × 10⁻². The line developed here, designated 'new', has estimated parameters a_new = 1.87 and b_new = 0.215 × 10⁻². (A closely related line, developed here and discussed later, also has estimated slope b_auto = 0.215 × 10⁻².) The estimated standard deviation of b_new is 0.02 × 10⁻² (see Cohen et al. (1983) for methods for estimating the standard deviation of b_new). There is a considerable difference between the estimates, and it is easy to imagine data sets where the difference would be larger. Fig. 2 presents the weights given to each observation for

[Figure: weights plotted against observation number (0-80)]

Fig. 2. Comparison of biweights (----------) with auto weights (———)


both the biweight and the estimator developed here. The weights underlying the estimate developed here for the observations which belong to the subset with the 40 smallest values for x are substantially greater than the remainder, whereas the biweight assigns relatively equal weights for all x, aside from several points which the biweight treats as moderate to extreme outliers. It is this potential for differential weighting which we believe provides the estimator developed here with advantages over traditional robust estimation in the presence of smooth heterogeneity, while retaining the usual advantage with respect to a treatment of outlying observations.

(In examining Fig. 2, note that there are several regions where the weights developed here are flat. A close examination of Table 1 reveals that these regions correspond to x-values with multiple observations. This is because our procedure essentially estimates the local variance from the non-outlying replicated values when available. For this reason, we have plotted weights against observation number rather than against x-values.)

We are not suggesting that the biweight line provides an obviously inferior description of this data set for all purposes. For larger values of x the biweight line has a better fit. However, b_new provides an important alternative that is protected against a wider variety of variance heterogeneity. In the following we shall compare the effectiveness of the least squares slope estimate, the biweight slope estimate and the slope estimates developed here, under a variety of forms of variance heterogeneity.

2. Previous Work

In the usual simple linear regression model

y_i = α + βx_i + ε_i,   i = 1, …, n,

the ε_i are usually taken as independent with mean 0 and variance σ²(x_i). (The ε_i are frequently assumed to have a Gaussian distribution, though this is rarely used in our development and is not essential.) The understanding that the variances are not constant, i.e. that σ²(x_i) ≠ σ², has resulted in the development of two almost completely separate areas of research.

The first area of research, robust regression, is concerned with developing estimates for α and β where the narrow assumption of a specifically Gaussian error distribution is broadened to include diverse possibilities, including that of an error distribution with longer tails, e.g. the slash distribution. A situation closely related to this long-tailed case, which often leads to essentially the same estimates, arises when the majority of the data points are assumed to have constant or nearly constant variance, except for an unknown, randomly chosen minority which have a much higher variance. For examples of this see the references in Huber (1981) and Belsley et al. (1980).

The second area of research, smoothly heterogeneous variance regression, assumes that the variability in the error variance is a continuous function of known regressor variable(s) of known form with some unknown parameters. For examples of this area of research see Amemiya (1973), Cook and Weisberg (1982), Rutemiller and Bowers (1968), Hald (1952) and Giltinan et al. (1986).

We call departures from constant variance of the first type described above random extreme divergence from constancy, and we call departures of the second type systematic smooth divergence. The mistaken assumption of constant variance when


the divergence from constancy is random and extreme can result in fairly drastic shortcomings in the performance of the least squares estimate. In contrast, the mistaken assumption of constant variances when the divergence from constancy is systematic and moderate is not as severe a difficulty, since the least squares estimate will often be adequate for many purposes, depending on the behaviour of σ²(x_i). But certainly the use of equally weighted least squares estimation can result in a reduction in efficiency compared with estimators that take into account the heterogeneity in the error variances. See Bloch and Moses (1988) for bounds on the performance of misweighted regression estimates.

We proceed by assuming that we do not know the parametric form of σ²(x_i), as in the references cited earlier. Instead we assume that σ²(x_i) is smoothly varying, except at a few randomly chosen points where the variance is greatly increased. The two principal contributions of which we are aware are those of Carroll and Ruppert (1982) and Birch and Binkley (1983). As discussed later, both treat less extreme and less numerous alternatives than are considered here.

We present an estimator of β which has the following properties. First, where the divergence from constancy is random and extreme, the estimator developed here performs nearly as well as the biweight slope estimate, which was designed to perform well in that situation. Second, where the divergence from constancy is systematic and moderate, the estimator developed here generally outperforms the (equally weighted) least squares estimate, even for variances relatively close to constant. For variances nearly constant, the cost of using the estimate developed here, as measured by a loss in efficiency, is never great.

Finally, in the situations for which the estimator was developed, i.e. the simultaneous occurrence of random and extreme, and of systematic and moderate, divergence from constancy, the estimate outperforms both the biweight and the least squares estimates. We emphasize that the approach taken is natural. We use the residuals from a robust regression to give an indication of the variance at each x_i and then weight each point accordingly, using a weighted regression, to estimate β.

3. Experimental Design of Simulation

In this section we describe the simulations used to compare the performance of the estimates of slope developed here with the least squares and the biweight slope estimates. We set x_i = (i − 1)/99, for i = 1, …, 100 (100 equally spaced points on [0, 1]). Because of the equivariance properties of the estimates examined (see Cohen et al. (1983) for a proof), there was no loss of generality in letting each y_i have mean 0. Also y_i was selected to have a Gaussian distribution with variance σ²(x_i), described next.

The variance function σ²(x_i) involved both random extreme and moderate systematic divergences from constant variance. We examined three basic shapes of systematic smooth divergence:

(a) symmetric quadratics bowl shaped up,
(b) symmetric quadratics bowl shaped down and
(c) straight lines rising with x (equivalent to straight lines falling with x).

Quadratic shapes are identified, in subsequent text and tables, by the values taken by σ²(x) at x-values of 0, 0.5 and 1. (Linear shapes are identified by the values taken at x = 0 and x = 1.) The shapes used are catalogued in Table 2.


TABLE 2. Systematic divergences from homogeneity for σ²(x)†

Bowl up          Bowl down        Linear
2501, 1, 2501    1, 2501, 1       1, 10001
 251, 1, 251     1, 251, 1        1, 1001
  26, 1, 26‡     1, 26, 1‡        1, 101
   9, 1, 9‡      1, 9, 1‡         1, 21‡
   3, 1, 3‡      1, 3, 1‡         1, 6‡
   1, 1, 1‡      1, 1, 1‡         1, 1‡

†The homogeneous case is represented three times, at the bottom of each shape group.
‡These marked shapes were examined with a different seed of the random number generator from that used for the other shapes, i.e. two different seeds were used in the entire simulation study, with one seed used for the more moderate designs and the other seed used for the remainder. The marker is used in Table 3 to indicate the same difference in random number generator seeds.

Second, we considered random extreme divergence from homogeneity as superposed on the systematic divergence. With an assumed probability p, the existing variance at x_i given by one of the shapes in Table 2 was multiplied by a scale factor k > 1. The p-values examined here are 0.00, 0.05 and 0.10. The k-values considered are 50 and 300 (representing standard deviation increases by factors of more than 7 and 17 times).
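This design is straightforward to reproduce. The sketch below generates one simulated sample under a quadratic shape coded (σ²(0), σ²(0.5), σ²(1)), with the (p, k) inflation superposed; the function names and the quadratic interpolation through the three coded values are our reconstruction, not code from the paper.

```python
import numpy as np

def variance_shape(x, v0, vmid, v1):
    """Quadratic sigma^2(x) passing through (0, v0), (0.5, vmid) and (1, v1)."""
    b = -3 * v0 + 4 * vmid - v1
    c = 2 * v0 - 4 * vmid + 2 * v1
    return v0 + b * x + c * x ** 2

def simulate(shape, p=0.05, k=50, n=100, seed=None):
    """One sample: x_i = (i - 1)/99, Gaussian errors with mean 0 and
    variance sigma^2(x_i), inflated by the factor k with probability p."""
    rng = np.random.default_rng(seed)
    x = np.arange(n) / (n - 1)
    s2 = variance_shape(x, *shape)
    s2 = np.where(rng.random(n) < p, k * s2, s2)   # random extreme divergence
    y = rng.normal(0.0, np.sqrt(s2))               # mean 0: no loss of generality
    return x, y
```

For example, simulate((1, 2501, 1), p=0.05, k=50) draws from the most extreme bowl-down shape with 5% of the variances further inflated 50-fold.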

Carroll and Ruppert (1982) considered n = 41, using p = 0.10 and k = 9, corresponding to quite moderate random extreme heterogeneity. They combined this with two smooth dependences (similarly coded): a linear dependence (1, 9) and an asymmetric quadratic dependence (1, 25, 81). The magnitudes of these smooth dependences are comparable with some of those used in the present paper, but Carroll and Ruppert's overall exploration of combined heterogeneity is much more restricted than that presented here. The situations considered by Birch and Binkley (1983) were even closer to homogeneity; their variance function was changed smoothly by changing p, the degree of contamination, as a function of x. Thus the present account covers a much wider range of patterns of heterogeneity than was previously available.

4. Examination of Least Squares versus Biweight Fitting

Let us first examine the performance of least squares fitting compared with biweight fitting for the situations presented previously. The equally weighted least squares estimate of the slope will be denoted b_ls, where

b_ls = Σ (y_i − ȳ)(x_i − x̄) / Σ (x_i − x̄)².
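In code the formula is a one-liner (a sketch; the function name b_ls is ours):

```python
import numpy as np

def b_ls(x, y):
    """Equally weighted least squares slope:
    sum (y_i - ybar)(x_i - xbar) / sum (x_i - xbar)^2."""
    xc = x - x.mean()
    return np.sum((y - y.mean()) * xc) / np.sum(xc ** 2)
```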

Roughly, to compute b_biwt, an iterative procedure is used which downweights points whose residuals are much larger in absolute value than those of the majority of points. These residuals come from an initial fit using the usual three-group resistant line. (Details of the computation of b_biwt used here are given in Appendix A. For a more detailed description of b_biwt see Mosteller and Tukey (1977). The algorithm used for the three-group resistant line is also described in Appendix A.)
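The iteration can be sketched as iteratively reweighted least squares with Tukey's biweight function. This is a simplified stand-in for the Appendix A procedure: it starts from an ordinary least squares fit rather than the three-group resistant line, and the rejection scale s = 6 × median|r| is a conventional tuning choice, not necessarily the authors'.

```python
import numpy as np

def biweight_line(x, y, c=6.0, iters=20):
    """Biweight regression line via iteratively reweighted least squares (sketch)."""
    w = np.ones_like(x)
    a = b = 0.0
    for _ in range(iters):
        # weighted least squares fit with the current weights
        xb = np.average(x, weights=w)
        yb = np.average(y, weights=w)
        b = np.sum(w * (y - yb) * (x - xb)) / np.sum(w * (x - xb) ** 2)
        a = yb - b * xb
        # recompute biweight weights from the residuals
        r = y - (a + b * x)
        s = c * np.median(np.abs(r)) + 1e-12
        u = np.clip(r / s, -1.0, 1.0)
        w = (1.0 - u ** 2) ** 2        # zero weight once |r| reaches s
    return a, b
```

On data following y = 1 + 2x with small Gaussian noise plus one gross outlier, the fitted slope stays near 2 because the outlier's weight is driven to 0.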


TABLE 3. Relative efficiencies of b_biwt to b_ls, b_auto to b_ls, b_auto to b_biwt and b_new to b_auto for a variety of situations†

Each cell lists the four ratios b_biwt/b_ls, b_auto/b_ls, b_auto/b_biwt, b_new/b_auto.

Shape            Gaussian case                5% random divergence at factor 50
1, 2501, 1       1.11, 1.22, 1.10, —          3.43, 3.84, 1.12, —
1, 251, 1        1.10, 1.21, 1.10, —          3.42, 3.81, 1.11, —
1, 26, 1‡        1.08, 1.14, 1.05, —          3.34, 3.56, 1.07, —
1, 9, 1‡         1.07, 1.11, 1.04, 1.07       3.25, 3.46, 1.06, 1.05
1, 3, 1‡         1.00, 0.99, 1.00, 0.98       3.02, 3.02, 1.00, 0.98
1, 1, 1‡         0.95, 0.94, 0.99, 0.94       2.91, 2.87, 0.99, 0.94

2501, 1, 2501    0.78, 1.07, 1.38, —          2.62, 3.59, 1.37, —
251, 1, 251      0.78, 1.05, 1.34, —          2.63, 3.50, 1.33, —
26, 1, 26‡       0.84, 0.96, 1.14, —          2.76, 3.15, 1.14, —
9, 1, 9‡         0.89, 0.98, 1.11, 1.01       2.70, 2.97, 1.10, 1.01
3, 1, 3‡         0.92, 0.93, 1.01, 0.98       2.80, 2.84, 1.01, 0.98
1, 1, 1‡         0.95, 0.94, 0.99, 0.94       2.91, 2.87, 0.99, 0.94

1, 10001         0.94, 1.14, 1.21, —          2.76, 3.41, 1.24, —
1, 1001          0.94, 1.14, 1.21, —          2.76, 3.39, 1.23, —
1, 101           0.94, 1.10, 1.17, —          2.80, 3.32, 1.19, —
1, 21‡           0.94, 1.10, 1.17, 1.09       2.97, 3.51, 1.18, 1.08
1, 6‡            0.94, 0.98, 1.04, 1.02       2.98, 3.14, 1.05, 1.03
1, 1‡            0.95, 0.94, 0.99, 0.94       2.91, 2.87, 0.99, 0.94

†The standard deviations of the efficiencies are approximately 0.01.
‡Marked shapes used a different random number generator seed (see Table 2).

Table 3 contains the ratio of the variance of b_ls to the variance of b_biwt for two cases: no random divergence and 5% random divergence by a factor of 50. (Results for p = 0.10 with variance increases of 50 and 300 and for p = 0.05 with a variance increase of 300 are qualitatively similar to those shown for p = 0.05 with a variance increase of 50.) Examining the first number in each group in the right-hand columns of Table 3, we see that the cost of not using b_biwt in the presence of random extreme divergence from homogeneity is extreme. In the presence of 5% 'outliers' of variance 50 times the smooth dependence, the variance of b_ls is between 2.5 and 3.5 times the variance of b_biwt, equivalent to sample size reductions by such a factor.

For the remaining three of the five (p, k) combinations (not displayed), for the marked designs of Table 2, we found ratios of variances from 4.4 to 5.1 for the combination (10%, 50), from 13.1 to 15.4 for (5%, 300) and from 23.0 to 28.0 for (10%, 300). The relative inefficiency of least squares in the presence of observations with greatly increased variance has been the predominant reason for the development of b_biwt and other robust estimates of slope and is well understood. Table 4 provides a simple approximation for these efficiencies. The effect on the variance of any observation resulting from multiplication by k with probability p is to multiply it by a factor of 1 − p + kp. Table 4 shows that this approximation, denoted 'naive', to the variance increase of b_ls relative to b_biwt is generally too high, which is expected since the biweight is imperfect at identifying outliers. This lack of full effectiveness is captured by the 'improved' approximation, 1 + 0.8p(k − 1), roughly arguing that the biweight is 80% effective in identifying outliers.
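The two approximations are easily checked; for example, the p = 0.05 entries of Table 4 follow directly:

```python
def naive(p, k):
    # inflation of var(b_ls) when every inflated point is fully used
    return 1 - p + k * p            # = 1 + p*(k - 1)

def improved(p, k):
    # assumes the biweight removes about 80% of the inflation
    return 1 + 0.8 * p * (k - 1)

print(round(naive(0.05, 50), 2), round(improved(0.05, 50), 2))    # 3.45 2.96
print(round(naive(0.05, 300), 2), round(improved(0.05, 300), 2))  # 15.95 12.96
```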


TABLE 4. Summaries of and approximations to efficiencies of b_biwt relative to b_ls for the eight marked situations of Table 2†

Outliers            Relative efficiencies                  Ratios
p      Scale k      Naive    Improved   Observed          Observed/naive   Observed/improved
0.05   50           3.45     2.96       2.7-3.3           0.78-0.97        0.91-1.13
0.10   50           5.95     4.96       4.4-5.1           0.71-0.86        0.88-1.03
0.05   300          15.95    12.96      13.1-15.4         0.81-0.97        1.01-1.19
0.10   300          30.95    24.96      23.0-28.0         0.74-0.90        0.92-1.12
0.00   —            1.00     1.00       0.89-1.07         0.89-1.07        0.89-1.07

†Naive approximation, 1 − p + kp = 1 + p(k − 1); improved approximation, 1 + 0.8p(k − 1).

Examining the first number of each group in the middle columns of Table 3, we see that b_ls has a better performance than b_biwt for both the bowl-up shape and, to a lesser extent, linear heterogeneity in the absence of random extreme divergence. However, b_biwt is more effective than b_ls for bowl-down heterogeneity. This effectiveness of b_ls in comparison with b_biwt for various heterogeneity shapes is somewhat surprising, since the same mechanism that permits b_biwt to downweight points with large variances might also downweight points with moderately larger variances than nearby points. Evidently, this depends greatly on the shape of the heterogeneity. A referee has suggested that the reason for the biweight's inability to do well for bowl-up heterogeneity is that the points with increased variance, which the biweight is ignoring, are those with the largest influence. In this type of situation it is essential to weight the outside points properly. In the bowl-down heterogeneity case, the influential points are those with the lowest variance, and downweighting the middle points is not as crucial.

We thus have two estimates, one which performs well when there is random extreme divergence from homogeneity and another which is often more effective in the presence of systematic moderate divergence. The natural question is whether it is possible to make use of the information about the variances in the residuals to maintain the effectiveness of the biweight in the presence of random extreme divergence and also either to outperform (or at least to perform comparably with) b_ls with moderate systematic divergence, while losing as little ground as possible when σ²(x) is the same everywhere.

5. New Estimate, b_auto

We now outline a procedure for estimating β in the presence of random extreme divergence from homogeneity as well as systematic moderate divergence. If one intends to use residuals to make inferences about the size of the variance at the x_i, it is necessary that the residuals result from estimates of α and β that are not greatly affected by random extreme divergence from homogeneity. As seen earlier, the biweight regression line has this property. So the initial step is to use the residuals from the biweight regression line. Then we calculate a moving M-estimate of the absolute residuals, based on a moving window of length m (set to 21 in the example discussed


earlier), to obtain a smooth value, neglecting outliers. (We use absolute residuals rather than squared residuals because their distribution is less skewed and therefore averages better. There is a good argument for using the absolute residuals to the power 2/3 instead.) The moving M-estimate makes use of two weightings. The first weighting is the standard weighting used in M-estimation, i.e. each absolute residual is weighted on the basis of how extreme it is among the other absolute residuals for the x_j in the window, more extreme absolute residuals receiving less weight. The second weighting is based on how distant the corresponding x_j is from the point of interest, x_i. Qualitatively, these weightings resemble those in Cleveland's 'lowess' smoother (see Chambers et al. (1983), p. 94 ff.), which could probably be used quite satisfactorily in place of our weighting scheme. This results in |r_i^sm|, the smoothed absolute residuals.

The goal is to modify these |r_i^sm| such that using the squared reciprocals of the modifications, |r_i^mod|⁻², as weights in a weighted regression will give a reasonable estimate of β. Without modification, observations which are individually outlying will result in moderate values of |r_i^sm| owing to the robust smoothing and, if used to define weights directly, will unfortunately receive substantial weights in the resulting weighted regression. Therefore we make use of a procedure which continuously identifies outliers, replacing the corresponding smoothed absolute residuals by combinations of them and the raw absolute residuals, resulting in |r_i^mod| (which, incidentally, will no longer have a smooth appearance). The resulting estimate of slope from the weighted regression is denoted b_new and the resulting intercept estimate a_new.
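A much simplified sketch of this pipeline, omitting the outlier modification step that produces |r_i^mod| (which is specified in Appendix A): smooth the absolute residuals with a lowess-style moving window that combines distance weights with downweighting of extreme values, then use the squared reciprocals of the smoothed values as regression weights. All names and tuning constants here are ours, not the paper's.

```python
import numpy as np

def smooth_abs_residuals(x, r, m=21):
    """Moving robust average of |r|: a stand-in for the moving M-estimate.

    Each point gets the window of the m nearest-by-index observations
    (the data are sorted by x, and m <= len(x) is assumed); within a
    window, tricube distance weights are combined with biweight-style
    downweighting of extreme absolute residuals."""
    n, ar, half = len(x), np.abs(r), m // 2
    out = np.empty(n)
    for i in range(n):
        lo = max(0, min(i - half, n - m))
        idx = np.arange(lo, lo + m)
        d = np.abs(x[idx] - x[i])
        wd = (1 - (d / (d.max() + 1e-12)) ** 3) ** 3       # distance weights
        med = np.median(ar[idx])
        u = np.clip((ar[idx] - med) / (6 * med + 1e-12), -1, 1)
        wo = (1 - u ** 2) ** 2                             # extreme-value downweights
        out[i] = np.average(ar[idx], weights=wd * wo)
    return out

def weighted_slope(x, y, w):
    """Weighted regression slope, e.g. with w = 1 / smoothed**2."""
    xb, yb = np.average(x, weights=w), np.average(y, weights=w)
    return np.sum(w * (y - yb) * (x - xb)) / np.sum(w * (x - xb) ** 2)
```

Given residuals r from a biweight fit, a b_new-like slope would be weighted_slope(x, y, 1 / smooth_abs_residuals(x, r) ** 2); the real procedure additionally blends smoothed and raw absolute residuals at identified outliers.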

Estimate b_new succeeds in making good use of the information about σ²(x_i) contained in the residuals. However, for systematic moderate divergence close to constancy, the cost seemed rather high with respect to b_biwt and b_ls. Therefore we made use of a measure of smooth heterogeneity to weight together b_biwt and b_new in such a way that, in situations which appeared to have no systematic divergence from constancy, the result reduced to b_biwt. The final estimate is denoted b_auto. Details of the computation of b_new and b_auto are given in Appendix A. The trade-off in efficiency of b_new relative to b_auto for various heterogeneity functions can be examined in Table 3, where the fourth value of each group provides the efficiency of b_new relative to b_auto.

The second entry in each group of values in Table 3 compares the performance of b_auto and b_ls for σ²(x) with no random extreme divergence from constancy and for 5% outliers with variance increased by a factor of 50. Clearly, except for bowl-up shapes combined with no outliers, and then only for such shapes with very moderate (or no) heterogeneity, b_auto is superior to b_ls. The cost, 94% efficiency in the completely homogeneous case, may seem to some rather high, but, especially for bowl-down and linear patterns, the cost is clearly likely to be well worth it.

Thus, the procedure presented here can incorporate much more of the information about the variance from the residuals than the biweight; in fact, enough so that b_auto seems preferable to b_ls even in the absence of random extreme divergence from constancy.

The remaining question is to what extent we have lost any performance by using b_auto rather than b_biwt in trying to incorporate the information about systematic variation in σ²(x). The third number in each group of numbers presented in Table 3 compares the performance of b_auto and b_biwt for the same two situations and supports the general superiority of b_auto to b_biwt as an estimate of β for a variety of combined departures from the assumption of constant variance. All the values listed are 0.99 or


greater. As a result, b_auto can be recommended over b_biwt and (especially) over b_ls in all the situations that we have studied.

6. Discussion

Some issues concerning b_new and b_auto have been investigated, including estimation of the variance of b_new and b_auto, other possible measures of heterogeneity and the associated modified versions of b_auto, the effect of changes in the size of the smoothing window, the effect of non-uniform spacings of the x_i, changes in the sample size, non-quadratic discontinuous heterogeneity functions and the use of long-tailed error distributions. For details, see Cohen et al. (1983), who also discuss how this approach can be extended to multiple linear regression. Here, we limit the discussion to three issues.

6.1. Sample Size

For sample sizes less than about 50, this algorithm could be used only for very uncomplicated heterogeneity and design patterns, since typically there would not be enough observations to enable the smoother to work well. For data sets of, say, 500 or more, there is much more information about heterogeneity, making such data sets an easy test for the algorithm developed here. Thus, a data set of around 100 is likely to represent an appropriate challenge for our method.

6.2. Extreme Bowl-down Challenge

In the most extreme bowl-down dependence that we consider, (1, 2501, 1), just slightly less than 94% of the information about the slope is contained in the two end observations. In such a situation, we need not be afraid of giving these observations too much weight (we can, at most, lose 6%). If we knew that we were in such an extreme situation, we could use the squared reciprocals of the raw residuals as our weights. But we shall only rarely be in such a situation, so we do not dare to do this generally. If we are to use information from other, near-to-the-end, observations, however, we must be careful to extrapolate (to i = 1 or i = n, as the case may be). Presumably we should extrapolate smoothed values, plausibly of squared residuals. If this gives an extrapolated squared residual (e.g. for the leftmost observation) less than, say, a fifth of the next-to-the-end squared smoothed residual, then we can reasonably use the lesser of the extrapolated and raw squared residuals. Otherwise we should use the extrapolated squared residual as the reciprocal of our weight.

The outlier problem is exacerbated if, in the outlier-free case, most of the information belongs to only a few observations. Their variances, which we cannot estimate effectively, even in the Gaussian case, will dominate all else. So we must resign ourselves to not detecting when these observations are outliers, which implies accepting essentially all the variance increase (by a factor of 1 − p + kp = 1 + (k − 1)p) from the random presence of greatly enhanced variance.
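The variance-inflation factor in parentheses is easily checked numerically; the following one-line helper (name and example values ours) computes the average variance multiplier for a scale-contaminated sample:

```python
def inflation_factor(p, k):
    """Average variance multiplier when a fraction p of observations have
    their variance inflated by a factor k: 1 - p + k*p = 1 + (k - 1)*p."""
    return 1 - p + k * p

# e.g. 5% of observations carrying 9-fold variance inflation:
factor = inflation_factor(0.05, 9.0)   # about 1.4, a 40% average increase
```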

6.3. Other Deviations from Utopian Conditions

As far as we can tell, the algorithms just presented probably deal effectively with


(a) stretched tail distributions of perturbations,
(b) random outliers and
(c) smoothly varying underlying variance.

They do not attempt to deal with, among others,

(d) non-linear dependence of y on x,
(e) discreteness of the distribution of y or
(f) asymmetry of perturbations.

Little seems to be known about what might be needed in dealing with problems (e) or (f). A possible non-linear dependence on x is quite frequent and is possibly dealt with by inserting a new piece of algorithm into the third segment of the calculations of bnew (see Section A.2).

Acknowledgements

The authors are grateful to the Editor and two referees for suggestions that greatly improved the presentation of this paper.

Appendix A: Algorithms Used in Simulation

A.1. Computation of bbiwt

Calculate the simple resistant slope (denoted bre).

(a) Let yL = median{y1, . . ., y[n/3]} and yR = median{y[2n/3+1], . . ., yn}, where [z] denotes the smallest integer ≥ z.
(b) bre(1) = (yR − yL)/(x[2n/3+1] − x[n/3]).
(c) Let TL = median{y1 − bre(k)x1, . . ., y[n/3] − bre(k)x[n/3]} and define TR analogously for the right-hand third.
(d) Then bre(k+1) = bre(k) + (TR − TL)/(x[2n/3+1] − x[n/3]).
(e) Iterate until k = 5 or |bre(k+1) − bre(k)| < 0.01. Denote the result by bre.
(f) Let the intercept are = median{yi − brexi}, i = 1, . . ., n.

A better algorithm is found in Johnstone and Velleman (1985); we recommend its use, and would expect no appreciable change in overall performance.
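Steps (a)-(f) can be sketched in Python as follows (a minimal sketch with our own names, assuming the xi are sorted in ascending order; zero-based indexing replaces the paper's one-based subscripts):

```python
import math
import statistics as st

def resistant_slope(x, y, max_iter=5, tol=0.01):
    """Simple resistant slope b_re via medians of the outer thirds,
    following steps (a)-(f) of Section A.1; x must be sorted ascending."""
    n = len(x)
    third = math.ceil(n / 3)       # [n/3], smallest integer >= n/3
    right = math.ceil(2 * n / 3)   # right third starts at index [2n/3]+1 (1-based)
    yL = st.median(y[:third])
    yR = st.median(y[right:])
    dx = x[right] - x[third - 1]   # x_[2n/3+1] - x_[n/3]
    b = (yR - yL) / dx             # step (b)
    for _ in range(max_iter):
        res = [yi - b * xi for xi, yi in zip(x, y)]
        TL = st.median(res[:third])            # step (c), left third
        TR = st.median(res[right:])            # right third
        b_next = b + (TR - TL) / dx            # step (d)
        if abs(b_next - b) < tol:              # step (e)
            b = b_next
            break
        b = b_next
    a = st.median([yi - b * xi for xi, yi in zip(x, y)])  # intercept a_re
    return a, b

# on an exact line y = 2x + 1 the iteration homes in on slope 2
x = list(range(1, 10))
y = [2 * xi + 1 for xi in x]
a, b = resistant_slope(x, y)
```

Note that the iteration converges geometrically rather than in one step, since the denominator x[2n/3+1] − x[n/3] differs from the spread of the thirds' median abscissae.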

Next calculate a biweight regression starting at bre.

(a) Let abiwt(1) = are and bbiwt(1) = bre.
(b) Let SCALE(k) = 7 median{|yi − bbiwt(k)xi − abiwt(k)|}.
(c) Let wi(k) equal (1 − ui²)² for ui² ≤ 1 and equal 0 elsewhere, where ui = {yi − bbiwt(k)xi − abiwt(k)}/SCALE(k).
(d) Then

bbiwt(k+1) = Σ wi(k)(xi − x̄w)yi / Σ wi(k)(xi − x̄w)²

where x̄w = Σ wi(k)xi/Σ wi(k).
(e) Let abiwt(k+1) = Σ wi(k){yi − bbiwt(k)xi}/Σ wi(k).
(f) Iterate until k = 10 or |bbiwt(k+1) − bbiwt(k)| < 0.01.

The factor 7 used in part (b) is often used in robust regression.
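The biweight iteration of steps (a)-(f) might be sketched as below (our own names; a sketch started from the resistant fit, not the authors' exact code):

```python
import statistics as st

def biweight_slope(x, y, a0, b0, max_iter=10, tol=0.01):
    """Biweight regression of Section A.1, started from the resistant
    fit (a_re, b_re) = (a0, b0)."""
    a, b = a0, b0
    for _ in range(max_iter):
        res = [yi - b * xi - a for xi, yi in zip(x, y)]
        scale = 7 * st.median(abs(r) for r in res)        # step (b)
        if scale == 0:                                    # perfect fit
            break
        u = [r / scale for r in res]
        w = [(1 - ui * ui) ** 2 if ui * ui <= 1 else 0.0 for ui in u]  # step (c)
        sw = sum(w)
        xw = sum(wi * xi for wi, xi in zip(w, x)) / sw    # x-bar_w
        b_next = (sum(wi * (xi - xw) * yi for wi, xi, yi in zip(w, x, y))
                  / sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x)))  # step (d)
        a = sum(wi * (yi - b * xi) for wi, xi, yi in zip(w, x, y)) / sw  # step (e)
        if abs(b_next - b) < tol:                         # step (f)
            b = b_next
            break
        b = b_next
    return a, b

# a gross outlier at x = 10 is driven to zero weight by the biweight
x = list(range(21))
y = [2 * xi + 1 + 0.1 * (-1) ** xi for xi in x]
y[10] += 50.0
a, b = biweight_slope(x, y, a0=0.5, b0=1.8)
```

The outlying observation receives u² > 1 and hence weight 0 from the first iteration onwards, so the fitted slope and intercept stay close to the clean-data values.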


A.2. Computation of bauto

Calculate bbiwt as in Section A.1. Then smooth the standardized absolute residuals by using a moving M-estimate.

(a) Let ri = yi − bbiwtxi − abiwt.
(b) Let ri′ = ri/(1 − hati)^1/2 (standardized residuals), where

hati = 1/n + (xi − x̄)²/Σ(xj − x̄)².

(c) For each xi let lo and hi be integers such that m = hi − lo + 1, lo ≤ i ≤ hi, and |xhi − xlo| is a minimum (with ties broken to make ½(xhi + xlo) close to xi).

(d) Let medi = median{|rj′|}, j ∈ [lo, hi].
(e) Let MADi = median{||rj′| − medi|}, j ∈ [lo, hi].
(f) Let w1j = exp{−(xj − xi)²/2s} (s = 0.8{max(xi) − min(xi)}).
(g) Let w2j equal (1 − uj²)² for uj² ≤ 1 and equal 0 elsewhere, where uj = (|rj′| − medi)/(14 MADi).
(h) |ri^sm| = Σj=lo..hi w1j w2j |rj′| / Σj=lo..hi w1j w2j.

The weights in step (f) are the 'geographic' weights which give |rj′| more weight for xj close to xi. The value used, 0.8 times the range of the xs, is close to not using geographic weighting, but it can be easily reduced for situations when the heterogeneity is thought to be quickly varying.

Step (c) allows for non-uniform spacing of the design points, either moderate or extreme (when some points are replicated). We have modified the simplest definition of the window when estimating local smoothed variance to be able to use all replicated y-values at xi or all very nearby points. This has been accomplished by adding to the nearest neighbour interval all points whose abscissae are no further than a prespecified tolerance from xi. In the example discussed, this tolerance was set to 0.

The weights in step (g) are the robustifying weights, giving more weight to |rj′| for |rj′| close to medi. The factor 14 in the robust weights is larger than values used in other applications, because the distribution of absolute residuals is skewed and we want to average more of both the longer right-hand tail and the shorter left-hand tail than would otherwise be the case.
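Steps (c)-(h), with the two weight families just described, can be sketched as follows (our own names; the tie-breaking of step (c) is implemented by comparing window midpoints; x is assumed sorted ascending):

```python
import math
import statistics as st

def smooth_abs_residuals(x, r, m=7, s_frac=0.8, tol=0.0):
    """Moving M-estimate smooth of absolute residuals (steps (c)-(h) of
    Section A.2): for each x_i take the narrowest m-point window
    containing i (widened by `tol` to pick up replicated x-values), then
    average |r_j| under Gaussian 'geographic' weights times biweight
    robustness weights."""
    n = len(x)
    m = min(m, n)
    s = s_frac * (max(x) - min(x))         # 0.8 times the range of the xs
    absr = [abs(rj) for rj in r]
    out = []
    for i in range(n):
        # step (c): narrowest window; ties broken toward a centred midpoint
        best = None
        for lo in range(max(0, i - m + 1), min(i, n - m) + 1):
            hi = lo + m - 1
            key = (x[hi] - x[lo], abs((x[hi] + x[lo]) / 2 - x[i]))
            if best is None or key < best[0]:
                best = (key, lo, hi)
        _, lo, hi = best
        while lo > 0 and x[lo - 1] >= x[lo] - tol:     # replication tolerance
            lo -= 1
        while hi < n - 1 and x[hi + 1] <= x[hi] + tol:
            hi += 1
        window = range(lo, hi + 1)
        med = st.median(absr[j] for j in window)              # step (d)
        mad = st.median(abs(absr[j] - med) for j in window)   # step (e)
        num = den = 0.0
        for j in window:
            w1 = math.exp(-(x[j] - x[i]) ** 2 / (2 * s))      # step (f)
            u = (absr[j] - med) / (14 * mad) if mad > 0 else 0.0
            w2 = (1 - u * u) ** 2 if u * u <= 1 else 0.0      # step (g)
            num += w1 * w2 * absr[j]
            den += w1 * w2
        out.append(num / den if den > 0 else med)             # step (h)
    return out

# residuals whose magnitude grows linearly: the smooth should track the trend
x = list(range(20))
r = [(i + 1) * (-1) ** i for i in range(20)]
sm = smooth_abs_residuals(x, r)
```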

As a simple smoothing process, we can expect |ri^sm| to fail to follow the true σ²(x) somewhat, being lower than it should be where σ²(x) is highest and higher where σ²(x) is lowest. A possible cure for this, which should be examined, is to 'twice' the smoothing, which in the present context could take the following form. After carrying out all of (a)-(h), put Ri = ri/ri^sm, and repeat the whole of this iteration applied to Ri rather than to ri, producing Ri^sm. Then replace |ri^sm| by |ri^sm| Ri^sm in later steps. (We might, in particular, only shorten s in the smoothing of Ri.)
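The 'twicing' idea works with any smoother in place of steps (a)-(h); purely for illustration, the sketch below (names and the demo moving-median smoother are ours) smooths the ratios of residuals to the first smooth and multiplies the two smooths:

```python
import statistics as st

def moving_median(r, m=5):
    """Demo smoother: moving median of |r| over m-point windows."""
    n = len(r)
    out = []
    for i in range(n):
        lo = max(0, min(i - m // 2, n - m))
        out.append(st.median(abs(v) for v in r[lo:lo + m]))
    return out

def twice(r, smooth=moving_median):
    """'Twice' a variance smooth: smooth |r|, smooth the ratios
    R_i = r_i / rsm_i with the same smoother, and multiply."""
    rsm = smooth(r)
    ratios = [ri / si for ri, si in zip(r, rsm)]
    corr = smooth(ratios)                      # R_i^sm
    return [si * ci for si, ci in zip(rsm, corr)]
```

When the first smooth already matches the residual magnitudes, the ratios are ±1 and the correction smooth is identically 1, leaving the first smooth unchanged.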

We point out that since |ri′|^2/3 has a more symmetrical distribution than |ri′|, it might be beneficial to smooth |ri′|^2/3 instead of |ri′|.

A proposed change to accommodate non-linear dependence on x (which we have not simulated but anticipate would be effective) is to replace step (d) as follows.

(i) Smooth the residuals and fit a simple non-linear adjustment.
(ii) Smooth the values of medi, perhaps first with 3RSS (a repeated median smoother), twice, and then with a 'detrivializer' (see Tukey (1986) for definitions).
(iii) Plot the smoothed values against x.
(iv) Fit an additional regression, perhaps of the form

smoothed medi = A + Bx + C{exp(cx) − exp(cx̃)}/c


where x̃ is the median of the xs and, perhaps, c is chosen to make max{c|x − x̃|} = 0.5.

(v) If the estimate of C is significantly different from 0, put

ri = yi − bbiwtxi − abiwt − C{exp(cxi) − exp(cx̃)}/c

and return to step (a).

Next identify outlying |ri′| and leave them unsmoothed.

(a) If ri^sm2 > ri², then γi = 0.
(b) If ri^sm2 < ri², then γi = min{1, (ri² − ri^sm2)/(14 MADi²)}. Then ri^mod2 = γiri² + (1 − γi)ri^sm2.

Here the algorithm is written in terms of ri², which for estimation of σ² itself may be more natural than the use of |ri′| or |ri′|^2/3. If ri² is much greater than ri^sm2, then ri^mod2 = ri². The smooth transition between ri² and ri^sm2 should help to protect us against relatively rough heterogeneity.

Now calculate the weighted (least squares) regression, with weights 1/ri^mod2.

(a) Let wi* = 1/ri^mod2, x̄w = Σwi*xi/Σwi*.
(b) Then bnew = Σwi*(xi − x̄w)yi/Σwi*(xi − x̄w)².
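The γi blend and the weighted regression for bnew might be sketched as follows (our own names; the denominator 14 MADi² follows our reading of the garbled step (b), so treat the constant as an assumption):

```python
def bnew_slope(x, y, r2, rsm2, mad):
    """Blend raw squared residuals r2 with smoothed ones rsm2 via
    gamma_i, then weighted least squares with weights 1/r_mod2
    (Section A.2; a sketch, not the authors' exact code)."""
    gamma, rmod2 = [], []
    for ri2, si2, mi in zip(r2, rsm2, mad):
        # gamma_i = 0 when the smooth exceeds the raw squared residual,
        # else min{1, (r^2 - rsm^2)/(14 MAD^2)}
        g = 0.0 if si2 >= ri2 else min(1.0, (ri2 - si2) / (14 * mi * mi))
        gamma.append(g)
        rmod2.append(g * ri2 + (1 - g) * si2)   # smooth transition
    w = [1.0 / m2 for m2 in rmod2]
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
    num = sum(wi * (xi - xw) * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    return num / den, gamma

# homogeneous case: r2 == rsm2, so every gamma_i = 0 and b_new reduces to
# equally weighted least squares
x = list(range(10))
y = [3 * xi + 2 for xi in x]
b, g = bnew_slope(x, y, [1.0] * 10, [1.0] * 10, [0.5] * 10)
```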

Finally, combine bbiwt with bnew if the smooth heterogeneity may be no more than slightly varying.

(a) Let

d = med[0, 1, {π Σ(|ri′|/sd − √(2/π))²/(π − 2) − (count − 2)}/√{2(count − 2)}]

where the sum is over the residuals with γi < 0.25, count is the number of residuals with γi < 0.25 and sd is the standard deviation of the residuals with γi < 0.25.

(b) Then bauto = dbnew + (1 − d)bbiwt.

The unfamiliar numbers found in step (a) provide adjustments to more nearly constant location and scale. If we know that heterogeneity is substantial, bnew is preferable to bauto, delivering up to 10% more efficiency.
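The final combination might be sketched as follows; the d formula here is our reading of a badly garbled original (√(2/π) is the half-normal mean and (π − 2)/π its variance under Gaussian homogeneity), so it is an assumption rather than the authors' exact expression:

```python
import math
import statistics as st

def b_auto(bnew, bbiwt, resid, gamma, cut=0.25):
    """Combine b_new and b_biwt as b_auto = d*b_new + (1 - d)*b_biwt.
    d standardizes the excess of sum((|r|/sd - sqrt(2/pi))^2) over its
    Gaussian-homogeneity expectation, clamped to [0, 1] (the
    med[0, 1, .] operation). Our reading of step (a); a sketch."""
    sel = [r for r, g in zip(resid, gamma) if g < cut]   # gamma_i < 0.25
    count = len(sel)
    sd = st.stdev(sel)
    ssq = sum((abs(r) / sd - math.sqrt(2 / math.pi)) ** 2 for r in sel)
    z = (math.pi * ssq / (math.pi - 2) - (count - 2)) / math.sqrt(2 * (count - 2))
    d = min(1.0, max(0.0, z))        # med[0, 1, z]
    return d * bnew + (1 - d) * bbiwt, d

# d lies in [0, 1], so b_auto is always between b_new and b_biwt
resid = [(-1) ** i * (0.5 + 0.1 * i) for i in range(12)]
bauto, d = b_auto(2.1, 1.9, resid, [0.0] * 12)
```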

References

Amemiya, T. (1973) Regression analysis when the variance of the dependent variable is proportional to the square of its expectation. J. Am. Statist. Ass., 68, 928-934.
Belsley, D. A., Kuh, E. and Welsch, R. E. (1980) Regression Diagnostics. New York: Wiley.
Birch, J. B. and Binkley, D. A. (1983) Effects of non-normality and mild heteroscedasticity on estimators in regression. Communs Statist. Simuln Computn, 12, 331-354.
Bloch, D. A. and Moses, L. E. (1988) Non-optimally weighted least squares. Am. Statistn, 42, 50-53.
Carroll, R. and Ruppert, D. (1982) Robust estimation in heteroscedastic linear models. Ann. Statist., 10, 429-441.
Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. (1983) Graphical Methods for Data Analysis. Belmont: Wadsworth.
Cohen, M., Dalal, S. and Tukey, J. W. (1983) Robust, smoothly heterogeneous variance regression. Technical Report. Bell Laboratories, Murray Hill.
Cook, R. D. and Weisberg, S. (1982) Residuals and Influence in Regression. New York: Chapman and Hall.
Giltinan, D. M., Carroll, R. J. and Ruppert, D. (1986) Some new estimation methods for weighted regression when there are possible outliers. Technometrics, 28, 219-230.
Hald, A. (1952) Statistical Theory with Engineering Applications, p. 564. New York: Wiley.
Huber, P. J. (1981) Robust Statistics. New York: Wiley.


Johnstone, I. and Velleman, P. (1985) The resistant line and related regression methods. J. Am. Statist. Ass., 80, 1041-1059.
Mosteller, F. and Tukey, J. W. (1977) Data Analysis and Regression. Reading: Addison-Wesley.
Rutemiller, H. C. and Bowers, D. A. (1968) Estimation in a heteroscedastic regression model. J. Am. Statist. Ass., 63, 552-557.
Tukey, J. W. (1986) Thinking about non-linear smoothers. Technical Report 291. Department of Statistics, Princeton University, Princeton.
