ON THE ESTIMATION OF REGRESSION COEFFICIENTS WITH A
COORDINATEWISE MEAN SQUARE ERROR CRITERION OF GOODNESS

Burt S. Holland

Institute of Statistics Mimeograph Series No. 693, July 1970
TABLE OF CONTENTS

LIST OF TABLES

1. INTRODUCTION
   1.1 The Model
   1.2 Multicollinearity
   1.3 Estimation with a Mean Square Error Criterion of Goodness
   1.4 Multivariate Generalizations of Mean Square Error

2. REVIEW OF LITERATURE
   2.1 The Stein-James Estimator
   2.2 Test-Estimation Procedures
   2.3 Ridge Regression

3. ALTERNATIVE ESTIMATORS OF β
   3.1 Construction of the Estimators
       3.1.1 b_2
       3.1.2 b_3
       3.1.3 b_4
       3.1.4 b_5
       3.1.5 b_6
   3.2 Asymptotic Distribution Theory of the Estimators
   3.3 Relative Mean Square Efficiency of the Estimators: A Simulation Experiment
   3.4 Discussion of the Estimators

4. SUMMARY, CONCLUSIONS AND RECOMMENDATIONS
   4.1 Summary
   4.2 Conclusions and Recommendations

5. LIST OF REFERENCES

6. APPENDIX: THE SIMULATION DESIGN AND PROGRAM
LIST OF TABLES

3.1 Estimated relative efficiencies Ê_2, Ê_3, and Ê_5, based on N = 500 iterations

3.2 Relative efficiency Ê_6 for various values of q_j
1. INTRODUCTION

1.1 The Model

Consider the linear model

Y = Xβ + e ,  (1.1)

where X = [x_tj] is an n × p matrix of known fixed quantities with rank p (p ≤ n), β = (β_1, ..., β_p)' a p × 1 vector of unknown parameters to be estimated, and e an n × 1 vector of random variables distributed with mean vector 0 and dispersion matrix σ²I, σ² unknown. The Gauss-Markov Theorem (Graybill, 1961, pp. 115-116) states that for this model the minimum variance linear unbiased estimator of β is given by

b_1 = (X'X)⁻¹X'Y ,  (1.2)

with dispersion matrix σ²(X'X)⁻¹. By "minimum variance" of the vector estimator b_1 we mean here that var(b_1j) ≤ var(b*_1j) for each j = 1, 2, ..., p, where b*_1 = (b*_11, b*_12, ..., b*_1p)' is any other linear unbiased estimator of β.

The minimum variance quadratic unbiased estimator of σ² is given by

σ̂² = (Y - Xb_1)'(Y - Xb_1) / (n - p) .  (1.3)
When interest lies in constructing confidence regions or tests of hypotheses concerning functions of β and σ², further specification of the distribution of e is necessary. If, as is often the case, we can assume that e follows a multivariate normal probability law, b_1 and σ̂² are the jointly sufficient maximum likelihood estimators of their expectations. Also under this assumption, (n - p)σ̂²/σ² ~ χ²(n - p) and var(σ̂²) = 2σ⁴/(n - p).
In this thesis we are concerned with the estimation of β where, in order to elucidate insofar as possible the mechanism underlying the model, primary interest lies in efficient structural estimation. We are less interested here in improved prediction of the response corresponding to a given p-tuple (x_t1, x_t2, ..., x_tp), or in judiciously choosing such a p-tuple to optimize the response Y.
1.2 Multicollinearity

Despite its desirable properties, the estimator b_1 occasionally leaves a lot to be desired. If X (and hence X'X) is nearly singular so that |X'X| is "small," the variances of the {b_1j} get very large and the estimators themselves become quite sensitive to changes in specification of the model. This ill-conditioning of the model, often referred to as multicollinearity, has long been a prickly problem for investigators in all fields of application.
There are several avenues of approach to the multicollinearity problem. The most satisfactory one is to insist upon additional information (e.g., more sample data or elaboration of the model). If, as is often the case, such information is unobtainable, some investigators would consider the possibility of reducing the dimensionality of the problem,¹ e.g., discarding some of the regressors. However, if the theoretical considerations underlying the construction of the model are not to be neglected, this approach is inappropriate when the statistician's aim is structural estimation. The incorrect omission of an important though multicollinear variable from the list of independent variables introduces a perceptible bias in the estimation of the remaining coefficients.²
This thesis follows a third route in considering some slightly biased estimators of β. In particular, we consider alternative estimators obtained by modifying the criterion of goodness from linear minimum variance unbiasedness to smallness of mean square error (m.s.e.). It is felt that in practice, few people would seriously object to this minor change in loss structure.
1.3 Estimation with a Mean Square Error Criterion of Goodness

For estimating the scalar parameter θ, the m.s.e. of the estimator θ̂ (having finite variance) is given by

m.s.e.(θ̂; θ) = E(θ̂ - θ)² = E[θ̂ - E(θ̂)]² + [E(θ̂) - θ]² = var(θ̂) + bias²(θ̂) .

Implicit in the adoption of this risk function is the acceptance of bias in the estimation if the reduction in variance surpasses the newly introduced squared bias term. When there is substantial multicollinearity in the model (1.1), the large variances of the estimators mean that even a small percentage reduction in variance can be appreciable in absolute terms.

¹See Draper and Smith, 1969.

²See Farrar and Glauber (1967) for an excellent account of the multicollinearity problem and some proposed remedies.
Whereas minimum variance unbiasedness stipulates that θ̂ be made close to E(θ̂) subject to the condition E(θ̂) = θ, the mean square error criterion requires that θ̂ be close to θ itself. Both [θ̂ - E(θ̂)]² and (θ - θ̂)² are attractive loss functions in that they possess the property of convexity. But in practice, minimum variance unbiased (m.v.u.) estimators are usually easy to construct while minimum m.s.e. estimators are not. For unbiased estimation with a strictly convex loss function, the Rao-Blackwell Theorem (Fraser, 1966, p. 57) gives an explicit procedure for construction of the unique m.v.u. estimator when there exists a complete sufficient statistic for the family of densities {P_θ; θ ∈ Ω}. There is no analogous result for minimum m.s.e. estimation.
As an illustration of the difficulty in obtaining minimum m.s.e. estimators, suppose μ̂ is the m.v.u. estimator of a scalar parameter μ and one asks what constant c will minimize

m.s.e.(cμ̂; μ) = c² var(μ̂) + (c - 1)² μ² .  (1.4)

The optimum c is clearly μ² / [μ² + var(μ̂)], so that

μ̃ = μ̂ μ² / [μ² + var(μ̂)]  (1.5)

has smaller m.s.e. than μ̂. However, μ̃ cannot be computed without prior knowledge of the value of the ratio μ² / var(μ̂). This dilemma occurs when we wish to use a modification of the arithmetic mean of a random sample of size n to estimate the unknown mean of a population having unknown variance. Such interference of "nuisance parameters" has led Kendall and Stuart (1967, p. 22) to state,

    Minimum m.s.e. estimators are not much used, but it is as well to recognize that the objection to them is a practical, rather than theoretical one.
If it should happen that var(μ̂) = λμ², λ known, we can get a minimum m.s.e. estimator for μ. As an example, consider the estimation of σ² in the model (1.1) with normal errors:

σ̃² = σ̂² (σ²)² / [(σ²)² + 2σ⁴/(n - p)] = (n - p) σ̂² / (n - p + 2)

and

m.s.e.(σ̃²; σ²) = 2σ⁴ / (n - p + 2) < 2σ⁴ / (n - p) = m.s.e.(σ̂²; σ²) .
Where minimum m.s.e. estimators do not exist, it may still be possible to construct estimators that have smaller m.s.e. than traditional estimators over a wide range of values for various unknown parameters. We shall see that this is the case when estimating β in (1.1).
1.4 Multivariate Generalizations of Mean Square Error

Thus far the discussion has been confined to the estimation of a single parameter. Upon moving to the multivariate problem, we must adopt a suitable generalization of m.s.e. to the case of a vector estimator θ̂ = (θ̂_1, θ̂_2, ..., θ̂_p)' of a vector parameter θ = (θ_1, θ_2, ..., θ_p)'.

One attractive generalization that has been proposed (Bhattacharya, 1966) is

m.s.e.(θ̂; θ) = E(θ̂ - θ)'W(θ̂ - θ) ,  (1.6)

where W is a p × p symmetric positive definite matrix of known weights. (This guarantees that the corresponding loss function is nonnegative and convex.) W is often taken to be D, a diagonal matrix of positive elements, or more particularly, the identity matrix:

m.s.e.(θ̂; θ) = E(θ̂ - θ)'(θ̂ - θ) .  (1.7)

Geometrically inclined readers will refer to (1.6) as the expected (squared) distance between θ̂ and θ in the metric of W.
Instead of employing a single criterion such as (1.6) we shall, in this thesis, construct vector estimators by simultaneously attending to the p univariate problems of rendering {m.s.e.(θ̂_j; θ_j), j = 1, 2, ..., p} as small as possible. Thus we would call θ̂ "better" than θ̃ if

E(θ̂_j - θ_j)² ≤ E(θ̃_j - θ_j)²  (1.8)

for every j = 1, 2, ..., p, and at least one of the inequalities is strict. If θ̂ is better than θ̃ according to (1.8), it is better also according to (1.6) if W = D (but not necessarily if W ≠ D). However, the converse of this statement is clearly false.

The criterion (1.8) lacks the geometric appeal of (1.6). It fails to take account of the cross-product terms {E(θ̂_j - θ_j)(θ̂_ℓ - θ_ℓ)}, an omission that may not be warranted.
The difficulty with the employment of (1.6) lies in the dilemma of choosing a satisfactory W. Unless W is judiciously chosen, (1.8) may be preferable to (1.6) in that if the latter is used:

(i) Some of the {θ_j} may be poorly estimated (although others are not).

(ii) Unjustified emphasis may be placed on the efficient estimation of some subset of {θ_j} relative to another subset.

(iii) The criterion is not scale invariant.
In view of our declared aim of estimation rather than efficient prediction or optimization, it seems unwise, in the absence of further information, to consider a means of estimation that may perform well for one part of the model to the detriment of another.³ Our criterion attends to the estimation of each β_j apart from that for the remaining elements of β. Recall too that the Gauss-Markov Theorem chooses β̂ = b_1 to minimize each E[β̂_j - E(β̂_j)]² rather than E[β̂ - E(β̂)]'W[β̂ - E(β̂)] for some W.

An even more restrictive criterion than (1.8) is to call θ̂ "better in m.s.e." than θ̃ iff m.s.e.(h'θ̂) ≤ m.s.e.(h'θ̃) for every h (p × 1), or, equivalently, iff [E(θ̃ - θ)(θ̃ - θ)' - E(θ̂ - θ)(θ̂ - θ)'] is a positive semi-definite matrix.⁴

Fraser (1966, p. 60) states a multivariate generalization of the Rao-Blackwell Theorem which employs the notion of "ellipsoid of concentration" as the multivariate generalization of variance.

³The m.s.e. of prediction of a "future" response Y_t corresponding to a stochastic p-tuple (x_t1, x_t2, ..., x_tp) with dispersion matrix W is given by σ² + E(β̂ - β)'W(β̂ - β).

⁴See Toro-Vizcarrondo and Wallace (1968, p. 560).
2. REVIEW OF LITERATURE

2.1 The Stein-James Estimator

Like most investigators, James and Stein (1961) have preferred to use criterion (1.7). For the special case where X'X = I (e.g., orthogonal polynomial regression), e multivariate normal, and p ≥ 3, they have shown that

β̂(γ) = [1 - γσ̂² / Y'XX'Y] X'Y  (2.1)

is uniformly (in β and σ²) better than b_1 = X'Y, for γ any positive number less than 2(p - 2)(n - p)/(n - p + 2), and that m.s.e.[β̂(γ); β] is minimized by taking γ = (p - 2)(n - p)/(n - p + 2). The coefficient [1 - γσ̂²/Y'XX'Y] will be between 0 and 1 for all admissible γ whenever Y'XX'Y / σ̂² > 2(p - 2)(n - p)/(n - p + 2). Thus the Stein-James estimator (2.1) is simply a (scalar) constant multiple of b_1. They merely prove that β̂(γ) is better than b_1, omitting the motivation behind its construction. Normality is not necessary to render b_1 inadmissible, but in its absence no alternative estimator has been proposed.

Baranchik (1970) has generalized (2.1) to allow γ to be a certain bounded function of F_{p,n-p}.
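In the orthonormal case, (2.1) amounts to a single scalar shrinkage of b_1 = X'Y, with Y'XX'Y = b_1'b_1. A minimal Python sketch follows; the summary statistics below are invented for illustration, and γ is set at the m.s.e.-minimizing value quoted above.

```python
# Hypothetical orthonormal-regression summary statistics (X'X = I, p = 3).
p, n = 3, 20
b1 = [2.0, -1.0, 0.5]           # b1 = X'Y when X'X = I
s2 = 1.2                        # sigma_hat^2

# Y'XX'Y = b1'b1 in the orthonormal case.
yxxy = sum(b * b for b in b1)

gamma = (p - 2) * (n - p) / (n - p + 2)   # m.s.e.-minimizing gamma
coef = 1.0 - gamma * s2 / yxxy            # the scalar multiplier in (2.1)
beta_hat = [coef * b for b in b1]
print(coef, beta_hat)
```

For these numbers the multiplier lies strictly between 0 and 1, so every coordinate of b_1 is pulled toward zero by the same factor.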
Stein (1956) has also shown that b_1 is admissible when p ≤ 2 [the risk function still being (1.7)]. This means that in the two-regressors case with criterion (1.8), we cannot simultaneously improve upon b_11 and b_12 for all possible values of β and σ².

Sclove (1966, 1968) has pointed out that β̂(γ) generalizes to the case X'X ≠ I provided that we take W = X'X. Thus (1.6) now appears as

m.s.e.(β̂; β) = E(β̂ - β)'X'X(β̂ - β) .  (2.2)

Since X'X is proportional to the inverse of the variance-covariance matrix of b_1, this choice of W removes objection (iii) in Section 1.4, and goes a long way toward the withdrawal of (i). Bhattacharya (1966) has indicated that for W ≠ I and arbitrary X'X, we can at least transform the problem to the W = D case.

Sclove's paper (1968) surveys all the literature discussed in this section in order to interpret some highly mathematical results for the benefit of applied statisticians.
2.2 Test-Estimation Procedures

Consider the model

Y_t - Ȳ = β_1(x_t1 - x̄_1) + β_2(x_t2 - x̄_2) + e_t ,  (2.3)

which differs from (1.1) with p = 2 only in that the variables are now corrected for their means.⁵ Bancroft (1944) has discussed the estimator β̃_1 of β_1 specified as follows. Perform the level α Student's t test of the hypothesis H_0: β_2 = 0 vs H_a: β_2 ≠ 0. Then let

β̃_1 = b_11  if H_0 is rejected;

β̃_1 = Σ_{t=1}^n (x_t1 - x̄_1)(Y_t - Ȳ) / Σ_{t=1}^n (x_t1 - x̄_1)²  if H_0 is accepted.

Toro-Vizcarrondo (1968) has investigated the same estimator where a m.s.e. test is used in place of Student's t.⁶

Baranchik (1964) considered a modification of the Stein-James estimator where β̂(γ) is taken to be the null vector when a preliminary F test of H_0: β = 0 vs H_a: β ≠ 0 is accepted.

These so-called "test-estimation" procedures are actually more akin to the "discarding regressors" approach to multicollinearity than they are to the "new estimator" procedures to be investigated in the next chapter.

⁵The distinction between (2.3) and (1.1) will be further discussed in Section 3.3 below.
⁶See Toro-Vizcarrondo and Wallace (1968).

2.3 Ridge Regression

Hoerl and Kennard (1970a) present the estimator

β̂(k) = (X'X + kI)⁻¹ X'Y ,  (2.4)
where the scalar k is chosen so as to stabilize the system by making the estimator less sensitive than b_1 to small changes in model specification. It is demonstrated that with risk as in (1.7) there are admissible values of k such that β̂(k) is better than b_1. However, no explicit expression for k is presented. The authors suggest that its choice be based on a graph of the {β̂_j(k)} vs k (called a "ridge trace"); that k be the smallest value such that for k* > k, the {β̂_j(k*)} are nearly independent of k*. They claim that the k that one will employ in practice will only slightly increase the error sum of squares [Y - Xβ̂(k)]'[Y - Xβ̂(k)]. It is also noted that (2.4) is the Bayes estimator of β when the parameter vector has a prior distribution with mean 0 (p × 1) and dispersion (σ²/k)I (p × p).
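As a concrete illustration of (2.4), the Python sketch below (the near-collinear two-regressor data are invented, and the ridge trace is simply printed rather than plotted) evaluates β̂(k) for several k and checks that k = 0 recovers b_1.

```python
# Hypothetical ill-conditioned two-regressor problem (near-collinear columns).
X = [[1.0, 1.01], [2.0, 1.98], [3.0, 3.03], [4.0, 3.97], [5.0, 5.05]]
Y = [2.0, 4.1, 5.9, 8.2, 9.9]
p = 2

S = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]  # X'X
v = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(p)]          # X'Y

def ridge(k):
    """(X'X + kI)^{-1} X'Y, written out for the 2 x 2 case."""
    a, b, c, d = S[0][0] + k, S[0][1], S[1][0], S[1][1] + k
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (-c * v[0] + a * v[1]) / det]

b1 = ridge(0.0)            # k = 0 is the Gauss-Markov estimator
for k in (0.0, 0.1, 0.5, 1.0, 5.0):
    print(k, ridge(k))     # a crude "ridge trace"
```

The squared length of β̂(k) shrinks as k grows, which is the stabilizing effect the ridge trace is meant to display.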
A second paper by the same authors (Hoerl and Kennard, 1970b) contains illustrations of the performance of the ridge-trace procedure.
3. ALTERNATIVE ESTIMATORS OF β

3.1 Construction of the Estimators

For lack of any other means of approach, all estimators considered here are essentially modifications of b_1. Analogous to the situation for scalar estimation discussed in Section 1.3, we find that our modified estimators contain the parameters β and σ². That is,

β̂ = β̂(β, σ²) .  (3.1)
In the form (3.1), β̂ is clearly of no use. Two procedures were considered for the construction of employable estimators:

(i) In (3.1) set σ² = σ̂² and let our estimator of β be the solution b of the equation

b = β̂(b, σ̂²) ,  (3.2)

to be determined by iteration or otherwise;

(ii) In (3.1) set σ² = σ̂² and β = b_1 to obtain

b = β̂(b_1, σ̂²) .  (3.3)

Procedure (i) was quickly dismissed. For all of the new estimators, rules for choosing good starting values to prime the iteration were hard to come by. Often the iteration diverged, or converged too slowly to be of practical use.

Procedure (ii) was the one chosen to be employed here. This has the advantage of rendering an estimator that is a function of the sufficient statistics b_1 and σ̂². It seemed intuitively desirable not to depart from the use of sufficient statistics.
The "raw form" estimators [(3.1) as opposed to (3.3)] to be computed are minimum m.s.e. estimators, but the employable estimators of form (3.3) are not. The question to be answered was, "Does there exist a b that is 'appreciably' better than b_1 over a 'wide' range of possible values of β and σ²?"

We shall denote the five new estimators by b_2, b_3, b_4, b_5, and b_6. When in the "raw form," prior to making the substitutions β = b_1 and σ² = σ̂², we shall write the estimators as b°_i. The j-th component of any vector h will be written h_j or (h)_j. In this section, an asterisk superscript denotes the optimal value of any variable.

The reader is reminded that the objective used to compute the "raw form" estimators is separate minimization of E(β̂_j - β_j)² for each j = 1, 2, ..., p.
3.1.1 b_2

To construct b°_2, we attempt to find that p × p matrix K which optimizes (in the indicated sense) the estimator Kb_1. Let k_j' denote the j-th row of K. Then b°_2j = k_j*'b_1, where k_j* is chosen to minimize E(k_j'b_1 - β_j)². We compute:

bias(k_j'b_1; β_j) = k_j'β - β_j ;

var(k_j'b_1) = σ² k_j'(X'X)⁻¹k_j ;

m.s.e.(k_j'b_1; β_j) = σ² k_j'(X'X)⁻¹k_j + (k_j'β - β_j)² .  (3.4)
Differentiating with respect to the vector k_j and setting the result equal to the null vector, we obtain

2σ²(X'X)⁻¹k_j + 2(k_j'β - β_j)β = 0 .  (3.5)

Notice that the second derivative with respect to k_j is a positive definite matrix. Continuing from (3.5),

[σ²(X'X)⁻¹ + ββ'] k_j = β_j β ,

whence

k_j* = β_j [σ²(X'X)⁻¹ + ββ']⁻¹ β

and

b°_2j = β_j β'[σ²(X'X)⁻¹ + ββ']⁻¹ b_1 .  (3.6)

Employing the identity

(A + uv')⁻¹ = A⁻¹ - A⁻¹uv'A⁻¹ / (1 + v'A⁻¹u) ,  (3.7)

where A is a square nonsingular matrix, u and v are column vectors, and (A + uv') is nonsingular, we can write

[σ²(X'X)⁻¹ + ββ']⁻¹ = (1/σ²)[I - X'Xββ'/(σ² + β'X'Xβ)] X'X .

Then

b°_2 = (ββ'/σ²)[I - X'Xββ'/(σ² + β'X'Xβ)] X'Y

and, substituting b_1 for β and σ̂² for σ²,

b_2 = (b_1b_1'/σ̂²)[I - X'Xb_1b_1'/(σ̂² + b_1'X'Xb_1)] X'Y
    = (b_1/σ̂²)[1 - b_1'X'Xb_1/(σ̂² + b_1'X'Xb_1)] b_1'X'Y
    = [b_1'X'Y / (σ̂² + b_1'X'Y)] b_1 ,  (3.8)

since b_1'X'Xb_1 = b_1'X'Y.
Recall that b_1'X'Y is the regression sum of squares and σ̂² is the error mean square in the standard Analysis of Variance table.

At first glance it seems as though we have found K* to be a constant [b_1'X'Y / (σ̂² + b_1'X'Y)] multiple of the identity matrix. This is not the case, however, because the end result (3.8) is a consequence of having substituted b_1 for β.
3.1.2 b_3

This estimator constrains the K of Section 3.1.1 to be a diagonal matrix with j-th entry m_j. Thus b°_3j = m_j* b_1j. From (1.5) we find that

m_j* = β_j² / (β_j² + σ² s^jj) ,

so that

b_3j = [b_1j² / (b_1j² + σ̂² s^jj)] b_1j ,

where s^jj denotes the j-th diagonal element of (X'X)⁻¹.
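For a small two-regressor data set, b_2 and b_3 can be computed directly from b_1 and σ̂². The Python sketch below is an illustration of (3.8) and the b_3 formula on invented toy data; it is not the thesis' own program.

```python
# Toy two-regressor data (hypothetical); model Y = X beta + e, no intercept.
X = [[1.0, 0.9], [2.0, 1.8], [3.0, 3.2], [4.0, 3.9], [5.0, 5.2]]
Y = [2.1, 3.9, 6.4, 8.1, 10.3]
n, p = len(X), 2

# Cross products: S = X'X, v = X'Y.
S = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
v = [sum(X[t][i] * Y[t] for t in range(n)) for i in range(p)]

# Invert the 2 x 2 matrix X'X and form b1 = (X'X)^{-1} X'Y.
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]
b1 = [Sinv[i][0] * v[0] + Sinv[i][1] * v[1] for i in range(p)]

# Error mean square (1.3): sigma_hat^2 = (Y - X b1)'(Y - X b1) / (n - p).
resid = [Y[t] - sum(X[t][i] * b1[i] for i in range(p)) for t in range(n)]
s2 = sum(r * r for r in resid) / (n - p)

# b2 from (3.8): one shrinkage factor; the regression SS is b1'X'Y.
reg_ss = b1[0] * v[0] + b1[1] * v[1]
c = reg_ss / (s2 + reg_ss)
b2 = [c * b for b in b1]

# b3: coordinatewise shrinkage, with s^jj the j-th diagonal of (X'X)^{-1}.
b3 = [b1[j] ** 2 / (b1[j] ** 2 + s2 * Sinv[j][j]) * b1[j] for j in range(p)]

print(b1, b2, b3)
```

Both modifications pull each coordinate of b_1 toward zero, b_2 by a common factor and b_3 coordinate by coordinate.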
3.1.3 b_4

The estimators b_2 and b_3 are multiplicative modifications of b_1. We now consider an additive modification, b_4, whose raw form is written

b°_4 = b_1 + DX'Y ,

where D is a p × p diagonal matrix. Like the Hoerl and Kennard procedure discussed in Section 2.3, this estimator attempts to reduce the p variances by altering the diagonal elements of (X'X)⁻¹.
The estimator b°_4j is found by computing the minimum with respect to D of the j-th diagonal element of the "m.s.e. matrix"

E(b_1 + DX'Y - β)(b_1 + DX'Y - β)'
  = E[(b_1 - β) + DX'Xβ + DX'e][(b_1 - β) + DX'Xβ + DX'e]'
  = σ²(X'X)⁻¹ + σ²(D + D') + σ²DX'XD' + DX'Xββ'X'XD' .

Apart from the σ²(X'X)⁻¹ term, which is free of D, the j-th diagonal element is found to be

2σ²d_j + d_j²[σ²s_jj + (z_j'β)²] ,

where z_j is the j-th row of X'X, d_j is the j-th diagonal element of D, and [s_jt] = X'X. Thus the j-th diagonal element of D* is

d_j* = -σ² / (σ²s_jj + z_j'ββ'z_j) .

Therefore,

b°_4j = b_1j - σ²(X'Y)_j / (σ²s_jj + z_j'ββ'z_j)

and, substituting b_1 for β and σ̂² for σ²,

b_4j = b_1j - σ̂² Σ_{t=1}^n x_tj Y_t / [σ̂²s_jj + (Σ_{t=1}^n x_tj Y_t)²] .  (3.9)

Attempts to combine the p equations (3.9) into a compact expression for the b_4 vector were unsuccessful.
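The substitution β → b_1 in the step above rests on the identity z_j'b_1 = (X'Xb_1)_j = (X'Y)_j. The following Python sketch (invented toy data again, not from the thesis) checks that identity numerically and then evaluates (3.9).

```python
# Toy data (hypothetical), two regressors, no intercept.
X = [[1.0, 0.5], [2.0, 1.2], [3.0, 2.1], [4.0, 2.6], [5.0, 3.7]]
Y = [1.8, 4.1, 6.2, 7.9, 10.2]
n, p = len(X), 2

S = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]  # X'X
v = [sum(X[t][i] * Y[t] for t in range(n)) for i in range(p)]            # X'Y

det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]
b1 = [Sinv[i][0] * v[0] + Sinv[i][1] * v[1] for i in range(p)]

resid = [Y[t] - sum(X[t][i] * b1[i] for i in range(p)) for t in range(n)]
s2 = sum(r * r for r in resid) / (n - p)   # sigma_hat^2

# Identity behind the substitution beta -> b1: z_j' b1 = (X'Y)_j.
for j in range(p):
    zjb1 = S[j][0] * b1[0] + S[j][1] * b1[1]
    assert abs(zjb1 - v[j]) < 1e-6

# b4 per (3.9): an additive correction to each coordinate of b1.
b4 = [b1[j] - s2 * v[j] / (s2 * S[j][j] + v[j] ** 2) for j in range(p)]
print(b4)
```

Because every (X'Y)_j is positive for these data, each correction is negative and b_4j lies below b_1j.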
3.1.4 b_5

The fourth proposed estimator is radically different from the first three in both appearance and conception. We consider first the case p = 2:

b°_5j = {[I - S(I - SX'XS)^(2w_j*+1) S⁻¹] b_1}_j ,  w_j* ≥ 0 ,  (3.10)

where S is the diagonal matrix with j-th entry s_jj^(-1/2). Abbreviating r_12 as r, we see that SX'XS is simply the correlation matrix

    ( 1  r )
    ( r  1 ) ,

while

    (I - SX'XS)^(2w+1) = (     0        -r^(2w+1) )
                         ( -r^(2w+1)        0     ) .  (3.11)
Since rank(X) = 2, we have |r| < 1. Thus (3.11) tends to 0 (2 × 2) as w → ∞, and b°_5j tends to b_1j as w_j* → ∞. For w_j* = 0, b°_5j = (S²X'Y)_j, the Gauss-Markov estimator of β_j when s_12 = 0. It was hoped that with a judicious choice of w_j* between these extremes, b°_5j would be an effective compromise between b_1j and the estimator obtained when the (3 - j)-th column of X is ignored. To find w_j*, the value of w_j that makes the j-th diagonal element of the m.s.e. matrix E(b°_5 - β)(b°_5 - β)' as small as possible, we begin by computing the dispersion matrix of b°_5:

D(b°_5) = σ² [I - S(I - SX'XS)^(2w_j+1) S⁻¹] (X'X)⁻¹ [I - S(I - SX'XS)^(2w_j+1) S⁻¹]' .

One finds that

var(b°_5j) = σ² [1 - 2r^(2w_j+2) + r^(4w_j+2)] / [(1 - r²) s_jj] ,
where ℓ = 3 - j, j = 1, 2. The "bias-cobias" matrix is

{E[(b_1 - β) - S(I - SX'XS)^(2w_j+1) S⁻¹ b_1]}{E[(b_1 - β) - S(I - SX'XS)^(2w_j+1) S⁻¹ b_1]}' ,

the j-th diagonal element of which is found to be

bias²(b°_5j) = r^(4w_j+2) (s_ℓℓ / s_jj) β_ℓ² .

The squared bias of b°_5j is clearly at a maximum when w_j* = 0 and decreases monotonically to zero as w_j* → ∞. Likewise, it can be shown that var(b°_5j) is at a minimum (σ²/s_jj) when w_j* = 0 and increases monotonically to σ²/[(1 - r²)s_jj] as w_j* → ∞. Thus it is not surprising that m.s.e.(b°_5j; β_j) attains a minimum for some positive and finite w_j*. Further calculation reveals this optimal w_j* to be

w_j* = ln{σ² / [σ² + (1 - r²) s_ℓℓ β_ℓ²]} / ln(r²) .  (3.12)

In practice, w_j* is rounded off to the nearest integer. We then obtain b_5j from b°_5j by replacing β_ℓ and σ² in (3.12) with b_1ℓ and σ̂², respectively.
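For p = 2 the matrix power can be expanded, and the j-th coordinate of (3.10) reduces to b_1j + r^(2w+1)(s_ℓℓ/s_jj)^(1/2) b_1ℓ, which makes (3.10) and (3.12) easy to exercise. In the Python sketch below all cross-product quantities, σ², and β_ℓ are invented illustrative values, not data from the thesis.

```python
import math

# Hypothetical p = 2 cross-product quantities.
s11, s22, r = 40.0, 90.0, 0.6
XtY = [25.0, 60.0]                  # (X'Y)_1, (X'Y)_2
sigma2, beta2 = 2.0, 0.5            # values treated as known in (3.12)

s12 = r * math.sqrt(s11 * s22)
det = s11 * s22 - s12 * s12
b1 = [(s22 * XtY[0] - s12 * XtY[1]) / det,
      (-s12 * XtY[0] + s11 * XtY[1]) / det]

def b5_raw(j, w):
    """j-th coordinate of [I - S(I - SX'XS)^(2w+1) S^-1] b1 for p = 2.

    The power reduces to b1j + r^(2w+1) * sqrt(s_ll/s_jj) * b1l, l = 3 - j.
    """
    l = 1 - j
    s = [s11, s22]
    return b1[j] + r ** (2 * w + 1) * math.sqrt(s[l] / s[j]) * b1[l]

# w = 0 endpoint: the one-regressor Gauss-Markov estimator (X'Y)_j / s_jj.
assert abs(b5_raw(0, 0) - XtY[0] / s11) < 1e-9
# Large-w endpoint: b5 tends to b1j.
assert abs(b5_raw(0, 200) - b1[0]) < 1e-9

# Optimal w for j = 1 (so l = 2) from (3.12), then rounded as in practice.
t = sigma2 / (sigma2 + (1 - r * r) * s22 * beta2 ** 2)
w_star = math.log(t) / (2 * math.log(abs(r)))
print(w_star, round(w_star))
```

The two assertions mirror the endpoint behavior noted in the text, and w* falls strictly between the extremes.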
Unfortunately, it was found that except in a very special case, b_5 does not generalize to p > 2 because

(I - SX'XS)^w → 0 (p × p) as w → ∞  (3.13)

fails to hold in general. Furthermore, (3.13) is valid with decreasing frequency as p increases. Consider for example

    SX'XS = ( 1  a  ...  a )
            ( a  1  ...  a )
            ( :  :       : )
            ( a  a  ...  1 )    (p × p) ,

which is an admissible correlation matrix for -1/(p - 1) < a ≤ 1. It can be shown that (3.13) holds here iff |a| < 1/(p - 1). A general necessary and sufficient condition for the validity of (3.13) is that the largest eigenvalue of SX'XS be less than 2.
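The eigenvalue condition can be checked numerically. In the Python sketch below (p = 4 and the two values of a are arbitrary; the threshold there is 1/(p - 1) = 1/3), powers of I - SX'XS are formed directly for the equicorrelation matrix.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, w):
    n = len(A)
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(w):
        R = mat_mul(R, A)
    return R

def max_abs_entry_of_power(p, a, w):
    # M = I - SX'XS for the equicorrelation matrix with off-diagonal a.
    M = [[0.0 if i == j else -a for j in range(p)] for i in range(p)]
    return max(abs(x) for row in mat_pow(M, w) for x in row)

p = 4
# a below 1/(p-1): eigenvalues of SX'XS are 1 + (p-1)a and 1 - a, all
# inside (0, 2), so the power dies out.
assert max_abs_entry_of_power(p, 0.30, 100) < 1e-3
# a above 1/(p-1): largest eigenvalue of SX'XS is 1 + (p-1)a >= 2, and
# the power blows up instead.
assert max_abs_entry_of_power(p, 0.40, 100) > 1.0
print("threshold 1/(p-1) =", 1 / (p - 1))
```

The contrast between a = 0.30 and a = 0.40 is exactly the |a| < 1/(p - 1) boundary stated above.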
Also, the computation of w_j* involves the solution of a (p - 1)-st order difference equation. Its solution is intractable for p > 2.

b_5 does generalize when p > 2 if X'X is "block diagonal" with all blocks 2 × 2 or scalars. Then all β_j corresponding to a column of X that is one of a correlated pair are estimated just as in the p = 2 case, by ignoring the remaining p - 2 columns of X. The remaining β_j are estimated as in (3.10) for the p = 1 case (b_5j = (X'Y)_j / s_jj), for here I - SX'XS = 1 - 1 = 0.
3.2 Asymptotic Distribution Theory of the Estimators

It was found that the normality assumption was necessary in order to obtain any results concerning b_2, b_3, and b_4. We shall assume its presence in what follows. Also, let 0 < λ_1n ≤ λ_2n ≤ ... ≤ λ_pn be the ordered eigenvalues of X'X.
Definition. The sequence of p × p matrices {A_n} = {[a_ij^(n)]} is said to converge to the matrix A = [a_ij] if a_ij^(n) → a_ij for each i, j = 1, 2, ..., p. (Notation: A_n → A, or lim_{n→∞} A_n = A.)

Lemma 1. If

(X'X)⁻¹ → 0 (p × p) as n → ∞ ,  (3.15)

then

β'X'Xβ → ∞ as n → ∞  (3.16)

for all non-null β.
Proof. Note that the eigenvalues of the p.d. matrix (X'X)⁻¹ are the reciprocals of the {λ_jn}, and that (3.15) says s^ij_(n) → 0 as n → ∞ for all i, j = 1, 2, ..., p. Using a theorem from Bodewig (1956), we have for every n

0 < λ_1n⁻¹ ≤ [Σ_{i,j} (s^ij_(n))²]^(1/2) → 0 as n → ∞ ,

hence

λ_1n → ∞ as n → ∞ .  (3.17)

It follows from a result stated by Rao [1965, p. 50, (1f.2.1)] that β'X'Xβ ≥ λ_1n β'β for all non-null β. Then (3.16) must hold, for otherwise we reach a contradiction of (3.17).
Lemma 2. If
--n
..... A as (3.18)
where A is a finite p x p positive definite matrix, then (3.15) holds.
Proof.XiX
Let -- = A .n nThe determinant of a ~quare ~atrix is a
continuous function of its elements, hence A _ A impliesn
det (A ) ..... det (A) > O. It follows immediat~ly from this and from then
well-known formula for the inverse of a nonsingu1ar matrix A in termsn
of its cofactors and determinant ('£'':::' the "method of adjoints"),
o (p x p) .
26
(3.18) is the usual regularity condition assumed in discussions of
large sample properties of estimators in linear models. 7
Lemma 3. The sequence of random variables Un converges in probability
to a random variable U iff for some g > 0 ,
E[~]l+jun-ul
gQ as n_oo,
For a proof of this Lemma, see Loeve (1963, p. 158).
Lemma 4. If any of the three conditions (3.15), (3.16), or (3.18) holds, then plim(σ̂² / b_1'X'Y) = 0.

Proof. Since b_1'X'Y / (pσ̂²) ~ F'(p, n - p; λ'_n), where λ'_n = β'X'Xβ/σ², we have, for arbitrary δ > 0,

P[σ̂² / b_1'X'Y > δ] = P[F'(p, n - p; λ'_n) < 1/(pδ)] .  (3.19)

From Lemmas 1 and 2 we see that any of the three conditions mentioned above implies that λ'_n → ∞ as n → ∞. It is a well-known property of the noncentral F distribution that this enables us to conclude that the right-hand side of (3.19) converges to zero, and hence that plim(σ̂²/b_1'X'Y) = 0. The fact that the denominator d.f. of F' is an increasing function of n considerably speeds this convergence.

⁷See for example Malinvaud (1966, p. 174).
Corollary 1. If any of the three conditions (3.15), (3.16), or (3.18) holds, then

E[σ̂² / (σ̂² + b_1'X'Y)] → 0 as n → ∞ .

Proof. Applying Lemmas 3 and 4 with U_n = σ̂²/b_1'X'Y, U = 0, and g = 1, we have that

E[(σ̂² / b_1'X'Y) / (1 + σ̂² / b_1'X'Y)] → 0 as n → ∞ .

The result follows immediately, since the quantity inside the expectation equals σ̂²/(σ̂² + b_1'X'Y).
Lemma 5. If U_n is a sequence of random variables and W a constant, then a sufficient condition for the equalities

plim U_n = lim_{n→∞} E(U_n) = W

is that E(U_n - W)² → 0 as n → ∞.

Proof. The consistency of U_n as an estimator of W is a consequence of Tchebycheff's Inequality, while the asymptotic unbiasedness follows from the inequalities

[E(U_n) - W]² = [E(U_n - W)]² ≤ E(U_n - W)² → 0 .
Theorem 1. If either of the conditions (3.15) or (3.18) holds, then for each j = 1, 2, ..., p, b_2j is a consistent and asymptotically unbiased estimator of β_j.

Proof. First suppose β ≠ 0 (p × 1). Writing b_2j - β_j = [b_1'X'Y/(σ̂² + b_1'X'Y)](b_1j - β_j) - [σ̂²/(σ̂² + b_1'X'Y)]β_j, we have

0 ≤ E(b_2j - β_j)² ≤ E(b_1j - β_j)²
  + 2E{|β_j| [b_1'X'Y/(σ̂² + b_1'X'Y)] [σ̂²/(σ̂² + b_1'X'Y)] |b_1j - β_j|}
  + β_j² E{[σ̂²/(σ̂² + b_1'X'Y)]²} .  (3.20)

Since E(b_1j - β_j)² = σ²s^jj, (3.15) [or, by Lemma 2, (3.18)] implies that the first two terms of (3.20) tend to zero as n → ∞. Corollary 1 establishes the convergence to zero of the third term of (3.20). Hence, applying Lemma 5, we have the desired result.

If, on the other hand, β = 0 (p × 1), then the last two terms of (3.20) are identically zero, and Lemma 4 and Corollary 1 are no longer required for the proof.

Theorem 2. If lim_{n→∞} s^jj = 0, then b_3j is a consistent and asymptotically unbiased estimator of β_j.
The proof of this theorem is almost identical to that of Theorem 1; one uses the fact that b_1j² / (σ̂²s^jj) ~ F'(1, n - p; λ''_n), where λ''_n = β_j² / (σ²s^jj).
Turning now to b_4j, it is seen that to establish its asymptotic unbiasedness and consistency as an estimator of β_j, we merely need to show that the second term on the right-hand side of (3.9) has expectation zero and probability limit zero, respectively. We can rewrite this term as

-b*_1j / (1 + F*) ,  (3.21)

where b*_1j = (X'Y)_j / s_jj is the simple linear regression coefficient obtained when all elements of β other than β_j are ignored, and F* = F*(1, n - p; λ*_n) is a noncentral F random variable with noncentrality parameter

λ*_n = [(s_1j, s_2j, ..., s_pj) β]² / (σ² s_jj) .  (3.22)
The author was unable to show that the expectation of (3.21) tended to zero under certain conditions; however, he is fairly certain that this is the case. The difficulty arises from an inability to separate this random variable into two components whose expectations can be taken separately. That is, we cannot (for example) claim that

E[|b*_1j| (1 + F*)⁻¹] ≤ E|b*_1j| · E(1 + F*)⁻¹ .

Although it seems plausible that |b*_1j| and (1 + F*)⁻¹ should be negatively correlated (since the size of |b*_1j| and that of (1 + F*) are both directly related to |β_j|), attempts to formally establish this result were unsuccessful. However, we can show that under a certain regularity condition the probability limit of (3.21) is zero.

Lemma 6. If λ*_n → ∞ as n → ∞, then plim(1 + F*)⁻¹ = 0.

Proof. P[(1 + F*)⁻¹ > δ] → 0 as n → ∞, in the same fashion as the right-hand side of (3.19) discussed in Lemma 4.
Theorem 3. If s_jj → ∞ and λ*_n → ∞ as n → ∞, then plim b_4j = β_j.

Proof. As noted above, we need only show that the probability limit of (3.21) is zero. Since var(-b*_1j) = σ²/s_jj, s_jj → ∞ as n → ∞ implies, via Tchebycheff's Inequality, that plim[-b*_1j - E(-b*_1j)] = 0. From Lemma 6 we have that plim(1 + F*)⁻¹ = 0. Then, using a result in Cramér (1963, p. 255), it follows that

plim[-b*_1j (1 + F*)⁻¹] = 0 .  (3.23)

As for b_6j(q_j) = q_j b_1j, it clearly has none of the desirable asymptotic properties except in the trivial case β_j = 0.
3.3 Relative Mean Square Efficiency of the Estimators: A Simulation Experiment

Since b_2, b_3, b_4, and b_5 are nonlinear in e, obtaining their p.d.f.'s or even first and second moments seemed a near impossible task. Thus, in order to assess the quality of these estimators, it was necessary to resort to a simulation experiment.

A detailed account of the simulation is deferred until the Appendix. The presentation in this section will encompass the salient results of the study.

Practical limitations on the size of the experiment made it necessary to confine attention almost exclusively to the two-regressors case. Conjectures concerning the nature of generalizations to p > 2 will be made in the following Section and in Chapter 4.
We wish to ascertain how the m.s.e. of b_ij (i > 1) compares with that of b_1j. Therefore we are interested in the relative mean square error efficiencies of b_1j to b_ij, i > 1:

E_i = m.s.e.(b_ij; β_j) / m.s.e.(b_1j; β_j) ,  i > 1 .

From considerations of symmetry, it is clear that E_i is independent of j. The simulation computes estimates {Ê_i} for a range of values of the quantities upon which the {E_i} depend. It uses n = 25 throughout, but from ancillary experiments it was determined that the values of the Ê_i are virtually stable in the range 10 ≤ n < 50. For j = 1, 2, let λ_j = β_j² / (σ²s^jj) denote the noncentrality parameter of the Student's t test of H_0: β_j = 0 vs H_a: β_j ≠ 0. Then it was determined that for estimating β_j, the Ê_i depend only on r, λ_j, and λ_ℓ as follows, where, as before, ℓ = 3 - j, j = 1, 2:

Ê_2 = Ê_2(λ_j, λ_ℓ, r) ,
Ê_3 = Ê_3(λ_j) ,
Ê_5 = Ê_5(λ_ℓ, r) .  (3.24)
It was discovered that Ê_2 depends more heavily on λ_j than on λ_ℓ.
The estimator b_4 was discarded at an early stage of the investigation because it was found that Ê_2 is less than Ê_4 for virtually all λ_j, λ_ℓ, r, and that whenever the contrary is true, it is by a very small margin; moreover, Ê_4 then exceeds Ê_3.

The experiment assumed that e is multivariate normal. It remains to be seen whether departures from this assumption seriously affect the conclusions we shall draw.
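A stripped-down version of such an experiment for Ê_2 is sketched below in Python. The design constants n = 25 and N = 500 follow the text; r, β, σ², the standardized X'X, and the seed are illustrative guesses, and b_1 and σ̂² are drawn directly from their sampling distributions rather than from simulated data, so this is not the thesis' Appendix program.

```python
import math, random

random.seed(2)
n, p, N = 25, 2, 500          # design values from the text
r, sigma2 = 0.5, 1.0
beta = [1.0, 0.5]             # implies moderate lambda_1, lambda_2

# Standardized cross products: X'X taken to be the correlation matrix.
S = [[1.0, r], [r, 1.0]]
det = 1.0 - r * r
Sinv = [[1.0 / det, -r / det], [-r / det, 1.0 / det]]

# Cholesky factor of sigma^2 (X'X)^{-1}, for sampling b1 directly.
c11 = math.sqrt(sigma2 * Sinv[0][0])
c21 = sigma2 * Sinv[1][0] / c11
c22 = math.sqrt(sigma2 * Sinv[1][1] - c21 * c21)

se1 = se2 = 0.0
for _ in range(N):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    b1 = [beta[0] + c11 * z1, beta[1] + c21 * z1 + c22 * z2]
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(n - p))
    s2 = sigma2 * chi2 / (n - p)                 # sigma_hat^2
    # b1'X'Y = b1'X'X b1, the regression sum of squares.
    reg_ss = (b1[0] * (S[0][0] * b1[0] + S[0][1] * b1[1])
              + b1[1] * (S[1][0] * b1[0] + S[1][1] * b1[1]))
    shrink = reg_ss / (s2 + reg_ss)              # b2 factor from (3.8)
    se1 += (b1[0] - beta[0]) ** 2
    se2 += (shrink * b1[0] - beta[0]) ** 2

E2_hat = se2 / se1            # estimated relative efficiency for j = 1
print(E2_hat)
```

With r > 0 and these λ values one expects an estimate below 1, in line with the results reported in Section 3.4.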
The Ê_i are presented in Table 3.1 for 4 values of r, 7 or 8 values of λ (= λ_j or λ_ℓ, whichever is a more important determinant of the Ê_i in question), and 2 or 3 values of λ_ℓ for Ê_2. No formal investigation was made of the reliability of the entries; to have done so would have entailed a large simulation experiment in itself. The author is confident, however, that all entries are accurate to within ±.02, and that the accuracy increases to as fine as ±.0005 as the entries get very close to 1.⁸

⁸A discussion of this statement is given in the Appendix.
In addition to the Ê_i, the actual estimators themselves were computed. b_2j and b_3j were always slightly smaller than b_1j in absolute value, with b_2j usually being very close to b_1j; these observations are actually deducible from inspection of the estimators themselves. b_5j is in a sense the most ambitious estimator because it was occasionally greater than b_1j in absolute value. Estimates were made of the proportion of m.s.e. attributable to squared bias, and that for b_5 was often as large as .25. The proportion for b_2 and b_3 rarely exceeded .15.
Some people would criticize the present model,

Y_t = β_1 x_t1 + β_2 x_t2 + e_t ,  (3.25)

on the grounds that the fitted regression plane is constrained to pass through the origin rather than the point (ȳ, x̄_1, x̄_2); that we should consider instead the model

Y_t = β_1 + β_2(x_t2 - x̄_2) + β_3(x_t3 - x̄_3) + e_t .  (3.26)

The simulation program was incapable of handling this setup, but the similar model

Y_t = β_1 + β_2 x_t2 + β_3 x_t3 + e_t  (3.27)

was subjected to examination. This is (1.1) for p = 3 with s_12 = s_13 = 0, and differs from (3.26) in that b_21 and b_31 are both unequal to ȳ. The result of the simulation of (3.27) was that the performance of the b_2j and b_3j estimators was almost indistinguishable from that of the b_2 and b_3 estimators in (3.25).⁹

Finally, the rounding of w_j* to the nearest integer was found to have a negligible effect on the relative efficiency of b_5j as compared with its employment exactly as in (3.12).
3.4 Discussion of the Estimators

Prior to the commencement of the simulation, it was conjectured that b2 would be the "worst" of the proposed estimators because of the basic simplicity of its modification of b1. The modifying factor is a function only of b'X'Y/σ̂², a summary statistic that indicates the departure of the entire vector β from the null vector. For our coordinatewise criterion of goodness, an estimator that separately modified each b1j in a way peculiar to that coordinate was thought to have a greater chance of success. Moreover, it seemed reasonable to guess that the quality of b2 would decrease as p increased, because b'X'Y/σ̂² contains a relatively decreasing amount of information about each individual b1j as the number of coordinates grows.
The results of the simulation show that at least the first of these conjectures was far from the truth. Ê2 is less than either Ê3 or Ê5 far more often than the contrary is true. For r > 0, Ê2 is only rarely greater than 1 (i.e., b2j is only rarely less efficient than b1j). As r approaches +1, one is increasingly unlikely to encounter a (λ1, λ2) for which Ê2 > 1, but on the other hand, the increases in efficiency over b1j become increasingly negligible.
9This is a rather loose statement because for p > 2, it is not quite certain which r's and λ's are the relevant ones for Ê2 and Ê3.
The asymmetry of Ê2 with respect to changes in the sign of r is a rather puzzling finding. In Table 3.1, the discrepancy between Ê2 in the two cases r = .7 and r = -.7 is far greater than can be accounted for by the stated error inherent in the simulation. The Gauss-Markov estimator is symmetric in r and, intuitively, the estimation problem seems to be of equal difficulty when the sign of r is changed. No adequate analytical explanation has been conceived for this phenomenon.
As previously mentioned, the values b2j were very close to the corresponding b1j. The modifying factor b'X'Y/(b'X'Y + σ̂²) can be written as pF/(pF + 1), where F follows the variance-ratio distribution with p and n - p degrees of freedom and noncentrality parameter β'X'Xβ/σ². Since F is likely to vary directly with λj, we see that the modification can be expected to be considerable for βj near zero, but only slight for βj distant from the origin. For small |βj|, the modification is in the "correct direction;" thus Ê2 is small for λj near zero.
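As a numerical check on this behavior, the sketch below simulates the factor b'X'Y/(b'X'Y + σ̂²) under model (1.1). The sample size, design matrix, and β values are arbitrary illustrative choices, not those of the thesis's experiment.

```python
import numpy as np

def shrink_factor(X, y):
    """b2's modifying factor b'X'y / (b'X'y + sigma_hat^2)."""
    n, p = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)            # Gauss-Markov estimate b1
    reg_ss = float(b @ X.T @ y)                      # regression sum of squares (>= 0)
    s2 = float((y - X @ b) @ (y - X @ b)) / (n - p)  # sigma_hat^2
    return reg_ss / (reg_ss + s2)

rng = np.random.default_rng(0)
n, p = 25, 2
X = rng.standard_normal((n, p))                      # hypothetical design matrix

def avg_factor(beta, reps=200):
    """Average factor over repeated draws of the error vector."""
    return np.mean([shrink_factor(X, X @ beta + rng.standard_normal(n))
                    for _ in range(reps)])

f_small = avg_factor(np.array([0.0, 0.0]))  # lambda_j near zero: heavy shrinkage
f_large = avg_factor(np.array([5.0, 5.0]))  # lambda_j large: factor near 1
```

With β at the null vector the average factor is well below 1, while for β far from the origin it is very nearly 1, matching the discussion above.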
Let us compare b1j with b6j, the family of estimators consisting of constant fractions of b1j, j = 1, 2. From (3.14) we have

    E6(λj) = qj² + (1 - qj)² λj ;

i.e., the relative m.s.e. efficiency of b1j to b6j is actually linear in λj. For purposes of comparison, we present in Table 3.2 a listing of E6 for several values of qj and the same set of λj values appearing in Table 3.1.
Table 3.2 shows that b6j is an extremely attractive alternative to b1j for some range of qj, as long as one is quite confident that λj does not exceed a certain value. As qj increases to 1, the relative efficiency gets very close to 1 even for small values of λj, but at the same time the value of λj below which E6 is less than 1, (1 + qj)/(1 - qj), increases markedly. E6 is minimized at λj = λj0 by choosing qj = λj0/(1 + λj0). Therefore, unless one is an intransigent minimaxer or knows that λj is quite likely to be large, there is probably some qj for which b6j is preferable to both b1j and b2j. If the contrary is true, the choice of some other estimator is indicated. The evaluation of b6 can obviously be carried much further if one is willing to attribute a prior probability distribution to λj.
Table 3.2 Relative efficiency E6 for various values of qj

    qj\λj    .5      1.0     1.5     2       5       10      100
    .30      .335    .580    .825    1.07    2.54    4.99    49.1
    .50      .375    .500    .625    .750    1.50    2.75    25.3
    .70      .535    .580    .625    .670    .940    1.39    9.49
    .90      .815    .820    .825    .830    .860    .910    1.81
    .95      .904    .905    .906    .908    .915    .928    1.15
    .99      .980    .980    .980    .980    .981    .981    .990
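Because E6 has the closed form given above, Table 3.2 can be reproduced exactly. The short sketch below does so for a few entries and also checks the crossover value (1 + qj)/(1 - qj) and the minimizing choice qj = λj0/(1 + λj0) stated in the text.

```python
def E6(q, lam):
    """Relative m.s.e. efficiency of b1j to b6j = q * b1j, from (3.14)."""
    return q**2 + (1 - q)**2 * lam

# a few entries of Table 3.2, rounded as in the table
vals = [(0.30, 0.5), (0.50, 1.0), (0.70, 2.0), (0.90, 100.0)]
table = [round(E6(q, lam), 3) for q, lam in vals]

def crossover(q):
    """b6j beats b1j exactly when lam < (1 + q) / (1 - q)."""
    return (1 + q) / (1 - q)

def best_q(lam0):
    """The q minimizing E6 at lam = lam0: q = lam0 / (1 + lam0)."""
    return lam0 / (1 + lam0)
```

For instance, E6(.30, .5) = .09 + .49(.5) = .335, the first entry of the table, and E6 equals 1 exactly at the crossover value of λj.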
b2 is similar in form to the Stein-James estimator, which is applicable when X'X = I and p > 2. Using the "optimal" γ = (p - 2)(n - p)/(n - p + 2), the modifying constant is seen from (2.1) to be

    1 - (p - 2)(n - p) σ̂² / [(n - p + 2) Y'XX'Y] ,

while that for b2 is Y'XX'Y/(Y'XX'Y + σ̂²). The two are quite similar, and one is led to speculate that the decrease in relative efficiency due to the employment of (2.1) rather than b1 is often negligible.
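To make the comparison concrete, the sketch below evaluates both modifying constants at one set of hypothetical values (p = 4, n = 25, σ̂² = 1, Y'XX'Y = 20; these numbers are illustrative only, not drawn from the thesis).

```python
def stein_james_factor(p, n, s2, S):
    """Modifying constant from (2.1) with gamma = (p-2)(n-p)/(n-p+2); S = Y'XX'Y."""
    return 1 - (p - 2) * (n - p) * s2 / ((n - p + 2) * S)

def b2_factor(s2, S):
    """b2's modifying constant S / (S + sigma_hat^2)."""
    return S / (S + s2)

# hypothetical values chosen only for illustration
p, n, s2, S = 4, 25, 1.0, 20.0
f_sj = stein_james_factor(p, n, s2, S)   # 1 - 42/460, about .909
f_b2 = b2_factor(s2, S)                  # 20/21, about .952
```

Both constants shrink toward the origin by a few percent here, and they differ from one another by under .05, which is in keeping with the speculation above.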
It is fairly safe to conclude from Table 3.1 that Ê2 decreases to 1 as λj → ∞. This observation is easy to explain. Unless βj = 0, the increase in λj corresponds to σ² → 0; i.e., b'X'Y/(b'X'Y + σ̂²) → 1, or b2 → b1. Similar explanations can be given for Ê3 and Ê5. (Recall that E6 increases without bound as λj → ∞.)
Theorem 1 concerning the asymptotic behavior of b2 embodies a regularity condition, (3.15) or (3.18). These conditions require that the sequences {xtj}, t ≥ 1, do not dwell near the origin. The hypothesis of Theorem 2 includes the weaker statement "lim n→∞ s^jj = 0." It seems improbable that any of these requirements will often fail to be met in practice (especially when one is working with time series data).
Since for p = 2 the relative efficiency of b3j depends only on λj, the quality of this estimator is unlikely to undergo substantial change as p increases, so long as p doesn't get too close to n. But the increase in p should have some effect on the precision of the estimators b1j and σ̂², and on the value of s^jj; hence on b1j²/(b1j² + σ̂²s^jj) also.
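The coordinatewise factor b1j²/(b1j² + σ̂²s^jj) can be illustrated directly. The numbers below are invented for the example; they show that a precisely estimated coordinate (small s^jj) is barely shrunk, while an imprecise one is shrunk hard.

```python
def b3_shrink(b1j, s2, sjj):
    """Apply b3's coordinatewise factor b1j^2 / (b1j^2 + sigma_hat^2 * s^jj)."""
    f = b1j**2 / (b1j**2 + s2 * sjj)
    return f * b1j

# illustrative numbers only
precise   = b3_shrink(2.0, s2=1.0, sjj=0.1)   # factor 4/4.1, about .976
imprecise = b3_shrink(2.0, s2=1.0, sjj=10.0)  # factor 4/14, about .286
```

In both cases the estimate moves toward the origin but keeps the sign of b1j, consistent with the form (4.1) discussed later.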
The estimator b3j also was not a priori expected to be successful because, whereas b'X'Y/σ̂² contains too little information concerning the individual b1j, b3j ignores the behavior of all components in b1 aside from b1j. The fact that Ê3 is unaltered by changes in r leads to the conclusion that m.s.e.(b3j; βj) depends on r only to the same extent as var(b1j); i.e., through s^jj. Note also that Ê5 is symmetric in r.

The restriction of b5 to the two regressors case severely limits its applicability. Certain facts that are always true when p = 2 are occasionally false for p ≥ 3--two of these were mentioned in Section 3.1.4--and another is the nonexistence of partial correlations among the columns of X until p > 2. It is intriguing that the Stein-James estimator is valid only for p > 2; i.e., when and only when b5 is (in general) not.
The estimators b2, b3, and b6, as well as the Stein-James estimator, alter b1 by shifting each of its components closer to the origin. This type of modification is the most obvious way to decrease the variance of an estimator:

    var(cθ̂) = c² var(θ̂) < var(θ̂) ,    0 < c < 1 ,    (3.28)

where c is a constant, and one would expect (3.28) to hold even if c is a random variable, unless the choice of c and θ̂ is rather bizarre. This is one explanation for the relative success of b2, b3, and b6 as contrasted with that of b4 and b5.
In addition to their relatively poor performance in "improving" upon b1, the latter two estimators permit an occasional difference in arithmetic sign between b1j and b4j. In many contexts, incorrect estimation of the sign of βj can be a serious error. The first three estimators listed above perform no worse on this score than b1.10
At the outset of this investigation we noted that the need for alternatives to b1 becomes especially acute as |r| ↑ 1. Since r is a known quantity, an alternative to b1 that performs well only for |r| near 1 would have been just as welcome as one that is relatively efficient for all r. Unfortunately, none of the new estimators displays any dramatic overall improvement in relative efficiency as |r| ↑ 1.
10Of course, if the investigator incurs a special loss from incorrect estimation of sign, this information should be included in the specification of his risk function. In practice, however, this is infrequently done.
4. SUMMARY, CONCLUSIONS AND RECOMMENDATIONS

4.1 Summary
This thesis is an attempt to provide alternatives to best linear unbiased (Gauss-Markov) estimation in the general linear hypothesis model of full rank (Graybill, 1961). Alternatives are especially desirable in the presence of multicollinearity, because the variances of the Gauss-Markov estimators may then be excessively large. By changing the criterion of goodness to mean square error in each separate coordinate of the vector estimator, it is occasionally possible to construct slightly biased estimators having far smaller variances than those of the usual estimator. It is felt that when the statistician's aim is efficient structural estimation (rather than prediction), in practice few people would have serious reservations about this minor change in loss structure.
Five new estimators {bi, i = 2, ..., 6} are constructed and presented as prospective applicable alternatives to the Gauss-Markov estimator (called b1 herein). Each of the proposed estimators takes the form of a modification of b1.
The direct determination of the quality of the new estimators was possible only for b6. That of the remaining four estimators was disclosed in the two regressors case by a computer simulation experiment. Statements are made concerning the prospects for generalization of the results of the simulation to situations where there are three or more independent variables. The results of the simulation are presented in a table of estimated relative mean square error
efficiencies of b1 to the {bi}. The entries in the table are found to be dependent upon various combinations of {r, λ1, λ2}, where r denotes the correlation between the two regressors and the {λj} are noncentrality parameters of conventional statistical tests relating to the model. Results concerning the asymptotic properties of the new estimators are given for b2, b3, b4 and b6.
The three estimators b2, b3 and b6 are all of the form

    bij = gi b1j ,    i = 2, 3, 6,    (4.1)

where bij and b1j refer to the j-th coordinate of the vector bi or b1, and gi is a random variable bounded by 0 and 1. As a general rule, these three estimators are found to be preferable by far to either b4 or b5. Their relative efficiencies are less than 1 for a surprisingly wide range of {r, λ1, λ2}. The estimator b6 is in many cases an attractive alternative to b1, but at other times it has some extremely unfavorable properties that premonish against its use. The use of b4 or b5 is not advised under any circumstances.
An estimator that has frequently appeared in the statistical
literature, originally conceived by James and Stein (1961), also
happens to be of the form (4.1). The conclusions which emerge from
the simulation herein lead to some tentative notions about the behavior
of this estimator. The results in this thesis are appraised in the
light of the findings in James and Stein (1961) and recent research
along similar lines by other investigators.
A detailed account of the simulation study with particular
emphasis on its design aspect is presented in the Appendix.
4.2 Conclusions and Recommendations
One of the vital (though too often underemphasized) properties of b1 is its robustness to departures from the distributional assumptions made for e. Before certifying any of the new estimators other than b6 for actual use in practice, a study must be made of their robustness. Another aspect of the {bi} that needs to be examined is their sensitivity to minor changes in X. When multicollinearity is present to a serious extent, b1 is overly responsive to such changes. It seems unlikely, however, that the proposed estimators will be much less sensitive than b1, because of their heavy functional dependence on it.
Among the many virtues of b1 is the ease of obtaining a best quadratic unbiased estimate of its variance. No means of obtaining "good" estimates of measures of reliability of the new estimators has been presented herein. In view of the fact that the exact small-sample moments of b2, b3, b4, and b5 are unknown, the construction of such estimates is likely to be a difficult analytical problem.
Since the results of the simulation are almost exclusively limited to the p = 2 case, it is necessary to consider the probable effects of the relaxation of this assumption. My guess is that, so long as p does not get "too close" to n, the relative efficiencies Ê2 and Ê3 will behave similarly to what has been discovered when there are only two regressors in the model.
The absence of any knowledge of the values of the {λj} has been an explicit assumption throughout this thesis, because β and σ² are unknown in (1.1). (If there exists such prior knowledge and it is formally considered to be an inherent part of the model, b1 may lose many of its optimum properties.) Hence it is impossible to recommend any single bi over all others, because none of them is uniformly best over all {λj}.
More often, however, the investigator has some idea of the range of the {λj}, though it may be difficult to incorporate such vague information into the estimation procedure. Depending upon his willingness to risk using an inefficient estimator in order to have the opportunity to use a possibly efficient one, he may wish to consider b2 or b6 as an alternative to b1. b2j might be considered if λj is known to be rather small, and b6j merits attention if he is sure that λj is not very large. With the use of b2, one has little to gain but little to lose, while the employment of b6 can lead to appreciable gain or extreme regret. It is difficult to conceive of circumstances where one would wish to use b3, b4 or b5.
A new line of research that this brings to mind is a two-stage estimation procedure wherein we first estimate the {λj} (say with {λ̂j}), and based on these estimates choose some estimator (possibly b1) that is relatively efficient for the {λj} in some neighborhood of the {λ̂j}. A convenient estimator is λ̂j = b1j²/σ̂²s^jj, which is distributed as a noncentral F variate with 1 and n - p degrees of freedom and noncentrality parameter λj.

Consider, for instance, the following outline of a two-stage procedure utilizing b6, which supposes that our objective is to use b6j rather than b1j subject to the guarantee that E6j ≤ 1 with probability at least 1 - α, for some preassigned α ∈ (0, 1). First choose λ̂j0 such that

    Pr_λj [λj ≤ λ̂j0] ≥ 1 - α .11    (4.2)

Then choose qj just large enough so that E6j(qj) = 1 when λj = λ̂j0; viz.: qj = (λ̂j0 - 1)/(λ̂j0 + 1).
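The choice of qj in this outline can be verified numerically. In the sketch below, λ̂j0 is an arbitrary stand-in for the confidence bound of (4.2), not a value computed from data.

```python
def E6(q, lam):
    """Relative m.s.e. efficiency of b1j to q * b1j, from (3.14)."""
    return q**2 + (1 - q)**2 * lam

def two_stage_q(lam0):
    """q making E6(q, lam0) = 1, per the outline: q = (lam0 - 1)/(lam0 + 1)."""
    return (lam0 - 1) / (lam0 + 1)

lam0 = 4.0             # stand-in for the upper confidence bound on lambda_j
q = two_stage_q(lam0)  # = 0.6
```

With this q, E6 equals 1 exactly at λj = λ̂j0 and is below 1 for all smaller λj, so b6j is no less efficient than b1j whenever the bound holds.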
The prior literature dealing with the subject of this thesis is hardly more encouraging than the results reported here. Of the references surveyed in Chapter 2, only two give one much over which to be optimistic concerning the likelihood of significant future progress in the study of biased estimation of regression coefficients. I am impressed with the finding by James and Stein (1961) that (with their loss function) b1 is an inadmissible estimator for p > 2. But admissibility is not often a crucial property of estimators for the applied statistician, because it is so rare that he cannot (with the knowledge of theoretical considerations underlying the model) place some sort of bounds on the likely ranges of parameters to be estimated. Conversely, inadmissible estimators are not to be hastily abandoned. James and Stein have made no mention of the probable quality of their estimator. As indicated earlier, our results concerning the quality
11There exists a uniformly most powerful test of H0: λj ≤ λj0 vs Ha: λj > λj0 based on the noncentral beta distribution. See Toro-Vizcarrondo and Wallace (1968, p. 564) and Toro-Vizcarrondo (1968) for a full discussion. It follows from Lehmann (1966, pp. 68, 80) that there is a uniformly most accurate confidence bound for λj of the form indicated in (4.2).
of b2 give rise to an educated conjecture that the improvement of (2.1) over b1 will often be insignificant. Moreover, for the reasons given in Section 1.4, the applicability of a weighted sum of mean square errors loss function is often highly doubtful. It is hoped that future work along these lines by mathematical statisticians will be somewhat more considerate of the needs of experimental researchers, not the least of which is a loss structure of form (1.8) rather than (1.6).

While employing the loss structure (1.7), Hoerl and Kennard (1970a, b) have taken a fresh, novel approach to the whole problem, which for several examples they present has been an unqualified success. The question of the stability of b1 in the face of small changes in the data is in this context equivalent to the problem of large variances. It remains to be seen how much more stability can be achieved without adding large biases to the individual estimators.
The prospects for future major improvements upon Gauss-Markov estimation are not particularly promising. Aside from the ridge regression procedure, the few successes to date are of limited applicability, because they either presuppose much prior knowledge about the {λj} or are improvements to only a negligible extent. I think there is some chance that two-stage estimation procedures of the sort discussed above may yield slightly better estimators than those examined herein, but it should be recognized that ease of computation is one of the virtues of b1, and as we proceed to explore increasingly complex estimators we must begin to consider whether the extra computational effort is justified by the prospective gain in efficiency.
The Gauss-Markov and Rao-Blackwell Theorems are results of remarkable conceptual simplicity. If one must rule out the possibility of bringing additional information to bear, I intuitively feel that the absence of similarly appealing theorems for estimation with a mean square error criterion of goodness signifies that a truly satisfying solution to the problem (as cast in this thesis) will never be attained.

5. LIST OF REFERENCES
Bancroft, T. A. 1944. On biases in estimation due to the use of preliminary tests of significance. Annals of Mathematical Statistics 15:190-204.

Baranchik, A. J. 1964. Multiple regression and estimation of the mean of a multivariate normal distribution. Technical Report No. 51, Department of Statistics, Stanford University, Stanford, California.

Baranchik, A. J. 1970. A family of minimax estimators of the mean of a multivariate normal distribution. Annals of Mathematical Statistics 41:642-645.

Bhattacharya, P. K. 1966. Estimating the mean of a multivariate normal population with general quadratic loss function. Annals of Mathematical Statistics 37:1819-1824.

Bodewig, E. 1956. Matrix Calculus. North Holland Publishing Co., Amsterdam.

Cramér, H. 1963. Mathematical Methods of Statistics. Princeton University Press, Princeton, New Jersey.

Farrar, D. E., and Glauber, R. R. 1967. Multicollinearity in regression analysis: the problem revisited. Review of Economics and Statistics 49:92-107.

Fraser, D. A. S. 1966. Nonparametric Methods in Statistics. John Wiley and Sons, Inc., New York, New York.

Graybill, F. A. 1961. An Introduction to Linear Statistical Models, Vol. 1. McGraw-Hill Book Co., Inc., New York, New York.

Hoerl, A. E., and Kennard, R. W. 1970a. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55-67.

Hoerl, A. E., and Kennard, R. W. 1970b. Ridge regression: applications to nonorthogonal problems. Technometrics 12:69-82.

James, W., and Stein, C. 1961. Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1:361-379. University of California Press, Berkeley and Los Angeles.

Kendall, M. G., and Stuart, A. 1967. The Advanced Theory of Statistics, Vol. II. Hafner Publishing Co., New York, New York.

Lehmann, E. L. 1966. Testing Statistical Hypotheses. John Wiley and Sons, Inc., New York, New York.

Loève, M. 1963. Probability Theory. D. Van Nostrand Co., Inc., Princeton, New Jersey.

Malinvaud, E. 1966. Statistical Methods of Econometrics. Rand McNally and Co., Inc., Chicago, Illinois.

Rao, C. R. 1965. Linear Statistical Inference and Its Applications. John Wiley and Sons, Inc., New York, New York.

Sclove, S. L. 1966. Improved estimation of regression parameters. Technical Report No. 125, Department of Statistics, Stanford University, Stanford, California.

Sclove, S. L. 1968. Improved estimation for coefficients in linear regression. Journal of the American Statistical Association 63:596-606.

Stein, C. 1956. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability 1:197-206. University of California Press, Berkeley and Los Angeles.

Toro-Vizcarrondo, C. 1968. Multicollinearity and the mean square error criterion in multiple regression: a test and some sequential estimator comparisons. Unpublished Ph.D. thesis, Department of Experimental Statistics, North Carolina State University at Raleigh. University Microfilms, Ann Arbor, Michigan.

Toro-Vizcarrondo, C., and Wallace, T. D. 1968. A test of the mean square error criterion for restrictions in linear regression. Journal of the American Statistical Association 63:558-572.
6. APPENDIX: THE SIMULATION DESIGN AND PROGRAM

A simulation experiment was used to compute the estimated relative efficiencies appearing in Table 3.1.12 As explained in Section 3.3, the computations in the table are based on the assumptions that the random errors are normally distributed and n = 25.
The input for the simulation consists of the full rank matrix X (n × p), the parameter vector β, and σ². The program generates the n random N1(0, σ²) disturbances which comprise e, and computes the vector Y = Xβ + e. Then, pretending that we do not know β and σ², it calculates from X and Y the values of the estimators b1, b2, b3, b4 and b5 (with the exception that the calculation of b5 is omitted if p > 2). This operation is repeated with a new random e for a total of N iterations, and estimates

    est. m.s.e.(bij; βj) = ave (bij - βj)²    (6.1)

are computed for i = 1, 2, 3, 4, 5 and j = 1, 2, ..., p. In (6.1), "ave" refers to the average value over iterations. Finally, the relative efficiencies are estimated according to

    Êi = est. m.s.e.(bij; βj) / est. m.s.e.(b1j; βj) .    (6.2)

While m.s.e.(b1j) is known to be equal to σ²s^jj, the estimate rather

12I am grateful to Mr. James Goodnight, Department of Experimental Statistics, North Carolina State University at Raleigh, for writing the computer program used in this study.
than the population value was used in the denominator of (6.2) to check the effect of any systematicality that might have been present in the nN generated errors.

Clearly, the input quantities were not chosen haphazardly. The major task in the design of the simulation was to answer the question, "In what respects can X, β, and σ² be selected without loss of generality?" It was determined that, for p = 2, they can be taken arbitrarily subject to their leading to the desired values of the variables r, λ1, and λ2 defined in Section 3.3. To describe the behavior of an Êi, we estimate it for a number of configurations of the quantities upon which the estimate depends.
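The design just described can be sketched in a few lines. The code below follows (6.1) and (6.2) for b1 and b2 only, with an arbitrary X and seed (the actual program, inputs, and random numbers of the study are of course not reproduced here); b2 is formed with the modifying factor quoted in Section 3.4, and β is set to the null vector so that λ1 = λ2 = 0, the case where b2's shrinkage helps most.

```python
import numpy as np

rng = np.random.default_rng(693)   # arbitrary seed (693 echoes the mimeo number)
n, p, N = 25, 2, 500
X = rng.standard_normal((n, p))    # hypothetical design matrix
beta = np.zeros(p)                 # null vector: lambda_1 = lambda_2 = 0
sigma = 1.0

sq_err = np.zeros((2, p))          # accumulated squared errors: row 0 for b1, row 1 for b2
for _ in range(N):
    e = sigma * rng.standard_normal(n)
    y = X @ beta + e
    b1 = np.linalg.solve(X.T @ X, X.T @ y)        # Gauss-Markov estimate
    s2 = float((y - X @ b1) @ (y - X @ b1)) / (n - p)
    reg_ss = float(b1 @ X.T @ y)                  # b'X'Y, the regression sum of squares
    f = reg_ss / (reg_ss + s2)                    # b2's modifying factor
    b2 = f * b1
    sq_err[0] += (b1 - beta) ** 2
    sq_err[1] += (b2 - beta) ** 2

mse = sq_err / N                   # (6.1): ave (b_ij - beta_j)^2
E2 = mse[1] / mse[0]               # (6.2): estimated relative efficiency, per coordinate
```

Since the factor lies strictly between 0 and 1 on every iteration and β = 0 here, each coordinate of Ê2 comes out below 1, in line with the small-λ entries of Table 3.1.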
Equations (3.24) were arrived at through what was essentially a trial and error procedure. For example, Ê3 was unaffected by a change in r while Ê2 was not, but doubling each of βj, σ², and s^jj (keeping r constant) left Ê2 invariant.
Given a finite amount of available computer time, it was necessary to choose a rather limited number of the r, λ1, and λ2. Four r's were chosen: .3, .7, .98, and -.7. These are, roughly speaking, a low, average, high and average negative correlation, respectively. The 7 or 8 values of the "more important λ" were deemed sufficient to give a good indication of the functional relationship under consideration.

In practice, σ² was set equal to 1. Next, X was conveniently chosen subject to its yielding the desired r. The choice of X fixed the s^jj. Then the βj were selected so as to give the desired λj = βj²/σ²s^jj, j = 1, 2.
Another major problem that had to be tackled was the method of choice of the number of iterations, N. A large number of iterations was needed to stabilize the sample estimates (6.2), but the computer time involved was roughly in proportion to N. To check on their stabilization, the cumulative efficiencies Êi were printed out at intervals of 100 iterations. It was found that by taking N = 500, Table 3.1 could be constructed to the degree of accuracy indicated in Section 3.3. This was thought to be adequate in view of the goal of the simulation, which was merely to make a comparison between estimators and not the formal tabulation of moments. The Êi in the table were informally (but not casually) obtained from a careful examination of the results at the end of 300, 400, and 500 iterations. As an illustration of the procedure employed, we consider two examples.
For computing Ê3 with λj = 1 and r = .7, the estimates after 300, 400 and 500 iterations were .786, .768, and .773 respectively. Thus .77 was employed in Table 3.1. In Section 3.3, an accuracy of ±.02 was claimed for the {Êi}; i.e., that E3 lies between .75 and .79. This assumption seems fairly safe in view of the stepwise estimates obtained above. Next consider the computation of Ê2 with λj = λt = 10 and r = .98. Here the values of Ê2 after 300, 400 and 500 iterations were .9979, .9983, and .9982 respectively. Thus the value .998 was used for Table 3.1. It is even clearer in this case that we have ±.02 accuracy for our estimate; it is not unlikely that the true accuracy is as fine as ±.001.
It would have been preferable to choose N according to some stopping procedure built into the program; this would have assured that the Êi are measured with approximately equal precision. But it was felt that such an inordinate complication of the program would not significantly enhance the quality of the study.
The same nN = 12,500 random N1(0, 1) numbers were used (in the same order) for each entry in Table 3.1 in order to insure the ceteris paribus nature of the measurement of the effect of a change in estimator, r, or λ's.
For each entry in the table, the corresponding estimates were made of "the proportion of m.s.e. attributable to squared bias" by computing the ratios bias²(bij; βj)/m.s.e.(bij; βj), i = 1, 2, 3, 4, 5, where bias(bij) = ave (bij) - βj.
The construction of Table 3.1 utilized approximately 30 minutes of time on an I.B.M. 360-75 computer. This excludes all time consumed in the design of the simulation and ancillary experiments.
Generalization of these results to p > 2 is likely to present the investigator with formidable problems of simulation design, for it is conceivable that some Êi might depend upon as many as p(p - 1)/2 r's and 2p - 2 λ's.