
James Stein Estimator

Tony Ke

January 29, 2012

1 James-Stein Estimator

Before getting to the James-Stein estimator – "one of the most important statistical ideas of the decade (of the sixties)" [8] – we first need to sharpen our memory on some basic but important statistical concepts.

1.1 Estimator

Sometimes referred to as a decision rule, an estimator is a rule for calculating an estimate of an unknown quantity based on observed data. We use $\theta \in \Theta \subseteq \mathbb{R}^n$ to denote the unknown parameter vector, $X \in \mathcal{X} \subseteq \mathbb{R}^n$ to denote the random vector, and correspondingly $x$ as the realized data point.¹ Then an estimator is a mapping $\hat{\theta}$ from the data space $\mathcal{X}$ to the parameter space $\Theta$.

Example 1.1. (Ordinary Estimator) For $X \sim \mathcal{N}(\theta, \sigma^2 I_n)$,
\[
\hat{\theta}(x) = x. \tag{1.1}
\]

This is a very straightforward estimator: we perform one measurement for each unknown quantity and use the measurement result as the estimate of that quantity. Surprisingly, however, we will show that this ordinary estimator is not the "best" estimator in a certain sense.

* Some definitions and examples in this note are excerpted and modified from wikipedia.org, though not explicitly cited in the text.

¹ For the sake of conceptual simplicity, in the following discussion we often talk about a data set $X$ that consists of only one data point. The generalization is straightforward.

Prepared by Tony Ke for UC Berkeley MFE 230K class. All rights reserved.

1.2 Risk Function

How do we measure the goodness of an estimator? Intuitively, an estimator is good if it is close to the unknown parameter of interest, or, equivalently, if the estimation error is small. We use a loss function $L(\theta, \hat{\theta})$ to characterize the estimation error.

Example 1.2. (Quadratic Loss)
\[
L(\theta, \hat{\theta}(x)) = |\hat{\theta}(x) - \theta|^2, \tag{1.2}
\]
where $|\cdot|$ is the Euclidean norm.

One should notice that the loss function is data-specific. In most cases we instead want an overall judgement of the estimation quality of an estimator. This leads to the risk function $R(\theta, \hat{\theta})$, which is defined as the expected value of the loss function

\[
R(\theta, \hat{\theta}) = \mathbb{E}_\theta\, L(\theta, \hat{\theta}(X)), \tag{1.3}
\]
where $\mathbb{E}_\theta$ is the expectation over the population distribution of $X$.

Example 1.3. (Mean Squared Error) The mean squared error risk corresponds to a quadratic loss function,
\[
R(\theta, \hat{\theta}) = \mathbb{E}_\theta |\hat{\theta}(X) - \theta|^2. \tag{1.4}
\]
For the ordinary estimator $\hat{\theta}(X) = X$ under $X \sim \mathcal{N}(\theta, \sigma^2 I_n)$, $R(\theta, \hat{\theta}) = \mathbb{E}_\theta|X - \theta|^2 = n\sigma^2$. (Verify it by yourself!)
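As a quick sanity check, this risk can also be verified by simulation. The sketch below (illustrative values of $n$, $\sigma$, and $\theta$ of our own choosing, using only NumPy) estimates $\mathbb{E}_\theta|X - \theta|^2$ by Monte Carlo and compares it with $n\sigma^2$.

```python
import numpy as np

# Monte Carlo check that the ordinary estimator theta_hat(X) = X has
# mean squared error risk n * sigma^2 (parameter values are illustrative).
rng = np.random.default_rng(0)
n, sigma = 5, 2.0
theta = rng.normal(size=n)                            # arbitrary "true" parameters
reps = 100_000

X = theta + sigma * rng.standard_normal((reps, n))    # X ~ N(theta, sigma^2 I_n)
risk_mc = np.mean(np.sum((X - theta) ** 2, axis=1))   # estimate of E|X - theta|^2

print(risk_mc, n * sigma**2)                          # the two numbers should be close
```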

1.3 Admissible Decision Rule

Having introduced the risk function to compare the goodness of any two estimators, one natural question is how to define the "best" estimator, which turns out to be a non-trivial question.

An admissible decision rule is defined as a rule for making a decision such that there is no other rule that is always "better" than it. Why do we need to specify "always"? Because the parameter is unknown, a decision rule can perform well for some underlying parameters and poorly for others. Mathematically speaking, we say $\hat{\theta}^*$ is an admissible decision rule if
\[
\nexists\, \hat{\theta} \text{ s.t. } R(\theta, \hat{\theta}) \le R(\theta, \hat{\theta}^*) \text{ for all } \theta \in \Theta, \text{ and } R(\theta, \hat{\theta}) < R(\theta, \hat{\theta}^*) \text{ for some } \theta \in \Theta.
\]


1.4 Stein’s Paradox

Now we are ready to present Stein's amazing discovery: the ordinary estimator for the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk in dimension at least three.

Example 1.4. (Stein's Example) For $X \sim \mathcal{N}(\theta, \sigma^2 I_n)$, let us consider the following estimator,
\[
\hat{\theta}^S(x) = x - \frac{(n-2)\sigma^2}{|x|^2}\, x, \tag{1.5}
\]
which we will show to be a better estimator than the ordinary estimator $\hat{\theta}(x) = x$ of Example 1.1.

\[
\begin{aligned}
R(\theta, \hat{\theta}^S) &= \mathbb{E}_\theta\!\left[\,\left|\theta - X + \frac{(n-2)\sigma^2}{|X|^2}\, X\right|^2\right] \\
&= \mathbb{E}_\theta\!\left[\,|\theta - X|^2 + 2(\theta - X)^T \frac{(n-2)\sigma^2}{|X|^2}\, X + \frac{(n-2)^2\sigma^4}{|X|^4}\, |X|^2\right] \\
&= \mathbb{E}_\theta\!\left[\,|\theta - X|^2\right] + 2(n-2)\sigma^2\, \mathbb{E}_\theta\!\left[\frac{(\theta - X)^T X}{|X|^2}\right] + (n-2)^2\sigma^4\, \mathbb{E}_\theta\!\left[\frac{1}{|X|^2}\right] \\
&= n\sigma^2 - (n-2)^2\sigma^4\, \mathbb{E}_\theta\!\left[\frac{1}{|X|^2}\right] \;<\; n\sigma^2. \tag{1.6}
\end{aligned}
\]

The last equality comes from integration by parts (Stein's lemma): it is not hard to show that
\[
\mathbb{E}_\theta\!\left[(\theta_i - X_i)\, h(X)\right] = -\sigma^2\, \mathbb{E}_\theta\!\left[\frac{\partial h}{\partial x_i}(X)\right]
\]
for any "well-behaved" function $h(\cdot)$.
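This identity can also be checked numerically. The sketch below (parameter values of our own choosing) compares both sides for $h(x) = x_1/|x|^2$, the kind of function that appears in the cross term of (1.6).

```python
import numpy as np

# Numerical check of E[(theta_i - X_i) h(X)] = -sigma^2 E[dh/dx_i(X)]
# for h(x) = x_1 / |x|^2 (parameter values are illustrative).
rng = np.random.default_rng(1)
n, sigma = 5, 1.5
theta = np.array([1.0, -0.5, 2.0, 0.3, -1.2])
reps = 500_000

X = theta + sigma * rng.standard_normal((reps, n))
norm2 = np.sum(X**2, axis=1)

h = X[:, 0] / norm2                                # h(X)
dh_dx1 = (norm2 - 2 * X[:, 0] ** 2) / norm2**2     # dh/dx_1 evaluated at X

lhs = np.mean((theta[0] - X[:, 0]) * h)
rhs = -sigma**2 * np.mean(dh_dx1)
print(lhs, rhs)                                    # agree up to Monte Carlo error
```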

Stein’s example shows that for estimation of mean of multi-variate Gaus-sian distribution, Stein’s estimator (1.5) is better than the ordinary estimatorin that its risk function is smaller. As a quirky example, we measure the speedof light (✓1), tea consumption in Taiwan (✓2), and hog weight in Montana(✓3), and observe data point x = (x1, x2, x3). Estimates ✓̂

i

(xi

) = x

i

based onindividual measurement is worse than the one based on measurements on allthree quantities together ✓̂

i

(x) =⇣1� 1

|x|2

⌘x

i

, though the three quantities

have nothing to do with each other.The Stein estimator always improves upon the total mean squared error

risk, i.e., the sum of the expected errors of each component. Therefore, the

3

Page 4: James Stein Estimator

total mean squared error in measuring light speed, tea consumption, and hogweight would improve by using the Stein estimator. However, any particularcomponent (such as the speed of light) would improve for some parametervalues, and deteriorate for others. Thus, although the Stein estimator dom-inates the ordinary estimator when three or more parameters are estimated,any single component does not dominate the respective component of theordinary estimator.
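Both points can be illustrated by simulation. The sketch below uses a hypothetical mean vector of our own choosing, with one outlying component, so that the componentwise deterioration is visible while the total risk still improves.

```python
import numpy as np

# Total vs. per-component mean squared error of the ordinary estimator x
# and Stein's estimator (1.5). Hypothetical setup: one outlying mean.
rng = np.random.default_rng(2)
n, sigma = 10, 1.0
theta = np.zeros(n)
theta[0] = 4.0                                     # outlying component
reps = 200_000

X = theta + sigma * rng.standard_normal((reps, n))
shrink = (n - 2) * sigma**2 / np.sum(X**2, axis=1, keepdims=True)
X_js = (1.0 - shrink) * X                          # Stein's estimator, per draw

mse_ord = ((X - theta) ** 2).mean(axis=0)
mse_js = ((X_js - theta) ** 2).mean(axis=0)

print("total:", mse_ord.sum(), "vs", mse_js.sum())  # Stein wins in total
print("component 1:", mse_ord[0], "vs", mse_js[0])  # but loses on the outlier
```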

It’s also worthwhile to point out that validity of Stein’s paradox doesn’tdepend on the quadratic form of the loss function. Though quadratic lossfunction can approximate any well-behaved general functions by Tyler ex-pansion. Brown has extended the conclusion of inadmissibility to fairly weakloss function conditions [1, 2, 3].

1.5 Stein’s Original Argument

It is not intuitively clear why the Stein estimator dominates the ordinary one. Stein's original argument [10] is based on a comparison of $\theta^T\theta$ with $x^T x$ when $n$ is large and proceeds as follows. (This note is based on [9].) Intuitively, a good estimate $\hat{\theta}$ should satisfy $\hat{\theta}_i \approx \theta_i$ for $i = 1, 2, \cdots, n$, which implies $\hat{\theta}_i^2 \approx \theta_i^2$ for $i = 1, 2, \cdots, n$, and thus
\[
\hat{\theta}^T \hat{\theta} \approx \theta^T\theta. \tag{1.7}
\]

We would hope that a chosen estimator satisfies this condition. For $x \sim \mathcal{N}(\theta, \sigma^2 I_n)$, the ordinary estimator is $\hat{\theta}(x) = x$. Let $y = \frac{1}{n}\, x^T x = \frac{1}{n}\sum_{i=1}^n x_i^2$, so that $\mathbb{E}[y] = \sigma^2 + \frac{1}{n}\,\theta^T\theta$. The law of large numbers implies²
\[
\frac{1}{n}\, x^T x \;\to\; \sigma^2 + \frac{1}{n}\,\theta^T\theta, \quad \text{as } n \to \infty. \tag{1.8}
\]
In other words, for large $n$, it is very likely that $x^T x$ exceeds $\theta^T\theta$. This suggests that to form a good estimator of $\theta$, we would need to shrink the ordinary estimator toward $0$, which is exactly what Stein's estimator does. That is also the reason why Stein's estimator is also called a shrinkage estimator.

² Notice that the $x_i$, $i = 1, 2, \cdots, n$, are not identically distributed, but the law of large numbers still holds, which can be proved by Chebyshev's inequality.
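A quick numerical illustration of (1.8), with a hypothetical $\sigma$ and randomly chosen $\theta$:

```python
import numpy as np

# Illustrate that x^T x / n concentrates around sigma^2 + theta^T theta / n
# as n grows (sigma and theta chosen arbitrarily for illustration).
rng = np.random.default_rng(3)
sigma = 1.5
for n in (10, 100, 10_000):
    theta = rng.uniform(-1, 1, size=n)
    x = theta + sigma * rng.standard_normal(n)
    print(n, x @ x / n, sigma**2 + theta @ theta / n)   # the two values converge
```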


1.6 Empirical Bayes Perspective

Another argument for Stein's estimator is based on the Bayesian formulation [4, 9]. Suppose the prior distribution on $\theta$ is
\[
\theta \sim \mathcal{N}(0, \tau^2 I_n). \tag{1.9}
\]

Then for $x \sim \mathcal{N}(\theta, \sigma^2 I_n)$, the posterior distribution of $\theta$ is
\[
\begin{aligned}
p(\theta \mid x) &\propto p(x \mid \theta)\, p(\theta) \\
&\propto \exp\!\left(-\frac{1}{2\sigma^2}\,(x - \theta)^T (x - \theta)\right) \cdot \exp\!\left(-\frac{1}{2\tau^2}\,\theta^T\theta\right) \\
&\propto \exp\!\left\{-\frac{1}{2\,\frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}}\left|\,\theta - \left(1 - \frac{\sigma^2}{\sigma^2 + \tau^2}\right) x\,\right|^2\right\}. \tag{1.10}
\end{aligned}
\]

So $\theta \mid x \sim \mathcal{N}\!\left(\left(1 - \frac{\sigma^2}{\sigma^2 + \tau^2}\right) x,\; \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\, I_n\right)$. The Bayes least-squares estimate of $\theta$ is
\[
\hat{\theta}^B = \left(1 - \frac{\sigma^2}{\sigma^2 + \tau^2}\right) x. \tag{1.11}
\]

$\hat{\theta}^B$ achieves the smallest risk for any observed $x$ under the quadratic loss function. One should notice that the definition of the risk function from the Bayesian perspective differs from that from a frequentist's perspective: in the Bayesian framework, we take the expectation of the loss function over the posterior distribution of $\theta$, while in the frequentist framework, we take the expectation over the population distribution of $x$. Instead of an admissible rule, we usually call the risk-minimizing rule a Bayes rule in the Bayesian framework.

The expression for $\hat{\theta}^B$ in equation (1.11) is intended to evoke the form of Stein's estimator. In fact, instead of specifying $\tau^2$ from outside, we can estimate the prior from the data and then apply the Bayesian framework. This approach is known as empirical Bayes estimation. It can be shown that $\frac{(n-2)\sigma^2}{x^T x}$ is an unbiased estimator of $\frac{\sigma^2}{\sigma^2 + \tau^2}$, where the expectation is taken over the marginal distribution $x \sim \mathcal{N}(0, (\sigma^2 + \tau^2) I_n)$. By substituting this estimator back into (1.11), Stein's estimator (1.5) is obtained.
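A minimal sketch of this empirical Bayes argument, with hypothetical values of $n$, $\sigma$, and $\tau$: the data-based shrinkage factor $(n-2)\sigma^2/x^T x$ matches $\sigma^2/(\sigma^2 + \tau^2)$ on average, and plugging it into (1.11) reproduces Stein's estimator (1.5).

```python
import numpy as np

# Empirical Bayes sketch: estimate the shrinkage factor sigma^2/(sigma^2+tau^2)
# by (n-2) sigma^2 / (x^T x) and plug it into the Bayes estimator (1.11).
rng = np.random.default_rng(4)
n, sigma, tau = 20, 1.0, 2.0                           # hypothetical values
reps = 100_000

theta = tau * rng.standard_normal((reps, n))           # theta ~ N(0, tau^2 I_n)
x = theta + sigma * rng.standard_normal((reps, n))     # x | theta ~ N(theta, sigma^2 I_n)

shrink_hat = (n - 2) * sigma**2 / np.sum(x**2, axis=1)
print(shrink_hat.mean(), sigma**2 / (sigma**2 + tau**2))   # approximately equal

theta_eb = (1.0 - shrink_hat)[:, None] * x             # empirical Bayes = Stein (1.5)
```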

1.7 James-Stein Estimator

Stein’s idea has been completed and improved later, notably by James [6],Efron and Morris [5]. Let’s consider Stein’s idea in a more general case.


For $x_t \sim \mathcal{N}(\theta, \Sigma)$, $t = 1, 2, \cdots, T$, we can similarly show that the ordinary maximum-likelihood estimator $\hat{\theta}^{ML} = \bar{x} \equiv \frac{1}{T}\sum_{t=1}^T x_t$ is inadmissible. The following so-called James-Stein estimator is a better estimator for $n \ge 3$:
\[
\hat{\theta}^{JS} = (1 - k)\,\bar{x} + k\, x_0 \mathbf{1}, \tag{1.12}
\]
where $x_0$ is an arbitrary number and $k$ is defined as
\[
k = \frac{(n-2)/T}{(\bar{x} - x_0\mathbf{1})^T\, \Sigma^{-1}\, (\bar{x} - x_0\mathbf{1})}. \tag{1.13}
\]
We find that the James-Stein estimator can shrink not only toward $0$ but also toward an arbitrary point $x_0\mathbf{1}$. For $\Sigma = \sigma^2 I_n$, $T = 1$, and $x_0 = 0$, (1.12) reduces to Stein's estimator (1.5).
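A sketch implementation of (1.12)–(1.13), assuming a known $\Sigma$ and a chosen scalar $x_0$ (the function and variable names below are ours, introduced only for illustration):

```python
import numpy as np

def james_stein(x_bar, Sigma, T, x0=0.0):
    """James-Stein estimator (1.12)-(1.13): shrink the sample mean x_bar of
    T observations x_t ~ N(theta, Sigma) toward the point x0 * 1."""
    n = x_bar.shape[0]
    d = x_bar - x0 * np.ones(n)
    k = ((n - 2) / T) / (d @ np.linalg.solve(Sigma, d))
    return (1.0 - k) * x_bar + k * x0 * np.ones(n)

# Example: with Sigma = sigma^2 I, T = 1 and x0 = 0 this reduces to (1.5).
rng = np.random.default_rng(5)
sigma, n = 1.0, 5
x = rng.normal(size=n)
print(james_stein(x, sigma**2 * np.eye(n), T=1, x0=0.0))
print(x - (n - 2) * sigma**2 / (x @ x) * x)        # same values
```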

2 Jorion 1986 Paper

The James-Stein method improves accuracy when estimating more than two quantities together. This makes it a natural fit for estimating the expected returns of multiple assets in a portfolio.

2.1 Framework

Jorion (1986) considered parameter uncertainty in the portfolio optimization problem [7]. He cares more about the uncertainty in the mean return than about the uncertainty in the return variance-covariance matrix. As pointed out by Prof. Leland in class, there are two rationales behind this: (1) the optimal portfolio allocation is very sensitive to changes in the mean; (2) the estimation accuracy of the variance-covariance matrix can be refined by using data at a finer time scale, such as high-frequency data.

An empirical Bayes method is applied, with the prior on the mean of asset returns given by
\[
p(\mu \mid V, \lambda, Y_0) \;\propto\; \exp\!\left(-\frac{1}{2}\,(\mu - Y_0\mathbf{1})^T\, \lambda V^{-1}\, (\mu - Y_0\mathbf{1})\right). \tag{2.1}
\]

By repeating a procedure similar to that of Section 1.6, we obtain the optimal estimator as a James-Stein estimator,
\[
\hat{\mu}^B(R) = (1 - k)\,\hat{\mu}^{ML}(R) + k\,\mu^{\min}(R)\,\mathbf{1}, \tag{2.2}
\]


where $R$ represents all the asset return data, $\hat{\mu}^{ML}(R) = \bar{x}$ is the ordinary maximum-likelihood estimate of the mean, and
\[
k = \frac{\lambda}{\lambda + T}, \tag{2.3}
\]
\[
Y_0 = \frac{\mathbf{1}^T V^{-1}}{\mathbf{1}^T V^{-1}\mathbf{1}}\, \hat{\mu}^{ML}(R). \tag{2.4}
\]

$Y_0$ happens to be the average return of the minimum-variance portfolio. One can verify that the allocation weights $\frac{\mathbf{1}^T V^{-1}}{\mathbf{1}^T V^{-1}\mathbf{1}}$ minimize the variance of the portfolio subject to the condition that they sum to one.
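A quick numerical check of this claim, using a toy covariance matrix of our own choosing: the minimum-variance weights have smaller variance than any perturbed weights that still sum to one.

```python
import numpy as np

# Check that w = V^{-1} 1 / (1^T V^{-1} 1) minimizes portfolio variance
# w^T V w among weight vectors summing to one (toy covariance matrix).
rng = np.random.default_rng(6)
A = rng.normal(size=(4, 4))
V = A @ A.T + 4 * np.eye(4)                  # a positive-definite covariance matrix
ones = np.ones(4)

w_min = np.linalg.solve(V, ones)
w_min /= ones @ w_min                        # minimum-variance weights, sum to one

for _ in range(5):
    z = rng.normal(size=4)
    w = w_min + z - (z @ ones) / 4 * ones    # perturbation that keeps the sum at one
    print(w @ V @ w >= w_min @ V @ w_min)    # prints True every time
```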

We can further estimate the shrinkage coefficient from the data,
\[
\hat{k} = \frac{n + 2}{(n + 2) + (\hat{\mu}^{ML} - \mu^{\min}(R)\mathbf{1})^T\, T V^{-1}\, (\hat{\mu}^{ML} - \mu^{\min}(R)\mathbf{1})}, \tag{2.5}
\]
which, as one might have recognized, is exactly a James-Stein estimator.³ In practice, $V$ is unknown and can be replaced by

\[
\hat{V} = \frac{T - 1}{T - n - 2}\, S, \tag{2.6}
\]
where $S$ is the usual unbiased sample covariance matrix.
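A sketch of the full Bayes-Stein estimate of expected returns, combining (2.2)–(2.6); the function name, variable names, and the toy data below are ours, not Jorion's.

```python
import numpy as np

def bayes_stein_mean(R):
    """Bayes-Stein estimate (2.2)-(2.6) of expected returns from a T x n matrix R
    of asset returns: shrink the sample mean toward the average return of the
    minimum-variance portfolio."""
    T, n = R.shape
    mu_ml = R.mean(axis=0)
    S = np.cov(R, rowvar=False, ddof=1)                       # unbiased sample covariance
    V_hat = (T - 1) / (T - n - 2) * S                         # (2.6)
    Vinv = np.linalg.inv(V_hat)
    ones = np.ones(n)
    mu_min = (ones @ Vinv @ mu_ml) / (ones @ Vinv @ ones)     # (2.4), min-variance mean
    d = mu_ml - mu_min * ones
    k = (n + 2) / ((n + 2) + T * (d @ Vinv @ d))              # (2.5)
    return (1 - k) * mu_ml + k * mu_min * ones                # (2.2)

# Toy usage with simulated monthly returns (purely illustrative numbers).
rng = np.random.default_rng(7)
T, n = 60, 7
R = rng.multivariate_normal(mean=0.01 * np.arange(1, n + 1),
                            cov=0.002 * np.eye(n), size=T)
print(bayes_stein_mean(R))
```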

2.2 Example

Figure 1 illustrates sample estimates from stock market returns for seven major countries, calculated over a 60-month period. The underlying parameters $\mu$ and $\Sigma$ were chosen equal to the estimates reported in Figure 1. Then $T$ independent vectors of returns were generated from this distribution, and the following estimators were computed:

1. Certainty Equivalence: classical mean-variance optimization

2. Bayes Diffuse Prior: Klein and Bawa (1976) uninformative prior

3. Minimum Variance: $\lambda \to \infty$ and $k = 1$

4. Bayes-Stein estimator

The results are shown in Figure 2. We can see that the Bayes-Stein estimator beats all the others in estimation accuracy for relatively large sample sizes, $T \ge 50$.

³ Here $n$ is the number of assets in the portfolio, represented by $J$ in the lecture notes.


Figure 1: Excerpted from the Jorion (1986) paper. Dollar returns in percent per month. The sample period is Jan 1977 to Dec 1981.

Figure 2: Excerpted from the Jorion (1986) paper. $F_{\max}$ is the investor's utility function calculated from the true underlying parameters, which is the theoretical maximum utility that can be achieved. $\bar{F}_i$ is the investor's utility function when she/he adopts the corresponding estimator calculated from the simulation samples. The y-axis on the left shows the relative difference of the utility functions, which directly characterizes the goodness of estimation.


References

[1] L. Brown. On the admissibility of invariant estimators of one or more location parameters. The Annals of Mathematical Statistics, 37(5):1087–1136, 1966.

[2] L. Brown. Estimation with incompletely specified loss functions (the case of several location parameters). Journal of the American Statistical Association, 70(350):417–427, 1975.

[3] L. Brown. A heuristic method for determining admissibility of estimators, with applications. The Annals of Statistics, pages 960–994, 1979.

[4] B. Efron and C. Morris. Limiting the risk of Bayes and empirical Bayes estimators, Part II: The empirical Bayes case. Journal of the American Statistical Association, 67(337):130–139, 1972.

[5] B. Efron and C. Morris. Stein's estimation rule and its competitors: An empirical Bayes approach. Journal of the American Statistical Association, 68(341):117–130, 1973.

[6] W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379, 1961.

[7] P. Jorion. Bayes-Stein estimation for portfolio analysis. Journal of Financial and Quantitative Analysis, 21(3):279–292, 1986.

[8] D. Lindley. Discussion on Professor Stein's paper. Journal of the Royal Statistical Society, Series B, 24:285–287, 1962.

[9] J. Richards. An introduction to James-Stein estimation, 1999.

[10] C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206, 1956.
