Bauer College of Business · 2014-08-22
Chapter 7 Estimation
Criteria for Estimators
• Main problem in stats: Estimation of population parameters, say θ. • Recall that an estimator is a statistic. That is, any function of the observations {xi} with values in the parameter space.
• There are many estimators of θ. Question: Which is better? • Criteria for Estimators (1) Unbiasedness (2) Efficiency (3) Sufficiency (4) Consistency
Unbiasedness
Definition: Unbiasedness
An unbiased estimator, say θ̂, has an expected value equal to the value of the population parameter being estimated, say θ. That is, E[θ̂] = θ.
Examples: E[x̄] = µ, E[s²] = σ².
Efficiency

Definition: Efficiency / Mean squared error
An estimator is efficient if it estimates the parameter of interest in some best way. The notion of “best way” relies upon the choice of a loss function. The usual choice of loss function is the quadratic, ℓ(e) = e², resulting in the mean squared error (MSE) criterion of optimality:

MSE(θ̂) = E[(θ̂ − θ)²] = E[((θ̂ − E[θ̂]) + (E[θ̂] − θ))²] = Var(θ̂) + [b(θ)]²

where b(θ) = E[θ̂] − θ is the bias in θ̂. The MSE is the sum of the variance and the square of the bias => trade-off: a biased estimator can have a lower MSE than an unbiased estimator.
Note: The most efficient estimator among a group of unbiased estimators is the one with the smallest variance => BUE (best unbiased estimator).
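The bias-variance trade-off in the MSE can be checked by simulation. A sketch (not part of the slides; sample size, replication count, and seed are illustrative): for N(0, 1) data, s² (divide by n−1) is unbiased, while the ML-style variance estimator (divide by n) is biased yet attains a lower MSE.

```python
import numpy as np

# Monte Carlo comparison of two variance estimators for N(0, 1) data.
rng = np.random.default_rng(42)
n, reps = 10, 20000
samples = rng.normal(loc=0.0, scale=1.0, size=(reps, n))

s2 = samples.var(axis=1, ddof=1)      # unbiased: divide by n-1
s2_ml = samples.var(axis=1, ddof=0)   # biased:   divide by n

# true parameter is sigma^2 = 1
mse_unbiased = np.mean((s2 - 1.0) ** 2)
mse_biased = np.mean((s2_ml - 1.0) ** 2)
print(mse_unbiased, mse_biased)       # the biased estimator has lower MSE
```

Analytically, MSE(s²) = 2/(n−1) ≈ 0.222 here, while the divide-by-n estimator has MSE (2n−1)/n² = 0.19, so the simulation should show the biased estimator winning.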
Now we can compare estimators and select the “best” one.
Example: Three different estimators’ distributions (1, 2, 3, based on samples of the same size):
– 1 and 2: expected value = population parameter (unbiased)
– 3: positively biased
– Variance decreases from 1, to 2, to 3 (3 has the smallest)
– 3 can have the smallest MSE. 2 is more efficient than 1.
[Figure: the three sampling distributions plotted against the value of the estimator, centered near θ]
Relative Efficiency

It is difficult to prove that an estimator is the best among all estimators, so a relative concept is usually used.
Definition: Relative efficiency

Relative Efficiency = Variance of first estimator / Variance of second estimator

Example: Sample mean vs. sample median (normal data):
Variance of sample mean = σ²/n
Variance of sample median = πσ²/2n
Var[median]/Var[mean] = (πσ²/2n) / (σ²/n) = π/2 ≈ 1.57
The sample median is 1.57 times less efficient than the sample mean.
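The π/2 ≈ 1.57 relative efficiency can be verified by simulation. A sketch, with illustrative sample size, replication count, and seed:

```python
import numpy as np

# Check that for normal data Var(median)/Var(mean) approaches pi/2 ~ 1.57.
rng = np.random.default_rng(0)
n, reps = 501, 20000           # odd n so the median is a single order statistic
x = rng.normal(size=(reps, n))

var_mean = np.var(x.mean(axis=1))
var_median = np.var(np.median(x, axis=1))
rel_eff = var_median / var_mean
print(rel_eff)                 # close to pi/2 ~ 1.5708
```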
Asymptotic Efficiency
• We compare two sample statistics in terms of their variances. The statistic with the smallest variance is called efficient.
• When we look at asymptotic efficiency, we look at the asymptotic variances of two statistics as n grows. Note that if we compare two consistent estimators, both variances eventually go to zero.
Example: Random sampling from the normal distribution:
• The sample mean is asymptotically N[μ, σ²/n]
• The median is asymptotically N[μ, (π/2)σ²/n]
• The mean is asymptotically more efficient.
• Definition: Sufficiency A statistic is sufficient when no other statistic, which can be calculated from the same sample, provides any additional information as to the value of the parameter of interest. Equivalently, we say that conditional on the value of a sufficient statistic for a parameter, the joint probability distribution of the data does not depend on that parameter. That is, if P(X=x|T(X)=t, θ) = P(X=x|T(X)=t) we say that T is a sufficient statistic. • The sufficient statistic contains all the information needed to estimate the population parameter. It is OK to ‘get rid’ of the original data, while keeping only the value of the sufficient statistic.
Sufficiency
• Visualize sufficiency: Consider a Markov chain θ → T(X1, ..., Xn) → {X1, ..., Xn} (although in classical statistics θ is not a RV). Conditioned on the middle part of the chain, the front and back are independent.

Theorem: Let p(x; θ) be the pdf of X and q(t; θ) be the pdf of T(X). Then, T(X) is a sufficient statistic for θ if, for every x in the sample space, the ratio

p(x; θ) / q(T(x); θ)

is constant as a function of θ.

Example: Normal sufficient statistic: Let X1, X2, ..., Xn be i.i.d. N(μ, σ²), where the variance is known. The sample mean, x̄, is the sufficient statistic for μ.
Proof: Let’s start with the joint distribution function:

f(x | μ) = ∏ᵢ₌₁ⁿ (2πσ²)^(−1/2) exp[−(xᵢ − μ)²/(2σ²)] = (2πσ²)^(−n/2) exp[−Σᵢ₌₁ⁿ (xᵢ − μ)²/(2σ²)]
• Next, add and subtract the sample mean:
f(x | μ) = (2πσ²)^(−n/2) exp[−Σᵢ₌₁ⁿ (xᵢ − x̄ + x̄ − μ)²/(2σ²)]
         = (2πσ²)^(−n/2) exp{−[Σᵢ₌₁ⁿ (xᵢ − x̄)² + n(x̄ − μ)²]/(2σ²)}
• Recall that the distribution of the sample mean is
q(T(x) | μ) = (2πσ²/n)^(−1/2) exp[−n(x̄ − μ)²/(2σ²)]
• The ratio of the information in the sample to the information in the statistic becomes independent of μ
f(x | μ) / q(T(x) | μ) = (2πσ²)^(−n/2) exp{−[Σᵢ₌₁ⁿ (xᵢ − x̄)² + n(x̄ − μ)²]/(2σ²)} / {(2πσ²/n)^(−1/2) exp[−n(x̄ − μ)²/(2σ²)]}

= n^(−1/2) (2πσ²)^(−(n−1)/2) exp[−Σᵢ₌₁ⁿ (xᵢ − x̄)²/(2σ²)]
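The independence of this ratio from μ can be checked numerically. A sketch (σ² = 1 and the data values are illustrative assumptions):

```python
import numpy as np

# Numerical check that for N(mu, 1) data the ratio of the sample's joint
# density to the density of the sample mean does not depend on mu.
def joint_pdf(x, mu):
    return np.prod(np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi))

def mean_pdf(xbar, mu, n):
    var = 1.0 / n   # Var(xbar) = sigma^2 / n with sigma^2 = 1
    return np.exp(-(xbar - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = np.array([4.8, 6.1, 7.3, 8.0, 9.2])
ratios = [joint_pdf(x, mu) / mean_pdf(x.mean(), mu, len(x))
          for mu in (0.0, 5.0, 7.5)]
print(ratios)   # identical up to floating-point error
```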
Theorem: Factorization Theorem
Let f(x|θ) denote the joint pdf or pmf of a sample X. A statistic T(X) is a sufficient statistic for θ if and only if there exist functions g(t|θ) and h(x) such that, for all sample points x and all parameter points θ,

f(x|θ) = g(T(x)|θ) h(x)

• Sufficient statistics are not unique. From the factorization theorem it is easy to see that (i) the identity function T(X) = X is a sufficient statistic vector and (ii) if T is a sufficient statistic for θ, then so is any 1-1 function of T. This motivates minimal sufficient statistics.
Definition: Minimal sufficiency
A sufficient statistic T(X) is called a minimal sufficient statistic if, for any other sufficient statistic T′(X), T(X) is a function of T′(X).
Consistency

Definition: Consistency
The estimator converges in probability to the population parameter being estimated as n (the sample size) becomes larger. That is, θ̂ₙ →p θ. We say that θ̂ₙ is a consistent estimator of θ.
Example: x̄ is a consistent estimator of μ (the population mean).
• Q: Does unbiasedness imply consistency? No. The first observation of {xₙ}, x₁, is an unbiased estimator of μ. That is, E[x₁] = μ. But letting n grow is not going to cause x₁ to converge in probability to μ.
Squared-Error Consistency

Definition: Squared-Error Consistency
The sequence {θ̂ₙ} is a squared-error consistent estimator of θ if limₙ→∞ E[(θ̂ₙ − θ)²] = 0. That is, θ̂ₙ →s.m. θ.
• Squared-error consistency implies that both the bias and the variance of the estimator approach zero. Thus, squared-error consistency implies consistency.
Order of a Sequence: Big O and Little o

• “Little o”, o(.): A sequence {xₙ} is o(n^δ) (order less than n^δ) if |n^(−δ) xₙ| → 0, as n → ∞.
Example: xₙ = n³ is o(n⁴), since |n^(−4) xₙ| = 1/n → 0, as n → ∞.
• “Big O”, O(.): A sequence {xₙ} is O(n^δ) (at most of order n^δ) if n^(−δ) xₙ → ψ, as n → ∞ (ψ ≠ 0, constant).
Example: f(z) = 6z⁴ − 2z³ + 5 is O(z⁴) and o(z^(4+δ)) for every δ > 0.
Special case: O(1): constant.
• Order of a sequence of RVs: the order of the variance gives the order of the sequence.
Example: What is the order of the sequence {x̄}? Var[x̄] = σ²/n, which is O(1/n), or O(n⁻¹).

Root n-Consistency
• Q: Let xₙ be a consistent estimator of θ. But how fast does xₙ converge to θ?
The sample mean, x̄, has variance σ²/n, which is O(1/n). That is, the convergence is at the rate n^(−½). This is called “root-n consistency.” Note: n^(½) x̄ has variance of O(1).
• Definition: n^δ convergence. If an estimator has an O(1/n^(2δ)) variance, then we say the estimator is n^δ-convergent.
Example: Suppose Var(xₙ) is O(1/n²). Then, xₙ is n-convergent.
The usual convergence rate is root-n. If an estimator has a faster (higher degree of) convergence, it is called super-consistent.
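The O(1/n) rate of Var(x̄) can be seen by simulation. A sketch (σ, sample sizes, and seed are illustrative): doubling n roughly halves the variance of the sample mean, while n·Var(x̄) stays at the constant σ².

```python
import numpy as np

# Var(xbar) = sigma^2/n is O(1/n); n * Var(xbar) ~ sigma^2 is O(1).
rng = np.random.default_rng(1)
reps, sigma = 40000, 2.0
vars_ = {}
for n in (50, 100, 200):
    xbar = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
    vars_[n] = np.var(xbar)
print(vars_)                 # ~ sigma^2/n: 0.08, 0.04, 0.02
print(100 * vars_[100])      # ~ sigma^2 = 4, i.e. O(1)
```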
Estimation • Two philosophies regarding models (assumptions) in statistics: (1) Parametric statistics. It assumes data come from a type of probability distribution and makes inferences about the parameters of the distribution. Models are parameterized before collecting the data. Example: Maximum likelihood estimation.
(2) Non-parametric statistics. It assumes no probability distribution –i.e., they are “distribution free.” Models are not imposed a priori, but determined by the data. Examples: histograms, kernel density estimation. • In general, parametric statistics makes more assumptions.
Least Squares Estimation
• Long history: Gauss (1795, 1801) used it in astronomy. • Idea: There is a functional form relating Y and k variables X. This function depends on unknown parameters, θ. The relation between Y and X is not exact. There is an error, ε. We will estimate the parameters θ by minimizing the sum of squared errors. (1) Functional form known
yi = f(xi, θ) + εi (2) Typical Assumptions
- f(x, θ) is correctly specified. For example, f(x, θ) = Xβ
- X are fixed numbers with full rank, or E(ε|X) = 0. That is, ε ⊥ X
- ε ~ iid D(0, σ²I)
• Objective function: S(xᵢ, θ) = Σᵢ εᵢ²
• We want to minimize it w.r.t. θ. That is,
minθ S(xᵢ, θ) = Σᵢ εᵢ² = Σᵢ [yᵢ − f(xᵢ, θ)]²
=> dS(xᵢ, θ)/dθ = −2 Σᵢ [yᵢ − f(xᵢ, θ)] f′(xᵢ, θ)
f.o.c. => −2 Σᵢ [yᵢ − f(xᵢ, θLS)] f′(xᵢ, θLS) = 0
Note: The f.o.c. deliver the normal equations. The solution to the normal equations, θLS, is the LS estimator. The estimator θLS is a function of the data (yᵢ, xᵢ).
Suppose we assume a linear functional form, f(x, θ) = Xβ. Using linear algebra, the objective function becomes
S(xᵢ, θ) = Σᵢ εᵢ² = ε′ε = (y − Xβ)′(y − Xβ)
The f.o.c.:
−2 Σᵢ [yᵢ − f(xᵢ, θLS)] f′(xᵢ, θLS) = −2 (y − Xb)′X = 0
where b = βOLS (Ordinary LS; ordinary = linear). Solving for b => b = (X′X)⁻¹X′y
Note: b is a (linear) function of the data (yᵢ, xᵢ).
Least Squares Estimation

The LS estimator of β when f(x, θ) = Xβ is linear is b = (X′X)⁻¹X′y
Note: b is a (linear) function of the data (yᵢ, xᵢ). Moreover,
b = (X′X)⁻¹X′y = (X′X)⁻¹X′(Xβ + ε) = β + (X′X)⁻¹X′ε
Under the typical assumptions, we can establish properties for b:
1) E[b|X] = β
2) Var[b|X] = E[(b − β)(b − β)′|X] = (X′X)⁻¹X′E[εε′|X]X(X′X)⁻¹ = σ²(X′X)⁻¹
Under the typical assumptions, Gauss established that b is BLUE.
3) If ε|X ~ iid N(0, σ²Iₙ) => b|X ~ N(β, σ²(X′X)⁻¹)
4) With some additional assumptions, we can use the CLT to get b|X →a N(β, (σ²/n)(X′X/n)⁻¹)
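The closed form b = (X′X)⁻¹X′y can be computed directly on simulated data. A sketch (the design matrix, true β = (1, 2)′, noise level, and seed are illustrative assumptions):

```python
import numpy as np

# LS estimator via the normal equations on simulated linear-model data.
rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + 1 regressor
beta = np.array([1.0, 2.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)  # solve X'X b = X'y
print(b)                               # close to (1, 2)
```

Using `np.linalg.solve` rather than explicitly inverting X′X is the numerically stabler way to solve the normal equations.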
Maximum Likelihood Estimation
• Idea: Assume a particular distribution with unknown parameters. Maximum likelihood (ML) estimation chooses the set of parameters that maximize the likelihood of drawing a particular sample. • Consider a sample (X1, ... , Xn) which is drawn from a pdf f(X|θ) where θ are parameters. If the Xi’s are independent with pdf f(Xi|θ) the joint probability of the whole sample is:
L(X₁, ..., Xₙ | θ) = ∏ᵢ₌₁ⁿ f(Xᵢ | θ)

The function L(X|θ), also written as L(X; θ), is called the likelihood function. This function can be maximized with respect to θ to produce the maximum likelihood estimates (θ̂MLE).
• It is often convenient to work with the Log of the likelihood function. That is, ln L(X|θ) = Σi ln f(Xi| θ).
• The ML estimation approach is very general. However, if the model is not correctly specified, the estimates are sensitive to the misspecification.
Ronald Fisher (1890 – 1962)
Maximum Likelihood: Example I
Let the sample be X = {5, 6, 7, 8, 9, 10}, drawn from a N(μ, 1). The probability of each of these points, given the unknown mean μ, can be written as:

f(5 | μ) = (2π)^(−1/2) exp[−(5 − μ)²/2]
f(6 | μ) = (2π)^(−1/2) exp[−(6 − μ)²/2]
...
f(10 | μ) = (2π)^(−1/2) exp[−(10 − μ)²/2]
Assume that the sample is independent. Then, the joint pdf can be written as:

L(X | μ) = (2π)^(−6/2) exp{−[(5 − μ)² + (6 − μ)² + ... + (10 − μ)²]/2}

The value of µ that maximizes the likelihood function of the sample is defined by maxμ L(X | μ). It is easier, however, to maximize ln L(X | μ).
maxμ ln L(X | μ) = maxμ {−3 ln(2π) − [(5 − μ)² + (6 − μ)² + ... + (10 − μ)²]/2}

∂ ln L/∂μ = (5 − μ) + (6 − μ) + ... + (10 − μ) = 0
=> μ̂MLE = (5 + 6 + 7 + 8 + 9 + 10)/6 = 7.5 = x̄
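The analytic result μ̂MLE = x̄ = 7.5 can be confirmed by maximizing the log likelihood numerically. A sketch using a simple grid search (grid bounds and resolution are illustrative):

```python
import numpy as np

# Maximize ln L(mu) for the sample {5,...,10} from N(mu, 1) over a grid.
x = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

def log_lik(mu):
    return -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

grid = np.linspace(5, 10, 2001)   # step 0.0025, so 7.5 lies on the grid
mu_hat = grid[np.argmax([log_lik(m) for m in grid])]
print(mu_hat, x.mean())           # both 7.5
```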
• Let’s generalize this example to an i.i.d. sample X = {X₁, X₂, ..., X_T} drawn from a N(μ, σ²). The joint pdf is:

L = ∏ᵢ₌₁ᵀ (2πσ²)^(−1/2) exp[−(Xᵢ − μ)²/(2σ²)] = (2πσ²)^(−T/2) exp[−Σᵢ₌₁ᵀ (Xᵢ − μ)²/(2σ²)]

Then, taking logs, we have:

ln L = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) Σᵢ₌₁ᵀ (Xᵢ − μ)² = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) (X − μ)′(X − μ)

We take first derivatives:

∂ ln L/∂μ = (1/σ²) Σᵢ₌₁ᵀ (Xᵢ − μ)
∂ ln L/∂σ² = −T/(2σ²) + (1/(2σ⁴)) Σᵢ₌₁ᵀ (Xᵢ − μ)²
• Then, we set the f.o.c. and jointly solve for the ML estimators:

∂ ln L/∂μ = (1/σ̂²) Σᵢ₌₁ᵀ (Xᵢ − μ̂MLE) = 0  =>  μ̂MLE = (1/T) Σᵢ₌₁ᵀ Xᵢ = x̄
∂ ln L/∂σ² = −T/(2σ̂²) + (1/(2σ̂⁴)) Σᵢ₌₁ᵀ (Xᵢ − μ̂MLE)² = 0  =>  σ̂²MLE = (1/T) Σᵢ₌₁ᵀ (Xᵢ − x̄)²

Note: The MLE of μ is the sample mean. Therefore, it is unbiased.
Note: The MLE of σ² is not s² (it divides by T, not T − 1). Therefore, it is biased!
• We will work the previous example with matrix notation. Suppose we assume:

y = Xβ + ε,  εᵢ ~ N(0, σ²),  or, in matrix form,  ε ~ N(0, σ²I_T)

where Xᵢ is a 1×k vector of exogenous numbers and β is a k×1 vector of unknown parameters. Then, the joint likelihood function becomes:

L = ∏ᵢ₌₁ᵀ (2πσ²)^(−1/2) exp[−εᵢ²/(2σ²)] = (2πσ²)^(−T/2) exp[−Σᵢ₌₁ᵀ εᵢ²/(2σ²)]

• Then, taking logs, we have the log likelihood function:

ln L = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) (y − Xβ)′(y − Xβ)
Maximum Likelihood: Example II
• We take first derivatives of the log likelihood w.r.t. β and σ²:

∂ ln L/∂β = (1/σ²) Σᵢ₌₁ᵀ xᵢ′εᵢ = (1/σ²) X′ε
∂ ln L/∂σ² = −T/(2σ²) + (1/(2σ⁴)) ε′ε

• Using the f.o.c., we jointly estimate β and σ²:

∂ ln L/∂β = (1/σ̂²) X′(y − Xβ̂MLE) = 0  =>  β̂MLE = (X′X)⁻¹X′y
∂ ln L/∂σ² = −T/(2σ̂²) + (1/(2σ̂⁴)) e′e = 0  =>  σ̂²MLE = e′e/T

where e = y − Xβ̂MLE.
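The f.o.c. solutions β̂MLE = (X′X)⁻¹X′y and σ̂²MLE = e′e/T can be checked on simulated data. A sketch (T, k, the true β, the error scale, and the seed are illustrative assumptions):

```python
import numpy as np

# ML estimates of (beta, sigma^2) in the normal linear model.
rng = np.random.default_rng(3)
T, k = 1000, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta = np.array([0.5, -1.0, 2.0])
y = X @ beta + rng.normal(scale=1.5, size=T)   # true sigma^2 = 2.25

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # same as OLS b
e = y - X @ beta_hat
sigma2_ml = (e @ e) / T                        # ML estimator: divide by T (biased)
s2 = (e @ e) / (T - k)                         # unbiased OLS estimator
print(beta_hat, sigma2_ml, s2)                 # sigma^2 estimates near 2.25
```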
ML: Score and Information Matrix

Definition: Score (or efficient score)

S(X; θ) = ∂ ln L(X | θ)/∂θ = Σᵢ₌₁ⁿ ∂ ln f(xᵢ | θ)/∂θ

S(X; θ) is called the score of the sample. It is the vector of partial derivatives (the gradient) with respect to the parameter θ. If we have k parameters, the score will have a k×1 dimension.

Definition: Fisher information for a single observation:

I(θ) = E[(∂ ln f(X | θ)/∂θ)²]

I(θ) is sometimes just called information. It measures the shape of ln f(X | θ).
• The concept of information can be generalized for the k-parameter case. In this case:

I(θ) = E[(∂ ln L(X | θ)/∂θ)(∂ ln L(X | θ)/∂θ′)]

This is a k×k matrix. If L is twice differentiable with respect to θ, and under certain regularity conditions, then the information may also be written as

I(θ) = −E[∂² ln L(X | θ)/∂θ∂θ′]

I(θ) is called the information matrix (the negative expected Hessian). It measures the shape of the likelihood function.
• Properties of S(X; θ):
(1) E[S(X; θ)] = 0.

Proof: Start from ∫ f(x; θ) dx = 1. Differentiating both sides w.r.t. θ:

∫ ∂f(x; θ)/∂θ dx = 0
=> ∫ [∂ ln f(x; θ)/∂θ] f(x; θ) dx = 0   (using ∂f/∂θ = f · ∂ ln f/∂θ)
=> E[S(x; θ)] = 0
(2) Var[S(X; θ)] = n I(θ)

Proof: Let’s differentiate the integral ∫ [∂ ln f(x; θ)/∂θ] f(x; θ) dx = 0 once more:

∫ [∂² ln f(x; θ)/∂θ²] f(x; θ) dx + ∫ [∂ ln f(x; θ)/∂θ] [∂f(x; θ)/∂θ] dx = 0
=> ∫ [∂² ln f(x; θ)/∂θ²] f(x; θ) dx + ∫ [∂ ln f(x; θ)/∂θ]² f(x; θ) dx = 0
=> E[(∂ ln f(x; θ)/∂θ)²] = −E[∂² ln f(x; θ)/∂θ²] = I(θ)

Since E[S] = 0 and the scores of the n i.i.d. observations are independent,

Var[S(X; θ)] = n Var[∂ ln f(x; θ)/∂θ] = n I(θ)
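Both score properties can be checked by Monte Carlo. A sketch for N(μ, 1), where the score for μ is S = Σᵢ (xᵢ − μ) and I(μ) = 1 (the parameter values, sample size, and seed are illustrative):

```python
import numpy as np

# Check E[S] = 0 and Var[S] = n * I(theta) for N(mu, 1).
rng = np.random.default_rng(11)
mu, n, reps = 2.0, 25, 40000
x = rng.normal(mu, 1.0, size=(reps, n))
S = (x - mu).sum(axis=1)   # score evaluated at the true parameter

print(S.mean())            # ~ 0
print(S.var())             # ~ n * I(mu) = 25
```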
(3) If the S(xᵢ; θ) are i.i.d. (with finite first and second moments), then we can apply the CLT to get:

Sₙ(X; θ) = Σᵢ S(xᵢ; θ) →a N(0, n I(θ))

Note: This is an important result. It will drive the distribution of ML estimators.
ML: Score and Information Matrix – Example

• Again, we assume:

y = Xβ + ε,  εᵢ ~ N(0, σ²),  ε ~ N(0, σ²I_T)

• Taking logs, we have the log likelihood function:

ln L = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) (y − Xβ)′(y − Xβ)

• The score function collects the first derivatives of ln L w.r.t. θ = (β, σ²):

∂ ln L/∂β = (1/σ²) Σᵢ₌₁ᵀ xᵢ′εᵢ = (1/σ²) X′ε
∂ ln L/∂σ² = −T/(2σ²) + (1/(2σ⁴)) ε′ε
• Then, we take second derivatives to calculate I(θ):

∂² ln L/∂β∂β′ = −(1/σ²) Σᵢ₌₁ᵀ xᵢ′xᵢ = −(1/σ²) X′X
∂² ln L/∂β∂σ² = −(1/σ⁴) Σᵢ₌₁ᵀ xᵢ′εᵢ
∂² ln L/∂σ²∂σ² = T/(2σ⁴) − (1/σ⁶) ε′ε

• Then, taking minus the expectation (using E[ε′ε] = Tσ² and E[X′ε] = 0):

I(θ) = −E[∂² ln L/∂θ∂θ′] = [ (1/σ²)X′X    0
                              0            T/(2σ⁴) ]
In deriving properties (1) and (2), we have made some implicit assumptions, which are called regularity conditions:
(i) θ lies in an open interval of the parameter space, Ω.
(ii) The 1st derivative and 2nd derivatives of f(X; θ) w.r.t. θ exist.
(iii) L(X; θ) can be differentiated w.r.t. θ under the integral sign.
(iv) E[S(X; θ)²] > 0, for all θ in Ω.
(v) T(X) L(X; θ) can be differentiated w.r.t. θ under the integral sign.
Recall: If the S(xᵢ; θ) are i.i.d. and the regularity conditions apply, then we can apply the CLT to get:

S(X; θ) →a N(0, n I(θ))
Theorem: Cramer-Rao inequality Let the random sample (X1, ... , Xn) be drawn from a pdf f(X|θ) and let T=T(X1, ... , Xn) be a statistic such that E[T]=u(θ), differentiable in θ. Let b(θ)= u(θ) - θ, the bias in T. Assume regularity conditions. Then,
Regularity conditions:
(1) θ lies in an open interval Ω of the real line.
(2) For all θ in Ω, δf(X|θ)/δθ is well defined.
(3) ∫L(X|θ)dx can be differentiated wrt. θ under the integral sign
(4) E[S(X;θ)2]>0, for all θ in Ω
(5) ∫T(X) L(X|θ)dx can be differentiated wrt. θ under the integral sign
Var(T) ≥ [u′(θ)]² / (n I(θ)) = [1 + b′(θ)]² / (n I(θ))

ML: Cramer-Rao inequality
The lower bound for Var(T) is called the Cramer-Rao (CR) lower bound.

Corollary: If T(X) is an unbiased estimator of θ, then

Var(T) ≥ (n I(θ))⁻¹

Note: This theorem establishes the superiority of the ML estimate over all others. The CR lower bound is the smallest theoretical variance. It can be shown that ML estimates achieve this bound (asymptotically); therefore, any other estimation technique can at best only equal it.
Proof: For any T(X) and S(X; θ) we have

[Cov(T, S)]² ≤ Var(T) Var(S)   (Cauchy-Schwarz inequality)

Since E[S] = 0, Cov(T, S) = E[TS].
Also, u(θ) = E[T] = ∫ T L(X; θ) dx. Differentiating both sides:

u′(θ) = ∫ T ∂L(X; θ)/∂θ dx = ∫ T [1/L ∂L(X; θ)/∂θ] L dx = ∫ T S L dx = E[TS] = Cov(T, S)

Substituting into the Cauchy-Schwarz inequality:

[u′(θ)]² ≤ Var(T) n I(θ)  =>  Var(T) ≥ [u′(θ)]²/(n I(θ))  ■
Note: For an estimator to achieve the CR lower bound, we need
[Cov(T, S)]² = Var(T) Var(S).
This is possible only if T is a linear function of S. That is,
T(X) = α(θ) S(X; θ) + β(θ)
Since E[T] = α(θ) E[S(X; θ)] + β(θ) = β(θ), we can write
S(X; θ) = ∂ ln L(X; θ)/∂θ = [T(X) − β(θ)]/α(θ)
Integrating both sides w.r.t. θ:
ln L(X; θ) = U(X) − T(X) A(θ) + B(θ)
That is, L(X; θ) = exp{Σᵢ U(Xᵢ) − A(θ) Σᵢ T(Xᵢ) + n B(θ)}
or, for a single observation, f(X; θ) = exp{U(X) − T(X) A(θ) + B(θ)}

That is, the exponential (Pitman-Koopman-Darmois) family of distributions attains the CR lower bound.
• Most of the distributions we have seen belong to this family: normal, exponential, gamma, chi-square, beta, Weibull (if the shape parameter is known), Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative binomial (with known parameter r), and geometric.
• Note: The Chapman–Robbins bound is a lower bound on the variance of estimators of θ. It generalizes the Cramér–Rao bound: it is tighter and can be applied to more situations, for example, when I(θ) does not exist. However, it is usually more difficult to compute.
Cramer-Rao inequality: Multivariate Case

• When we have k parameters, the covariance matrix of the estimator T(X) has a CR lower bound given by:

Covar(T(X)) ≥ (∂u(θ)/∂θ′) I(θ)⁻¹ (∂u(θ)/∂θ′)′

Note: In matrix notation, the inequality A ≥ B means the matrix A − B is positive semidefinite. If T(X) is unbiased, then

Covar(T(X)) ≥ I(θ)⁻¹
C. R. Rao (1920, India) & Harald Cramer (1893-1985, Sweden)
Cramer-Rao inequality: Example

We want to check whether the sample mean and s² for an i.i.d. sample X = {X₁, X₂, ..., Xₙ} drawn from N(μ, σ²) achieve the CR lower bound. Recall:

I(θ) = −E[∂² ln L/∂θ∂θ′] = [ n/σ²   0
                              0      n/(2σ⁴) ]

Since the sample mean and s² are unbiased, the CR lower bound is given by Covar(T) ≥ I(θ)⁻¹. Then,

Var(X̄) ≥ σ²/n  and  Var(s²) ≥ 2σ⁴/n

We have already derived that Var(X̄) = σ²/n and Var(s²) = 2σ⁴/(n − 1). Then, the sample mean achieves its CR bound, but s² does not.
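This can be illustrated by simulation. A sketch for N(0, 1) with n = 10 (parameter values, replication count, and seed are illustrative): the simulated Var(X̄) matches its bound σ²/n, while the simulated Var(s²) ≈ 2σ⁴/(n−1) sits just above its bound 2σ⁴/n.

```python
import numpy as np

# Compare simulated variances of xbar and s2 with their CR lower bounds.
rng = np.random.default_rng(5)
mu, sigma2, n, reps = 0.0, 1.0, 10, 200000
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

var_xbar = np.var(x.mean(axis=1))
var_s2 = np.var(x.var(axis=1, ddof=1))
print(var_xbar, sigma2 / n)          # both ~ 0.1: bound attained
print(var_s2, 2 * sigma2**2 / n)     # ~ 0.222 vs bound 0.2: not attained
```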
Concentrated ML

• We split the parameter vector θ into two vectors:

L(θ) = L(θ₁, θ₂)

Sometimes, we can derive a formula for the ML estimate of θ₂, say:

θ₂ = g(θ₁)

If this is possible, we can write the likelihood function as

L(θ₁, θ₂) = L(θ₁, g(θ₁)) = L*(θ₁)

This is the concentrated likelihood function.
• This process is often useful as it reduces the number of parameters to be estimated.
Concentrated ML: Example

• The normal log likelihood function can be written as:

ln L(μ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) Σᵢ₌₁ⁿ (Xᵢ − μ)²

• This expression can be solved for the optimal choice of σ² by differentiating with respect to σ²:

∂ ln L(μ, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σᵢ₌₁ⁿ (Xᵢ − μ)² = 0
=> σ̂²MLE(μ) = (1/n) Σᵢ₌₁ⁿ (Xᵢ − μ)²
• Substituting this result into the original log likelihood produces the concentrated log likelihood:

ln L*(μ) = −(n/2) ln(2π) − (n/2) ln[(1/n) Σᵢ₌₁ⁿ (Xᵢ − μ)²] − n/2
• Intuitively, the concentrated ML estimator of µ is the value that minimizes Σᵢ (Xᵢ − µ)², the sum of squared errors. Thus, the least squares estimate of the mean of a normal distribution is the same as the ML estimator under the assumption that the sample is i.i.d.
Properties of ML Estimators

(1) Efficiency. Under general conditions, we have that

Var(θ̂MLE) ≥ (n I(θ))⁻¹

The right-hand side is the Cramer-Rao lower bound (CR-LB). If an estimator can achieve this bound, ML will produce it.
(2) Consistency. We know that E[S(Xᵢ; θ)] = 0 and Var[S(Xᵢ; θ)] = I(θ). The consistency of ML can be shown by applying Khinchine’s LLN to S(Xᵢ; θ) and then to Sₙ(X; θ) = Σᵢ S(Xᵢ; θ). Then, do a 1st-order Taylor expansion of Sₙ(X; θ̂MLE) around θ:

Sₙ(X; θ̂MLE) = Sₙ(X; θ) + Sₙ′(X; θₙ*)(θ̂MLE − θ),  with |θₙ* − θ| ≤ |θ̂MLE − θ|

Since Sₙ(X; θ̂MLE) = 0 at the MLE, Sₙ(X; θ) and (θ̂MLE − θ) converge together to zero (i.e., in expectation).
(3) Theorem: Asymptotic Normality
Let the likelihood function be L(X₁, X₂, ..., Xₙ | θ). Under general conditions, the MLE of θ is asymptotically distributed as

θ̂MLE →a N(θ, (n I(θ))⁻¹)

Sketch of a proof. Using the CLT, we have already established Sₙ(X; θ) →a N(0, n I(θ)). Then, using a first-order Taylor expansion as before, we get

n^(−1/2) Sₙ(X; θ) = −[n⁻¹ Sₙ′(X; θₙ*)] n^(1/2) (θ̂MLE − θ)

Notice that E[Sₙ′(xᵢ; θ)] = −I(θ). Then, apply the LLN to get Sₙ′(X; θₙ*)/n →p −I(θ) (using θₙ* →p θ). Now, algebra and Slutsky’s theorem for RVs deliver the final result.
(4) Sufficiency. If a single sufficient statistic exists for θ, the MLE of θ must be a function of it. That is, θ̂MLE depends on the sample observations only through the value of a sufficient statistic.
(5) Invariance. The ML estimate is invariant under functional transformations. That is, if θ̂MLE is the MLE of θ and g(θ) is a function of θ, then g(θ̂MLE) is the MLE of g(θ).
Quasi Maximum Likelihood

• ML rests on the assumption that the errors follow a particular distribution (OLS is only ML if the errors are normal).
• Q: What happens if we make the wrong assumption? White (Econometrica, 1982) shows that, under broad assumptions about the misspecification of the error process, θ̂MLE is still a consistent estimator. The estimation is called Quasi ML.
• But the covariance matrix is no longer I(θ)⁻¹; instead, it is given by the sandwich form

Var[θ̂] = I(θ̂)⁻¹ [S(θ̂) S(θ̂)′] I(θ̂)⁻¹

• In general, Wald and LM tests remain valid when this corrected covariance matrix is used. But LR tests are invalid, since they work directly from the value of the likelihood function.
ML Estimation: Numerical Optimization

• In simple cases, like OLS, we can calculate the ML estimates from the f.o.c.’s, i.e., analytically. But in most situations we cannot.
• We resort to numerical optimization of the likelihood function.
• Think of hill climbing in parameter space. There are many algorithms to do this.
• General steps:
(1) Set an arbitrary initial set of parameters, i.e., starting values.
(2) Determine a direction of movement (for example, by dL/dθ).
(3) Determine a step length to move (for example, by d²L/dθ²).
(4) Check convergence criteria and either stop or go back to (2).
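The general steps can be sketched as a Newton-Raphson hill climb. A minimal sketch, assuming a Poisson(λ) sample (the true λ, sample size, starting value, and seed are illustrative; here the analytic MLE x̄ is known, which lets us verify the numerical answer):

```python
import numpy as np

# Newton-Raphson ML for the Poisson rate: lnL = sum(x)*ln(lam) - n*lam + const.
rng = np.random.default_rng(8)
x = rng.poisson(lam=3.0, size=500).astype(float)

lam = 1.0                                  # (1) arbitrary starting value
for _ in range(100):
    score = x.sum() / lam - len(x)         # (2) direction: dlnL/dlam
    hess = -x.sum() / lam ** 2             # (3) step length via d2lnL/dlam2
    step = -score / hess
    lam += step
    if abs(step) < 1e-10:                  # (4) convergence check
        break

print(lam, x.mean())   # Newton solution equals the analytic MLE xbar
```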
[Figure: a likelihood surface L(β₁, β₂), with the hill-climbing path rising toward the maximum at β*]
Method of Moments (MM) Estimation

• Simple idea: Suppose the first moment (the mean) is determined by the distribution f(X, θ). The observed first moment from a sample of n observations is

m₁ = (1/n) Σᵢ₌₁ⁿ xᵢ

Hence, we can retrieve the parameter θ by inverting the moment equation implied by f(X, θ):

m₁ = E[X | θ]  =>  θ̂MM solves this equation
• Example: Mean of a Poisson pdf: f(x) = exp(−λ) λˣ/x!
E[X] = λ => plim (1/n) Σᵢ xᵢ = λ. Then, the MM estimator of λ is the sample mean of X => λMM = x̄.
• Let’s complicate the MM idea: Now, suppose we have a model. This model implies certain knowledge about the moments of the distribution. Then, we invert the model to give us estimates of its unknown parameters, which match the theoretical moments for a given sample.
• Example: Mean of an Exponential pdf: f(x; λ) = λ e^(−λx)
E[X] = 1/λ => plim (1/n) Σᵢ xᵢ = 1/λ. Then, λMM = 1/x̄.
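Both MM examples amount to one line of code each. A sketch (true parameter values, sample sizes, and seed are illustrative):

```python
import numpy as np

# MM estimators: Poisson lambda_hat = xbar; Exponential lambda_hat = 1/xbar.
rng = np.random.default_rng(9)
pois = rng.poisson(lam=4.0, size=10000)
expo = rng.exponential(scale=1 / 3.0, size=10000)   # true lambda = 3

lam_pois_mm = pois.mean()
lam_expo_mm = 1.0 / expo.mean()
print(lam_pois_mm, lam_expo_mm)   # ~ 4 and ~ 3
```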
MM Estimation

• We have a model Y = h(X, θ), where θ is a vector of k parameters. Under this model, we know what some moments of the distribution should be. That is, the model provides us with k conditions (or moments), which should be met:

E[g(Y, X | θ)] = 0

• In this case, the (population) first moment of g(Y, X, θ) equals 0. Then, we approximate the k moments, i.e., E(g), with their sample measures and invert g to get an estimate of θ:

θ̂MM = g⁻¹(Y, X, 0)

θ̂MM is the Method of Moments estimator of θ.
Note: In this example we have as many moments (k) as unknown parameters (k). Thus, θ is uniquely and exactly determined.
MM Estimation: Example

We start with a model Y = Xβ + ε. In OLS estimation, we make the assumption that the X’s are orthogonal to the errors. Thus,

E(X′ε) = 0

The sample moment analogue for each xᵢ is

(1/n) Σₜ₌₁ⁿ xᵢₜ eₜ = 0,  or  (1/n) X′e = 0

And, thus,

0 = (1/n) X′e = (1/n) X′(Y − XβMM)  =>  X′Y = X′X βMM

Therefore, the method of moments estimator, βMM, solves the normal equations. That is, βMM is identical to the OLS estimator, b.
Generalized Method of Moments (GMM)

• So far, we have assumed that there are as many moment conditions (l) as unknown parameters (k). The parameters are uniquely and exactly determined.
• If l < k, i.e., fewer moment conditions than parameters, we would not be able to solve for a unique set of parameters (the model would be under-identified).
• If l > k, i.e., more moment conditions than parameters, then all the conditions cannot be met at the same time; the model is over-identified and we have GMM estimation.
If we cannot satisfy all the conditions at the same time, we want to make them all as close to zero as possible simultaneously. We have to figure out a way to weight them.
• Now, we have k parameters but l moment conditions, l > k. Thus,

E[mⱼ(θ)] = 0,  j = 1, ..., l   (l population moments)
m̄ⱼ(θ) = (1/n) Σₜ₌₁ⁿ mⱼₜ(θ) = 0,  j = 1, ..., l   (l sample moments)

• Then, we need to make all l moments as small as possible simultaneously. Let’s use a weighted least squares criterion:

Minθ q(θ) = m̄(θ)′ W m̄(θ)

That is, the weighted squared sum of the moments. The weighting matrix is the l×l matrix W. (Note that we have a quadratic form.)
• First order condition:

2 (∂m̄(θGMM)′/∂θ) W m̄(θGMM) = 0
• The GMM estimator, θGMM, solves this k×1 system of equations. There is typically no closed-form solution for θGMM; it must be obtained through numerical optimization methods.
• If plim m̄(θ) = 0, and W (not a function of θ) is a positive definite matrix, then θGMM is a consistent estimator of θ.
• The optimal W: Any such weighting matrix produces a consistent estimator of θ. We can select the most efficient one, i.e., the optimal W. The optimal W is the inverse of the covariance matrix of the moment conditions. Thus,

Optimal W = W* = [Asy. Var(m̄)]⁻¹
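An over-identified toy case makes the criterion concrete. A sketch, not from the slides: l = 2 moments (mean and second moment of an Exponential(λ)) against k = 1 parameter, with W = I and a plain grid search standing in for a numerical optimizer (true λ, sample size, grid, and seed are illustrative assumptions):

```python
import numpy as np

# Over-identified GMM for Exponential(lambda): E[X] = 1/lam, E[X^2] = 2/lam^2.
rng = np.random.default_rng(10)
x = rng.exponential(scale=1 / 2.0, size=5000)   # true lambda = 2

def q(lam):
    m = np.array([x.mean() - 1 / lam,
                  (x ** 2).mean() - 2 / lam ** 2])
    return m @ np.eye(2) @ m    # weighted squared sum of the moments, W = I

grid = np.linspace(1.0, 3.0, 4001)
lam_gmm = grid[np.argmin([q(l) for l in grid])]
print(lam_gmm)   # ~ 2
```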
Properties of the GMM estimator

(1) Consistency. If plim m̄(θ) = 0, and W (not a function of θ) is a pd matrix, then under some conditions, θGMM →p θ.
(2) Asymptotic Normality. Under some general conditions, θGMM →a N(θ, VGMM), where

VGMM = (1/n) [G′V⁻¹G]⁻¹

G is the matrix of derivatives of the moments with respect to the parameters, and V = Var(n^(1/2) m̄(θ)).

Lars Peter Hansen (1952)
Bayesian Estimation: Bayes’ Theorem

• Recall Bayes’ Theorem:

Prob(θ | X) = Prob(X | θ) Prob(θ) / Prob(X)
- P(θ): Prior probability about parameter θ.
- P(X|θ): Probability of observing the data, X, conditional on θ. This conditional probability is called the likelihood, i.e., the probability that X will be the outcome of the experiment, which depends on θ.
- P(θ|X): Posterior probability, i.e., the probability assigned to θ after X is observed.
- P(X): Marginal probability of X. This is the prior probability of witnessing the data X under all possible scenarios for θ, and it depends on the prior probabilities given to each θ.
• Example: Courtroom – Guilty vs. Non-guilty
G: Event that the defendant is guilty.
E: Event that the defendant's DNA matches DNA found at the crime scene.
The jurors, after initial questions, form a personal belief about the defendant’s guilt. This initial belief is the prior.
The jurors, after seeing the DNA evidence (event E), will update their prior beliefs. This update is the posterior.
• Example: Courtroom – Guilty vs. Non-guilty
- P(G): Juror’s personal estimate of the probability that the defendant is guilty, based on evidence other than the DNA match. (Say, .30.)
- P(E|G): Probability of seeing event E if the defendant is actually guilty. (In our case, it should be near 1.)
- P(E): E can happen in two ways: the defendant is guilty and the DNA match is correct, or the defendant is not guilty and the DNA match is incorrect (a one-in-a-million chance).
- P(G|E): Probability that the defendant is guilty given a DNA match:

Prob(G|E) = Prob(E|G) Prob(G) / Prob(E) = (1 × .3) / (1 × .3 + 10⁻⁶ × .7) = .999998
Bayesian Estimation: Viewpoints

• Implicitly, in our previous discussions about estimation (MLE), we adopted a classical viewpoint.
– We had some process generating random observations.
– This random process was a function of fixed, but unknown, parameters.
– Then, we designed procedures to estimate these unknown parameters based on observed data.
• For example, we assume a random process such as CEO compensation. This CEO compensation process can be characterized by a normal distribution.
– We can estimate the parameters of this distribution using maximum likelihood.
– The likelihood of a particular sample can be expressed as

L(X₁, X₂, ..., Xₙ | µ, σ²) = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σᵢ₌₁ⁿ (Xᵢ − µ)²]

– Our estimates of µ and σ² are then based on the values of the parameters that maximize the likelihood of drawing that sample.
• Turning the classical process around slightly, a Bayesian viewpoint starts with some kind of probability statement about the parameters (a prior). Then, the data, X, are used to update our prior beliefs (a posterior).
– First, assume that our prior beliefs about the distribution function can be expressed as a probability density function π(θ), where θ is the parameter we are interested in estimating.
– Based on a sample, i.e., the likelihood function L(X, θ), we can update our knowledge of the distribution using Bayes’ theorem:

π(θ | X) = L(X | θ) π(θ) / ∫ L(X | θ) π(θ) dθ
Thomas Bayes (1702–April 17, 1761)
Bayesian Estimation: Example

• Assume a Bernoulli setting. Our prior is that P in the Bernoulli distribution is distributed Β(α, β):

π(P) = f(P; α, β) = P^(α−1) (1 − P)^(β−1) / B(α, β)

where

B(α, β) = ∫₀¹ x^(α−1) (1 − x)^(β−1) dx = Γ(α) Γ(β) / Γ(α + β)

so that

π(P) = [Γ(α + β) / (Γ(α) Γ(β))] P^(α−1) (1 − P)^(β−1)
• Assume that we are interested in forming the posterior distribution after a single draw, X:

π(P | X) = P^X (1 − P)^(1−X) [Γ(α + β)/(Γ(α)Γ(β))] P^(α−1) (1 − P)^(β−1) / ∫₀¹ P^X (1 − P)^(1−X) [Γ(α + β)/(Γ(α)Γ(β))] P^(α−1) (1 − P)^(β−1) dP

= P^(α+X−1) (1 − P)^(β−X) / ∫₀¹ P^(α+X−1) (1 − P)^(β−X) dP
• Following the original specification of the beta function, let α* = α + X and β* = β − X + 1. Then

∫₀¹ P^(α+X−1) (1 − P)^(β−X) dP = ∫₀¹ P^(α*−1) (1 − P)^(β*−1) dP = Γ(α + X) Γ(β − X + 1) / Γ(α + β + 1)
• The posterior distribution, the distribution of P after the observation, is then

π(P | X) = [Γ(α + β + 1) / (Γ(α + X) Γ(β − X + 1))] P^(α+X−1) (1 − P)^(β−X)
• The Bayesian estimate of P is the value that minimizes a loss function. Several loss functions can be used, but we will focus on the quadratic loss function, consistent with mean squared error:

min_P̂ E[(P̂ − P)²]  =>  ∂E[(P̂ − P)²]/∂P̂ = 2 E[P̂ − P] = 0  =>  P̂ = E[P]
• Taking the expectation of the posterior distribution yields

E[P] = [Γ(α + β + 1) / (Γ(α + X) Γ(β − X + 1))] ∫₀¹ P^(α+X) (1 − P)^(β−X) dP
• As before, we solve the integral, now creating α* = α + X + 1 and β* = β − X + 1. The integral then becomes

∫₀¹ P^(α+X) (1 − P)^(β−X) dP = Γ(α*) Γ(β*) / Γ(α* + β*) = Γ(α + X + 1) Γ(β − X + 1) / Γ(α + β + 2)

so that

E[P] = [Γ(α + β + 1) Γ(α + X + 1)] / [Γ(α + β + 2) Γ(α + X)]
• This can be simplified using the fact that Γ(α + 1) = α Γ(α):

E[P] = [Γ(α + β + 1) (α + X) Γ(α + X)] / [(α + β + 1) Γ(α + β + 1) Γ(α + X)] = (α + X) / (α + β + 1)
• To make this estimation process operational, assume that we have a prior distribution with parameters α=β=1.4968 that yields a beta distribution with a mean P of 0.5 and a variance of the estimate of 0.0625.
• Extending the results to n Bernoulli trials yields

π(P | Y) = [Γ(α + β + n) / (Γ(α + Y) Γ(β + n − Y))] P^(α+Y−1) (1 − P)^(β+n−Y−1)

where Y is the sum of the individual X’s, i.e., the number of heads in the sample. The estimated value of P then becomes:

P̂ = (α + Y) / (α + β + n)
• Suppose in a first sample Y = 15 and n = 50. This yields an estimated value of P of 0.31129, which compares with the maximum likelihood estimate of 0.3000. Since the maximum likelihood estimator in this case is unbiased, the results imply that the Bayesian estimator is biased.
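The worked numbers above reduce to a two-line computation. A sketch of the Beta(1.4968, 1.4968) prior updated with Y = 15 heads in n = 50 trials:

```python
# Beta-Bernoulli posterior mean vs. the ML estimate Y/n.
alpha = beta = 1.4968   # prior parameters: mean 0.5, variance ~0.0625
Y, n = 15, 50

p_bayes = (alpha + Y) / (alpha + beta + n)   # posterior mean = Bayes estimate
p_mle = Y / n
print(round(p_bayes, 5), p_mle)              # ~0.3113 vs 0.3
```

The Bayes estimate is pulled from the sample frequency 0.30 toward the prior mean 0.50; as n grows, the prior's influence vanishes.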