
Chapter 7 Estimation

Criteria for Estimators

• Main problem in stats: estimation of population parameters, say θ.
• Recall that an estimator is a statistic, i.e., any function of the observations {xᵢ} with values in the parameter space.
• There are many estimators of θ. Question: Which is better?
• Criteria for estimators: (1) Unbiasedness (2) Efficiency (3) Sufficiency (4) Consistency

Unbiasedness

Definition: Unbiasedness
An unbiased estimator, say θ̂, has an expected value equal to the value of the population parameter being estimated, θ. That is, E[θ̂] = θ.
Example: E[x̄] = μ; E[s²] = σ².

Efficiency

Definition: Efficiency / Mean squared error
An estimator is efficient if it estimates the parameter of interest in some best way. The notion of "best way" relies on the choice of a loss function. The usual choice of loss function is the quadratic, ℓ(e) = e², resulting in the mean squared error (MSE) criterion of optimality:

MSE(θ̂) = E[(θ̂ − θ)²] = E[((θ̂ − E[θ̂]) + (E[θ̂] − θ))²] = Var(θ̂) + [b(θ)]²

where b(θ) = E[θ̂] − θ is the bias in θ̂. The MSE is the sum of the variance and the square of the bias => trade-off: a biased estimator can have a lower MSE than an unbiased estimator.

Note: The most efficient estimator among a group of unbiased estimators is the one with the smallest variance => BUE (best unbiased estimator).
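A quick numerical illustration of this trade-off (a Python sketch, not part of the original deck; n, σ², and the seed are arbitrary): the variance estimator with divisor n is biased but can beat the unbiased s² on MSE.

import numpy as np

# Compare the unbiased s^2 (divisor n-1) with the biased ML variance
# estimator (divisor n) for normal data: the biased one trades a little
# bias for less variance and ends up with a lower MSE.
rng = np.random.default_rng(42)
n, sigma2, reps = 20, 4.0, 100_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = x.var(axis=1, ddof=1)   # divisor n-1, E[s2] = sigma2
s2_biased = x.var(axis=1, ddof=0)     # divisor n, biased downward

for name, est in [("unbiased (n-1)", s2_unbiased), ("biased (n)", s2_biased)]:
    bias = est.mean() - sigma2
    mse = ((est - sigma2) ** 2).mean()
    print(f"{name:>15}: bias = {bias:+.4f}, MSE = {mse:.4f}")
# Expected pattern: the biased estimator shows nonzero bias but smaller MSE.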

Efficiency

Now we can compare estimators and select the "best" one.
Example: Three different estimators' distributions
– 1 and 2: expected value = population parameter (unbiased)
– 3: positively biased
– Variance decreases from 1, to 2, to 3 (3 has the smallest)
– 3 can have the smallest MSE. 2 is more efficient than 1.

[Figure: sampling distributions of estimators 1, 2, and 3, based on samples of the same size, plotted against the value of the estimator and centered near θ]

Relative Efficiency

It is difficult to prove that an estimator is the best among all estimators, so a relative concept is usually used.

Definition: Relative efficiency
Relative Efficiency = Variance of first estimator / Variance of second estimator

Example: Sample mean vs. sample median
Variance of sample mean = σ²/n
Variance of sample median (large n, normal data) = πσ²/2n
Var[median]/Var[mean] = (πσ²/2n) / (σ²/n) = π/2 ≈ 1.57
The sample median is 1.57 times less efficient than the sample mean.
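A small Monte Carlo check of this ratio (an illustrative sketch; n, the number of replications, and the seed are arbitrary choices):

import numpy as np

# For normal samples, the variance of the sample median should be
# roughly pi/2 times that of the sample mean.
rng = np.random.default_rng(0)
n, reps = 101, 50_000
x = rng.normal(0.0, 1.0, size=(reps, n))

var_mean = x.mean(axis=1).var()
var_median = np.median(x, axis=1).var()
print("Var[median]/Var[mean] =", var_median / var_mean)  # ~ pi/2 = 1.57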


Asymptotic Efficiency

• We compare two sample statistics in terms of their variances. The statistic with the smallest variance is called efficient.
• When we look at asymptotic efficiency, we look at the asymptotic variance of two statistics as n grows. Note that if we compare two consistent estimators, both variances eventually go to zero.
Example: Random sampling from the normal distribution
• The sample mean is asymptotically Normal[μ, σ²/n]
• The median is asymptotically Normal[μ, (π/2)σ²/n]
• The mean is asymptotically more efficient

Sufficiency

• Definition: Sufficiency
A statistic is sufficient when no other statistic that can be calculated from the same sample provides any additional information about the value of the parameter of interest. Equivalently, conditional on the value of a sufficient statistic for a parameter, the joint probability distribution of the data does not depend on that parameter. That is, if
P(X = x | T(X) = t, θ) = P(X = x | T(X) = t)
we say that T is a sufficient statistic.
• The sufficient statistic contains all the information needed to estimate the population parameter. It is OK to 'get rid' of the original data, while keeping only the value of the sufficient statistic.

Sufficiency

• Visualize sufficiency: Consider a Markov chain θ → T(X₁, ..., Xₙ) → {X₁, ..., Xₙ} (although in classical statistics θ is not a RV). Conditioned on the middle part of the chain, the front and back are independent.

Theorem
Let p(x, θ) be the pdf of X and q(t, θ) be the pdf of T(X). Then, T(X) is a sufficient statistic for θ if, for every x in the sample space, the ratio
p(x, θ) / q(T(x), θ)
is constant as a function of θ.

Example: Normal sufficient statistic: Let X₁, X₂, ..., Xₙ be i.i.d. N(μ, σ²) where the variance is known. The sample mean, x̄, is the sufficient statistic for μ.

Proof: Let's start with the joint distribution function:

f(x|μ) = ∏ᵢ₌₁ⁿ (2πσ²)^(−1/2) exp[−(xᵢ − μ)²/(2σ²)]
       = (2πσ²)^(−n/2) exp[−Σᵢ₌₁ⁿ (xᵢ − μ)²/(2σ²)]

• Next, add and subtract the sample mean:

f(x|μ) = (2πσ²)^(−n/2) exp[−Σᵢ₌₁ⁿ (xᵢ − x̄ + x̄ − μ)²/(2σ²)]
       = (2πσ²)^(−n/2) exp{−[Σᵢ₌₁ⁿ (xᵢ − x̄)² + n(x̄ − μ)²]/(2σ²)}
• Recall that the distribution of the sample mean is

q(T(x)|θ) = (2πσ²/n)^(−1/2) exp[−n(x̄ − μ)²/(2σ²)]

• The ratio of the information in the sample to the information in the statistic becomes independent of μ:

f(x|θ)/q(T(x)|θ) = (2πσ²)^(−n/2) exp{−[Σᵢ(xᵢ − x̄)² + n(x̄ − μ)²]/(2σ²)} / {(2πσ²/n)^(−1/2) exp[−n(x̄ − μ)²/(2σ²)]}

f(x|θ)/q(T(x)|θ) = n^(−1/2) (2πσ²)^(−(n−1)/2) exp[−Σᵢ(xᵢ − x̄)²/(2σ²)]

which does not depend on μ. Thus, x̄ is sufficient for μ.
Sufficiency

Theorem: Factorization Theorem
Let f(x|θ) denote the joint pdf or pmf of a sample X. A statistic T(X) is a sufficient statistic for θ if and only if there exist functions g(t|θ) and h(x) such that, for all sample points x and all parameter points θ,

f(x|θ) = g(T(x)|θ) h(x)

• Sufficient statistics are not unique. From the factorization theorem it is easy to see that (i) the identity function T(X) = X is a sufficient statistic vector and (ii) if T is a sufficient statistic for θ then so is any 1-1 function of T. Then, we have minimal sufficient statistics.

Definition: Minimal sufficiency
A sufficient statistic T(X) is called a minimal sufficient statistic if, for any other sufficient statistic T′(X), T(X) is a function of T′(X).

Consistency

Definition: Consistency
The estimator converges in probability to the population parameter being estimated as n (the sample size) becomes larger. That is, θ̂ₙ →p θ. We say that θ̂ₙ is a consistent estimator of θ.

Example: x̄ is a consistent estimator of μ (the population mean).

• Q: Does unbiasedness imply consistency? No. The first observation of {xₙ}, x₁, is an unbiased estimator of μ. That is, E[x₁] = μ. But letting n grow is not going to cause x₁ to converge in probability to μ.
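A simulation sketch of this point (assumed values for μ, σ, and the seed): the sample mean tightens around μ as n grows, while the unbiased estimator x₁ does not improve.

import numpy as np

# x_bar is consistent for mu; x_1 is unbiased but not consistent.
rng = np.random.default_rng(1)
mu = 5.0
x = rng.normal(mu, 2.0, size=100_000)

for n in (10, 100, 10_000, 100_000):
    print(f"n={n:>6}: x_bar = {x[:n].mean():.4f}, x_1 = {x[0]:.4f}")
# x_bar tightens around mu = 5; x_1 stays wherever it landed.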

Squared-Error Consistency

Definition: Squared Error Consistency
The sequence {θ̂ₙ} is a squared-error consistent estimator of θ if
lim_{n→∞} E[(θ̂ₙ − θ)²] = 0.
That is, θ̂ₙ →m.s. θ.

• Squared-error consistency implies that both the bias and the variance of the estimator approach zero. Thus, squared-error consistency implies consistency.

Order of a Sequence: Big O and Little o

• "Little o", o(.): A sequence {xₙ} is o(n^δ) (order less than n^δ) if |n^(−δ) xₙ| → 0 as n → ∞.
Example: xₙ = n³ is o(n⁴) since |n^(−4) xₙ| = 1/n → 0 as n → ∞.
• "Big O", O(.): A sequence {xₙ} is O(n^δ) (at most of order n^δ) if n^(−δ) xₙ → ψ as n → ∞ (ψ ≠ 0, constant).
Example: f(z) = 6z⁴ − 2z³ + 5 is O(z⁴) and o(z^(4+δ)) for every δ > 0.
Special case: O(1): constant.
• Order of a sequence of RVs: The order of the variance gives the order of the sequence.
Example: What is the order of the sequence {x̄}? Var[x̄] = σ²/n, which is O(1/n), or O(n⁻¹).

Root n-Consistency

• Q: Let xₙ be a consistent estimator of θ. But how fast does xₙ converge to θ?
The sample mean, x̄, has variance σ²/n, which is O(1/n). That is, the convergence is at the rate n^(−½). This is called "root n-consistency." Note: √n x̄ has variance of O(1).

• Definition: n^δ convergence
If an estimator has an O(1/n^(2δ)) variance, then we say the estimator is n^δ-convergent.
Example: Suppose var(xₙ) is O(1/n²). Then, xₙ is n-convergent.
The usual convergence is root n. If an estimator has a faster (higher degree of) convergence, it is called super-consistent.

Estimation

• Two philosophies regarding models (assumptions) in statistics:
(1) Parametric statistics. It assumes data come from a certain type of probability distribution and makes inferences about the parameters of that distribution. Models are parameterized before collecting the data. Example: maximum likelihood estimation.
(2) Non-parametric statistics. It assumes no probability distribution, i.e., it is "distribution free." Models are not imposed a priori, but determined by the data. Examples: histograms, kernel density estimation.
• In general, parametric statistics makes more assumptions.

Least Squares Estimation

• Long history: Gauss (1795, 1801) used it in astronomy.
• Idea: There is a functional form relating Y and k variables X. This function depends on unknown parameters, θ. The relation between Y and X is not exact. There is an error, ε. We will estimate the parameters θ by minimizing the sum of squared errors.
(1) Functional form known
yᵢ = f(xᵢ, θ) + εᵢ
(2) Typical assumptions
- f(x, θ) is correctly specified. For example, f(x, θ) = Xβ
- X are numbers with full rank, or E(ε|X) = 0; that is, ε ⊥ X
- ε ~ i.i.d. D(0, σ²I)

Least Squares Estimation

• Objective function: S(xᵢ, θ) = Σᵢ εᵢ²

• We want to minimize w.r.t. θ. That is,
min_θ S(xᵢ, θ) = Σᵢ εᵢ² = Σᵢ [yᵢ − f(xᵢ, θ)]²
=> dS(xᵢ, θ)/dθ = −2 Σᵢ [yᵢ − f(xᵢ, θ)] f′(xᵢ, θ)
f.o.c. => −2 Σᵢ [yᵢ − f(xᵢ, θ_LS)] f′(xᵢ, θ_LS) = 0
Note: The f.o.c. deliver the normal equations. The solution to the normal equations, θ_LS, is the LS estimator. The estimator θ_LS is a function of the data (yᵢ, xᵢ).

Least Squares Estimation

Suppose we assume a linear functional form: f(x, θ) = Xβ. Using linear algebra, the objective function becomes
S(xᵢ, θ) = Σᵢ εᵢ² = ε′ε = (y − Xβ)′(y − Xβ)
The f.o.c.:
−2 Σᵢ [yᵢ − f(xᵢ, θ_LS)] f′(xᵢ, θ_LS) = −2 (y − Xb)′X = 0
where b = β_OLS (Ordinary LS; ordinary = linear). Solving for b => b = (X′X)⁻¹X′y
Note: b is a (linear) function of the data (yᵢ, xᵢ).
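A minimal sketch of this formula on simulated data (the design, β values, and seed below are made up for the illustration); solving the normal equations directly is numerically preferable to forming the inverse:

import numpy as np

# b = (X'X)^{-1} X'y on simulated data with a constant and 3 regressors.
rng = np.random.default_rng(7)
n, beta = 200, np.array([1.0, 2.0, -0.5, 0.3])

X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ beta + rng.normal(scale=1.0, size=n)

# Solve the normal equations X'X b = X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)
print("b =", b)  # should be close to beta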

Least Squares Estimation

The LS estimator of β when f(x, θ) = Xβ is linear is
b = (X′X)⁻¹X′y
Note: b is a (linear) function of the data (yᵢ, xᵢ). Moreover,
b = (X′X)⁻¹X′y = (X′X)⁻¹X′(Xβ + ε) = β + (X′X)⁻¹X′ε
Under the typical assumptions, we can establish properties for b:
1) E[b|X] = β
2) Var[b|X] = E[(b − β)(b − β)′|X] = (X′X)⁻¹X′E[εε′|X]X(X′X)⁻¹ = σ²(X′X)⁻¹
Under the typical assumptions, Gauss established that b is BLUE.
3) If ε|X ~ i.i.d. N(0, σ²Iₙ), then b|X ~ N(β, σ²(X′X)⁻¹)
4) With some additional assumptions, we can use the CLT to get b|X →a N(β, (σ²/n)(X′X/n)⁻¹)

Maximum Likelihood Estimation

• Idea: Assume a particular distribution with unknown parameters. Maximum likelihood (ML) estimation chooses the set of parameters that maximizes the likelihood of drawing a particular sample.
• Consider a sample (X₁, ..., Xₙ) drawn from a pdf f(X|θ), where θ are parameters. If the Xᵢ's are independent with pdf f(Xᵢ|θ), the joint probability of the whole sample is:

L(X₁, ..., Xₙ|θ) = ∏ᵢ₌₁ⁿ f(Xᵢ|θ)

The function L(X|θ) --also written as L(X; θ)-- is called the likelihood function. This function can be maximized with respect to θ to produce maximum likelihood estimates (θ̂_MLE).

Maximum Likelihood Estimation

• It is often convenient to work with the log of the likelihood function. That is, ln L(X|θ) = Σᵢ ln f(Xᵢ|θ).

• The ML estimation approach is very general. However, if the model is not correctly specified, the estimates are sensitive to the misspecification.

Ronald Fisher (1890 – 1962)

Maximum Likelihood: Example I

Let the sample be X = {5, 6, 7, 8, 9, 10}, drawn from a Normal(μ, 1). The probability of each of these points based on the unknown mean, μ, can be written as:

f(5|μ) = (2π)^(−1/2) exp[−(5 − μ)²/2]
f(6|μ) = (2π)^(−1/2) exp[−(6 − μ)²/2]
...
f(10|μ) = (2π)^(−1/2) exp[−(10 − μ)²/2]

Assume that the sample is independent.

Maximum Likelihood: Example I

Then, the joint pdf function can be written as:

L(X|μ) = (2π)^(−3) exp{−[(5 − μ)² + (6 − μ)² + ... + (10 − μ)²]/2}

The value of μ that maximizes the likelihood function of the sample can then be defined by
max_μ L(X|μ)
It is easier, however, to maximize ln L(X|μ). That is,

max_μ ln L(X|μ) = K − (5 − μ)²/2 − (6 − μ)²/2 − ... − (10 − μ)²/2
∂/∂μ: (5 − μ) + (6 − μ) + ... + (10 − μ) = 0
=> μ̂_MLE = (5 + 6 + 7 + 8 + 9 + 10)/6 = 7.5 = x̄
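The same answer can be obtained numerically (a sketch; scipy's scalar minimizer is one of many possible choices) by minimizing the negative log likelihood:

import numpy as np
from scipy.optimize import minimize_scalar

# Numerical check of Example I; the closed form gives mu_hat = x_bar = 7.5.
x = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

def neg_loglik(mu):
    # -ln L(X|mu) up to the additive constant (n/2) ln(2*pi)
    return 0.5 * np.sum((x - mu) ** 2)

res = minimize_scalar(neg_loglik)
print("mu_MLE =", res.x, " sample mean =", x.mean())  # both 7.5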

Maximum Likelihood: Example I

• Let's generalize this example to an i.i.d. sample X = {X₁, X₂, ..., X_T} drawn from a Normal(μ, σ²). Then, the joint pdf function is:

L = ∏ᵢ₌₁ᵀ (2πσ²)^(−1/2) exp[−(Xᵢ − μ)²/(2σ²)] = (2πσ²)^(−T/2) exp[−Σᵢ₌₁ᵀ (Xᵢ − μ)²/(2σ²)]

• Then, taking logs, we have:

ln L = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) Σᵢ₌₁ᵀ (Xᵢ − μ)²
     = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) (X − μ)′(X − μ)

• We take first derivatives:

∂ln L/∂μ = (1/σ²) Σᵢ₌₁ᵀ (Xᵢ − μ)
∂ln L/∂σ² = −T/(2σ²) + (1/(2σ⁴)) Σᵢ₌₁ᵀ (Xᵢ − μ)²

Maximum Likelihood: Example I

• Then, we have the f.o.c. and jointly solve for the ML estimators:

∂ln L/∂μ = (1/σ̂²_MLE) Σᵢ₌₁ᵀ (Xᵢ − μ̂_MLE) = 0  =>  μ̂_MLE = (1/T) Σᵢ₌₁ᵀ Xᵢ = X̄

∂ln L/∂σ² = −T/(2σ̂²_MLE) + (1/(2σ̂⁴_MLE)) Σᵢ₌₁ᵀ (Xᵢ − μ̂_MLE)² = 0  =>  σ̂²_MLE = (1/T) Σᵢ₌₁ᵀ (Xᵢ − X̄)²

Note: The MLE of μ is the sample mean. Therefore, it is unbiased.
Note: The MLE of σ² is not s². Therefore, it is biased!

Maximum Likelihood: Example II

• We will work the previous example with matrix notation. Suppose we assume:

yᵢ = Xᵢβ + εᵢ,  εᵢ ~ N(0, σ²)   or   y = Xβ + ε,  ε ~ N(0, σ²I_T)

where Xᵢ is a 1×k vector of exogenous numbers and β is a k×1 vector of unknown parameters. Then, the joint likelihood function becomes:

L = ∏ᵢ₌₁ᵀ (2πσ²)^(−1/2) exp[−εᵢ²/(2σ²)] = (2πσ²)^(−T/2) exp[−Σᵢ₌₁ᵀ εᵢ²/(2σ²)]

• Then, taking logs, we have the log likelihood function:

ln L = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) Σᵢ₌₁ᵀ εᵢ²
     = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) (y − Xβ)′(y − Xβ)

Maximum Likelihood: Example II

• We take first derivatives of the log likelihood w.r.t. β and σ²:

∂ln L/∂β = (1/σ²) Σᵢ₌₁ᵀ xᵢ′εᵢ = (1/σ²) X′ε
∂ln L/∂σ² = −T/(2σ²) + (1/(2σ⁴)) Σᵢ₌₁ᵀ εᵢ² = −T/(2σ²) + (1/(2σ⁴)) ε′ε

• Using the f.o.c., we jointly estimate β and σ²:

∂ln L/∂β = (1/σ̂²_MLE) X′(y − Xβ̂_MLE) = 0  =>  β̂_MLE = (X′X)⁻¹X′y
∂ln L/∂σ² = −T/(2σ̂²_MLE) + (1/(2σ̂⁴_MLE)) e′e = 0  =>  σ̂²_MLE = e′e/T,  where e = y − Xβ̂_MLE

ML: Score and Information Matrix

Definition: Score (or efficient score)

S(X; θ) = ∂ log L(X|θ)/∂θ = Σᵢ₌₁ⁿ ∂ log f(xᵢ|θ)/∂θ

S(X; θ) is called the score of the sample. It is the vector of partial derivatives (the gradient) of the log likelihood with respect to the parameter θ. If we have k parameters, the score will have dimension k×1.

Definition: Fisher information for a single sample:

I(θ) = E[(∂ log f(X|θ)/∂θ)²]

I(θ) is sometimes just called information. It measures the shape of log f(X|θ).

ML: Score and Information Matrix

• The concept of information can be generalized to the k-parameter case. In this case:

E[(∂ log L(X|θ)/∂θ)(∂ log L(X|θ)/∂θ′)] = I(θ)

This is a k×k matrix.

If L is twice differentiable with respect to θ, and under certain regularity conditions, then the information may also be written as

I(θ) = −E[∂² log L(X|θ)/∂θ∂θ′]

I(θ) is called the information matrix (the negative of the expected Hessian). It measures the shape of the likelihood function.

ML: Score and Information Matrix

• Properties of S(X; θ):

(1) E[S(X; θ)] = 0.

Start from ∫ f(x; θ) dx = 1. Differentiating both sides with respect to θ:

∫ ∂f(x; θ)/∂θ dx = 0
∫ [∂ log f(x; θ)/∂θ] f(x; θ) dx = 0    (since ∂f/∂θ = [∂ log f/∂θ] f)
=> E[S(x; θ)] = 0

ML: Score and Information Matrix

(2) Var[S(X; θ)] = n I(θ)

Let's differentiate the above integral once more:

∫ [∂ log f(x; θ)/∂θ] f(x; θ) dx = 0
=> ∫ [∂² log f(x; θ)/∂θ²] f(x; θ) dx + ∫ [∂ log f(x; θ)/∂θ] [∂f(x; θ)/∂θ] dx = 0
=> ∫ [∂² log f(x; θ)/∂θ²] f(x; θ) dx + ∫ [∂ log f(x; θ)/∂θ]² f(x; θ) dx = 0
=> E[(∂ log f(x; θ)/∂θ)²] = −E[∂² log f(x; θ)/∂θ²] = I(θ)

Since E[S] = 0 and the xᵢ are i.i.d.,

Var[S(X; θ)] = n Var[∂ log f(x; θ)/∂θ] = n I(θ)

ML: Score and Information Matrix

(3) If the S(xᵢ; θ) are i.i.d. (with finite first and second moments), then we can apply the CLT to get:

Sₙ(X; θ) = Σᵢ S(xᵢ; θ) →a N(0, n I(θ))

Note: This is an important result. It will drive the distribution of MLE estimators.
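A simulation sketch of properties (1)-(3) for a Normal(μ, σ²) with known σ² (parameter values and seed are assumed for the demo): the score for μ is S = Σᵢ(xᵢ − μ)/σ², with I(μ) = 1/σ².

import numpy as np

# Check E[S] = 0 and Var[S] = n * I(mu) = n / sigma^2 by simulation.
rng = np.random.default_rng(3)
mu, sigma2, n, reps = 2.0, 4.0, 50, 100_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
S = (x - mu).sum(axis=1) / sigma2

print("E[S]   =", S.mean())   # ~ 0
print("Var[S] =", S.var())    # ~ n * I(mu) = 50/4 = 12.5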

ML: Score and Information Matrix – Example

• Again, we assume:
yᵢ = Xᵢβ + εᵢ,  εᵢ ~ N(0, σ²)   or   y = Xβ + ε,  ε ~ N(0, σ²I_T)

• Taking logs, we have the log likelihood function:
ln L = −(T/2) ln(2π) − (T/2) ln σ² − (1/(2σ²)) (y − Xβ)′(y − Xβ)

• The score function collects the first derivatives of ln L w.r.t. θ = (β, σ²):
∂ln L/∂β = (1/σ²) Σᵢ₌₁ᵀ xᵢ′εᵢ = (1/σ²) X′ε
∂ln L/∂σ² = −T/(2σ²) + (1/(2σ⁴)) ε′ε

ML: Score and Information Matrix – Example

• Then, we take second derivatives to calculate I(θ):

∂²ln L/∂β∂β′ = −(1/σ²) Σᵢ₌₁ᵀ xᵢ′xᵢ = −(1/σ²) X′X
∂²ln L/∂β∂σ² = −(1/σ⁴) Σᵢ₌₁ᵀ xᵢ′εᵢ
∂²ln L/∂σ²∂σ² = T/(2σ⁴) − (1/σ⁶) ε′ε

• Then, taking expectations (E[X′ε] = 0 and E[ε′ε] = Tσ²),

I(θ) = −E[∂²ln L/∂θ∂θ′] = [ (1/σ²) X′X      0
                             0               T/(2σ⁴) ]

ML: Score and Information Matrix

In deriving properties (1) and (2), we have made some implicit assumptions, which are called regularity conditions:

(i) θ lies in an open interval of the parameter space, Ω.
(ii) The 1st and 2nd derivatives of f(X; θ) w.r.t. θ exist.
(iii) L(X; θ) can be differentiated w.r.t. θ under the integral sign.
(iv) E[S(X; θ)²] > 0, for all θ in Ω.
(v) T(X) L(X; θ) can be differentiated w.r.t. θ under the integral sign.

Recall: If the S(xᵢ; θ) are i.i.d. and the regularity conditions apply, then we can apply the CLT to get:

S(X; θ) →a N(0, n I(θ))

ML: Cramer-Rao inequality

Theorem: Cramer-Rao inequality
Let the random sample (X₁, ..., Xₙ) be drawn from a pdf f(X|θ) and let T = T(X₁, ..., Xₙ) be a statistic such that E[T] = u(θ), differentiable in θ. Let b(θ) = u(θ) − θ be the bias in T. Assume regularity conditions. Then,

Var(T) ≥ [u′(θ)]²/(n I(θ)) = [1 + b′(θ)]²/(n I(θ))

Regularity conditions:
(1) θ lies in an open interval Ω of the real line.
(2) For all θ in Ω, ∂f(X|θ)/∂θ is well defined.
(3) ∫ L(X|θ) dx can be differentiated w.r.t. θ under the integral sign.
(4) E[S(X; θ)²] > 0, for all θ in Ω.
(5) ∫ T(X) L(X|θ) dx can be differentiated w.r.t. θ under the integral sign.

ML: Cramer-Rao inequality

The lower bound for Var(T) is called the Cramer-Rao (CR) lower bound.

Corollary: If T(X) is an unbiased estimator of θ, then

Var(T) ≥ (n I(θ))⁻¹

Note: This theorem establishes the superiority of the ML estimate over all others. The CR lower bound is the smallest theoretical variance. It can be shown that ML estimates (asymptotically) achieve this bound; therefore, any other estimation technique can at best only equal it.

ML: Cramer-Rao inequality

Proof: For any T(X) and S(X; θ) we have

[Cov(T, S)]² ≤ Var(T) Var(S)   (Cauchy-Schwarz inequality)

Since E[S] = 0, Cov(T, S) = E[TS].

Also, u(θ) = E[T] = ∫ T L(X; θ) dx. Differentiating both sides:

u′(θ) = ∫ T ∂L(X; θ)/∂θ dx = ∫ T [1/L ∂L(X; θ)/∂θ] L dx
      = ∫ T S L dx = E[TS] = Cov(T, S)

Substituting in the Cauchy-Schwarz inequality:

[u′(θ)]² ≤ Var(T) n I(θ)  =>  Var(T) ≥ [u′(θ)]²/[n I(θ)]  ■

ML: Cramer-Rao inequality

Note: For an estimator to achieve the CR lower bound, we need

[Cov(T, S)]² = Var(T) Var(S).

This is possible if T is a linear function of S. That is,

T(X) = α(θ) S(X; θ) + β(θ)

Since E[T] = α(θ) E[S(X; θ)] + β(θ) = β(θ), then

S(X; θ) = ∂ log L(X; θ)/∂θ = [T(X) − β(θ)]/α(θ).

Integrating both sides w.r.t. θ:

log L(X; θ) = U(X) − T(X) A(θ) + B(θ)

That is, L(X; θ) = exp{Σᵢ U(Xᵢ) − A(θ) Σᵢ T(Xᵢ) + n B(θ)}

Or, f(X; θ) = exp{U(X) − T(X) A(θ) + B(θ)}

ML: Cramer-Rao inequality

f(X; θ) = exp{U(X) − T(X) A(θ) + B(θ)}

That is, the exponential (Pitman-Koopman-Darmois) family of distributions attains the CR lower bound.

• Most of the distributions we have seen belong to this family: normal, exponential, gamma, chi-square, beta, Weibull (if the shape parameter is known), Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative binomial (with known parameter r), and geometric.

• Note: The Chapman–Robbins bound is a lower bound on the variance of estimators of θ. It generalizes the Cramér–Rao bound: it is tighter and can be applied in more situations, for example, when I(θ) does not exist. However, it is usually more difficult to compute.

Cramer-Rao inequality: Multivariate Case

• When we have k parameters, the covariance matrix of the estimator T(X), with E[T] = u(θ), has a CR lower bound given by:

Covar(T(X)) ≥ [∂u(θ)/∂θ′] I(θ)⁻¹ [∂u(θ)/∂θ′]′

Note: In matrix notation, the inequality A ≥ B means the matrix A − B is positive semidefinite. If T(X) is unbiased, then

Covar(T(X)) ≥ I(θ)⁻¹

C. R. Rao (1920, India) & Harald Cramer (1893-1985, Sweden)

Cramer-Rao inequality: Example

We want to check whether the sample mean X̄ and s² for an i.i.d. sample X = {X₁, X₂, ..., Xₙ} drawn from N(μ, σ²) achieve the CR lower bound. Recall:

I(θ) = −E[∂²ln L/∂θ∂θ′] = [ n/σ²     0
                             0        n/(2σ⁴) ]

Since the sample mean and s² are unbiased, the CR lower bound is given by:

Covar(T) ≥ I(θ)⁻¹

Then,

Var(X̄) ≥ σ²/n  and  Var(s²) ≥ 2σ⁴/n

We have already derived that Var(X̄) = σ²/n and Var(s²) = 2σ⁴/(n − 1). Then, the sample mean achieves its CR bound, but s² does not.
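A simulation check of these two bounds (illustrative only; μ, σ², n, and the number of replications are arbitrary):

import numpy as np

# Var(x_bar) attains the CR bound sigma^2/n, while
# Var(s^2) = 2 sigma^4/(n-1) sits above its bound 2 sigma^4/n.
rng = np.random.default_rng(5)
mu, sigma2, n, reps = 0.0, 1.0, 10, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
print("Var(x_bar):", x.mean(axis=1).var(), " CR bound:", sigma2 / n)
print("Var(s^2):  ", x.var(axis=1, ddof=1).var(), " CR bound:", 2 * sigma2**2 / n)
# Var(s^2) ~ 2/(n-1) = 0.2222, above the bound 2/n = 0.2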

Concentrated ML

• We split the parameter vector θ into two vectors:

L(θ) = L(θ₁, θ₂)

Sometimes, we can derive a formula for the ML estimate of θ₂, say:

θ₂ = g(θ₁)

If this is possible, we can write the likelihood function as

L(θ₁, θ₂) = L(θ₁, g(θ₁)) = L*(θ₁)

This is the concentrated likelihood function.
• This process is often useful as it reduces the number of parameters to be estimated.

Concentrated ML: Example

• The normal log likelihood function can be written as:

ln L(μ, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σᵢ₌₁ⁿ (Xᵢ − μ)²

• This expression can be solved for the optimal choice of σ² by differentiating with respect to σ²:

∂ln L(μ, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σᵢ₌₁ⁿ (Xᵢ − μ)² = 0
=> σ²_MLE = (1/n) Σᵢ₌₁ⁿ (Xᵢ − μ)²

Concentrated ML: Example

• Substituting this result into the original log likelihood produces:

ln L*(μ) = −(n/2) ln(2π) − (n/2) ln[(1/n) Σᵢ₌₁ⁿ (Xᵢ − μ)²] − n/2

• Intuitively, the ML estimator of μ is the value that minimizes the sum of squared deviations, Σᵢ (Xᵢ − μ)². Thus, the least squares estimate of the mean of a normal distribution is the same as the ML estimator, under the assumption that the sample is i.i.d.

Properties of ML Estimators

(1) Efficiency. Under general conditions, we have that

Var(θ̂_MLE) ≥ (n I(θ))⁻¹

The right-hand side is the Cramer-Rao lower bound (CR-LB). If an estimator can achieve this bound, ML will produce it.

(2) Consistency. We know that E[S(Xᵢ; θ)] = 0 and Var[S(Xᵢ; θ)] = I(θ). The consistency of ML can be shown by applying Khinchine's LLN to S(Xᵢ; θ) and then to Sₙ(X; θ) = Σᵢ S(Xᵢ; θ). Then, do a 1st-order Taylor expansion of Sₙ(X; θ) around θ̂_MLE:

Sₙ(X; θ) = Sₙ(X; θ̂_MLE) + Sₙ′(X; θ*)(θ − θ̂_MLE),  with |θ* − θ̂_MLE| ≤ |θ − θ̂_MLE| < ε

Since Sₙ(X; θ̂_MLE) = 0 (the f.o.c.), Sₙ(X; θ) = Sₙ′(X; θ*)(θ − θ̂_MLE), so Sₙ(X; θ)/n and (θ̂_MLE − θ) converge together to zero (i.e., in expectation).

Properties of ML Estimators

(3) Theorem: Asymptotic Normality
Let the likelihood function be L(X₁, X₂, ..., Xₙ|θ). Under general conditions, the MLE of θ is asymptotically distributed as

θ̂_MLE →a N(θ, (n I(θ))⁻¹)

Sketch of a proof. Using the CLT, we've already established Sₙ(X; θ) →a N(0, n I(θ)). Then, using a first-order Taylor expansion as before, we get

n^(−1/2) Sₙ(X; θ) = −[n⁻¹ Sₙ′(X; θₙ*)] n^(1/2) (θ̂_MLE − θ)

Notice that E[S′(xᵢ; θ)] = −I(θ). Then, apply the LLN to get Sₙ′(X; θₙ*)/n →p −I(θ) (using θₙ* →p θ). Now, algebra and Slutsky's theorem for RVs get the final result.

Properties of ML Estimators

(4) Sufficiency. If a single sufficient statistic exists for θ, the MLE of θ must be a function of it. That is, θ̂_MLE depends on the sample observations only through the value of a sufficient statistic.

(5) Invariance. The ML estimate is invariant under functional transformations. That is, if θ̂_MLE is the MLE of θ and g(θ) is a function of θ, then g(θ̂_MLE) is the MLE of g(θ).

Quasi Maximum Likelihood

• ML rests on the assumption that the errors follow a particular distribution (OLS is only ML if the errors are normal).

• Q: What happens if we make the wrong assumption? White (Econometrica, 1982) shows that, under broad assumptions about the misspecification of the error process, θ̂_MLE is still a consistent estimator. The estimation is called Quasi ML.

• But the covariance matrix is no longer I(θ)⁻¹; instead it is given by

Var[θ̂] = I(θ̂)⁻¹ [S(θ̂) S(θ̂)′] I(θ̂)⁻¹

• In general, Wald and LM tests are valid by using this corrected covariance matrix. But LR tests are invalid, since they work directly from the value of the likelihood function.

ML Estimation: Numerical Optimization

• In simple cases like OLS, we can calculate the ML estimates from the f.o.c.'s, i.e., analytically. But in most situations we cannot.

• We resort to numerical optimisation of the likelihood function.

• Think of hill climbing in parameter space. There are many algorithms to do this.

• General steps (a small sketch in code follows the figure below):
(1) Set an arbitrary initial set of parameters, i.e., starting values.
(2) Determine a direction of movement (for example, by dL/dθ).
(3) Determine a step length to move (for example, by d²L/dθ²).
(4) Check convergence criteria and either stop or go back to (2).

ML Estimation: Numerical Optimization

[Figure: the likelihood L plotted against the parameter β; hill-climbing iterates β₁, β₂ approach the maximizer β*, where the likelihood reaches its upper value Lᵤ]
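A minimal Newton-Raphson sketch of steps (1)-(4) above, applied to the one-parameter log likelihood of Example I (Normal(μ, 1) data); real problems use more robust algorithms and line searches:

import numpy as np

x = np.array([5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

def score(mu):        # dlnL/dmu = sum(x_i - mu)
    return np.sum(x - mu)

def hessian(mu):      # d2lnL/dmu2 = -n
    return -len(x)

mu = 0.0                                   # (1) arbitrary starting value
for it in range(50):
    step = -score(mu) / hessian(mu)        # (2)-(3) direction and step length
    mu += step
    if abs(step) < 1e-10:                  # (4) convergence check
        break
print("iterations:", it + 1, " mu_hat =", mu)   # converges to 7.5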

Method of Moments (MM) Estimation

• Simple idea: Suppose the first moment (the mean) is generated by the distribution f(X, θ). The observed first moment from a sample of n observations is

m₁ = (1/n) Σᵢ₌₁ⁿ xᵢ

Hence, we can retrieve the parameter θ by inverting the moment function implied by the distribution f(X, θ):

m₁ = E[x|θ] = f(θ)  =>  θ_MM = f⁻¹(m₁)

• Example: Mean of a Poisson pdf: f(x) = exp(−λ) λˣ/x!
E[X] = λ  =>  plim (1/N) Σᵢ xᵢ = λ. Then, the MM estimator of λ is the sample mean of X  =>  λ_MM = x̄

Method of Moments (MM) Estimation

• Let's complicate the MM idea: Now, suppose we have a model. This model implies certain knowledge about the moments of the distribution. Then, we invert the model to give us estimates of the unknown parameters of the model, which match the theoretical moments for a given sample.

• Example: Mean of an Exponential pdf: f(x, λ) = λ e^(−λx)
E[X] = 1/λ  =>  plim (1/N) Σᵢ xᵢ = 1/λ. Then, λ_MM = 1/x̄.
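A short sketch of both MM examples on simulated data (λ = 2 and the seed are arbitrary):

import numpy as np

rng = np.random.default_rng(11)
lam, n = 2.0, 10_000

# Poisson: lambda_MM = sample mean
x_pois = rng.poisson(lam, size=n)
print("Poisson lambda_MM     =", x_pois.mean())          # ~ 2

# Exponential: E[X] = 1/lambda, so lambda_MM = 1/x_bar
x_exp = rng.exponential(scale=1.0 / lam, size=n)
print("Exponential lambda_MM =", 1.0 / x_exp.mean())     # ~ 2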

MM Estimation

• We have a model Y = h(X, θ), where θ contains k parameters. Under this model, we know what some moments of the distribution should be. That is, the model provides us with k conditions (or moments), which should be met:

E[g(Y, X | θ)] = 0

• In this case, the (population) first moment of g(Y, X, θ) equals 0. Then, we approximate the k moments, i.e., E(g), with a sample measure and invert g to get an estimate of θ:

θ̂_MM = g⁻¹(Y, X, 0)

θ̂_MM is the Method of Moments estimator of θ.
Note: In this example we have as many moments (k) as unknown parameters (k). Thus, θ is uniquely and exactly determined.

MM Estimation: Example

We start with a model Y = Xβ + ε. In OLS estimation, we make the assumption that the X's are orthogonal to the errors. Thus,

E(X′ε) = 0

The sample moment analogue for each xᵢ is

(1/n) Σₜ₌₁ⁿ xᵢₜ eₜ = 0,  or  (1/n) X′e = 0.

And, thus,

(1/n) X′e = (1/n) X′(Y − Xβ_MM) = 0  =>  X′Y = X′X β_MM

Therefore, the method of moments estimator, β_MM, solves the normal equations. That is, β_MM will be identical to the OLS estimator, b.

Generalized Method of Moments (GMM)

• So far, we have assumed that there are as many moments (l) as unknown parameters (k). The parameters are uniquely and exactly determined.

• If l < k, i.e., fewer moment conditions than parameters, we would not be able to solve them for a unique set of parameters (the model would be underidentified).

• If l > k, i.e., more moment conditions than parameters, then all the conditions cannot be met at the same time; the model is overidentified and we have GMM estimation.

If we cannot satisfy all the conditions at the same time, we want to make them all as close to zero as possible at the same time. We have to figure out a way to weight them.

Generalized Method of Moments (GMM)

• Now, we have k parameters but l moment conditions, l > k. Thus,

E[mⱼ(θ)] = 0,  j = 1, ..., l   (l population moments)
m̄ⱼ(θ) = (1/n) Σₜ mⱼₜ(θ) = 0,  j = 1, ..., l   (l sample moments)

• Then, we need to make all l moments as small as possible, simultaneously. Let's use a weighted least squares criterion:

Min_θ q(θ) = m̄(θ)′ W m̄(θ)

That is, the weighted squared sum of the moments. The weighting matrix is the l×l matrix W. (Note that we have a quadratic form.)
• First order condition:

∂q(θ)/∂θ = 2 [∂m̄(θ_GMM)/∂θ′]′ W m̄(θ_GMM) = 0
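A sketch of this criterion with l = 2 moment conditions and k = 1 parameter, using exponential data (E[X] = 1/λ and E[X²] = 2/λ²); W is set to the identity here for simplicity, rather than the optimal weighting matrix discussed next:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
lam_true, n = 2.0, 5_000
x = rng.exponential(scale=1.0 / lam_true, size=n)

def mbar(lam):
    # sample averages of the two moment conditions
    return np.array([x.mean() - 1.0 / lam,
                     (x ** 2).mean() - 2.0 / lam ** 2])

W = np.eye(2)
q = lambda lam: mbar(lam[0]) @ W @ mbar(lam[0])  # quadratic form m'Wm
res = minimize(q, x0=[1.0], method="Nelder-Mead")
print("lambda_GMM =", res.x[0])  # ~ 2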

Generalized Method of Moments (GMM)

• The GMM estimator, θ_GMM, solves this k×1 system of equations. There is typically no closed-form solution for θ_GMM. It must be obtained through numerical optimization methods.

• If plim m̄(θ) = 0, and W (not a function of θ) is a positive definite matrix, then θ_GMM is a consistent estimator of θ.

• The optimal W
Any such weighting matrix produces a consistent estimator of θ. We can select the most efficient one, i.e., the optimal W. The optimal W is the inverse of the covariance matrix of the moment conditions. Thus,

Optimal W = W* = {Asy. Var[m̄(θ)]}⁻¹

Properties of the GMM estimator

(1) Consistency
If plim m̄(θ) = 0, and W (not a function of θ) is a pd matrix, then under some conditions, θ_GMM →p θ.

(2) Asymptotic Normality
Under some general conditions, θ_GMM →a N(θ, V_GMM), where

V_GMM = (1/n)[G′V⁻¹G]⁻¹,

G is the matrix of derivatives of the moments with respect to the parameters, and V = Var[n^(1/2) m̄(θ)].

Lars Peter Hansen (1952)

Bayesian Estimation: Bayes' Theorem

• Recall Bayes' Theorem:

P(θ|X) = P(X|θ) P(θ) / P(X)

- P(θ): Prior probability about parameter θ.
- P(X|θ): Probability of observing the data, X, conditioning on θ. This conditional probability is called the likelihood, i.e., the probability that event X will be the outcome of the experiment, which depends on θ.
- P(θ|X): Posterior probability, i.e., the probability assigned to θ after X is observed.
- P(X): Marginal probability of X. This is the prior probability of witnessing the data X under all possible scenarios for θ, and it depends on the prior probabilities given to each θ.

Bayesian Estimation: Bayes' Theorem

• Example: Courtroom – Guilty vs. Non-guilty
G: Event that the defendant is guilty.
E: Event that the defendant's DNA matches DNA found at the crime scene.
The jurors, after initial questions, form a personal belief about the defendant's guilt. This initial belief is the prior.
The jurors, after seeing the DNA evidence (event E), will update their prior beliefs. This update is the posterior.

Bayesian Estimation: Bayes' Theorem

• Example: Courtroom – Guilty vs. Non-guilty
- P(G): Juror's personal estimate of the probability that the defendant is guilty, based on evidence other than the DNA match. (Say, .30.)
- P(E|G): Probability of seeing event E if the defendant is actually guilty. (In our case, it should be near 1.)
- P(E): E can happen in two ways: the defendant is guilty and the DNA match is correct, or the defendant is not guilty and the DNA match is incorrect (a one-in-a-million chance).
- P(G|E): Probability that the defendant is guilty given a DNA match:

P(G|E) = P(E|G) P(G) / P(E) = (1 × .3) / (1 × .3 + 10⁻⁶ × .7) ≈ .999998

Bayesian Estimation: Viewpoints

• Implicitly, in our previous discussions about estimation (MLE), we adopted a classical viewpoint.
– We had some process generating random observations.
– This random process was a function of fixed, but unknown, parameters.
– Then, we designed procedures to estimate these unknown parameters based on observed data.

• For example, we assume a random process such as CEO compensation. This CEO compensation process can be characterized by a normal distribution.
– We can estimate the parameters of this distribution using maximum likelihood.

Bayesian Estimation: Viewpoints

– The likelihood of a particular sample can be expressed as

L(X₁, X₂, ..., Xₙ | μ, σ²) = (2πσ²)^(−n/2) exp[−(1/(2σ²)) Σᵢ₌₁ⁿ (Xᵢ − μ)²]

– Our estimates of μ and σ² are then based on the value of each parameter that maximizes the likelihood of drawing that sample.

Bayesian Estimation: Viewpoints

• Turning the classical process around slightly, a Bayesian viewpoint starts with some kind of probability statement about the parameters (a prior). Then, the data, X, are used to update our prior beliefs (a posterior).
– First, assume that our prior beliefs about the distribution function can be expressed as a probability density function π(θ), where θ is the parameter we are interested in estimating.
– Based on a sample (the likelihood function, L(X, θ)) we can update our knowledge of the distribution using Bayes' theorem:

π(θ|X) = L(X, θ) π(θ) / P(X) = L(X, θ) π(θ) / ∫ L(X, θ) π(θ) dθ

Thomas Bayes (1702 – April 17, 1761)

Bayesian Estimation: Example

• Assume the data come from a Bernoulli distribution with parameter P. Our prior is that P is distributed Β(α, β):

π(P) = f(P; α, β) = P^(α−1) (1 − P)^(β−1) / B(α, β)

where

B(α, β) = ∫₀¹ x^(α−1) (1 − x)^(β−1) dx = Γ(α)Γ(β)/Γ(α+β)

so that

π(P) = [Γ(α+β)/(Γ(α)Γ(β))] P^(α−1) (1 − P)^(β−1)

Bayesian Estimation: Example

• Assume that we are interested in forming the posterior distribution after a single draw, X:

π(P|X) = P^X (1 − P)^(1−X) π(P) / ∫₀¹ P^X (1 − P)^(1−X) π(P) dP
       = P^(α+X−1) (1 − P)^(β−X) / ∫₀¹ P^(α+X−1) (1 − P)^(β−X) dP

(the Γ(α+β)/(Γ(α)Γ(β)) constants cancel in the ratio).

Bayesian Estimation: Example

• Following the original specification of the beta function,

∫₀¹ P^(α+X−1) (1 − P)^(β−X) dP = ∫₀¹ P^(α*−1) (1 − P)^(β*−1) dP = Γ(α*)Γ(β*)/Γ(α* + β*) = Γ(α+X)Γ(β−X+1)/Γ(α+β+1)

where α* = α + X and β* = β − X + 1.

• The posterior distribution, the distribution of P after the observation, is then

π(P|X) = [Γ(α+β+1) / (Γ(α+X)Γ(β−X+1))] P^(α+X−1) (1 − P)^(β−X)

Bayesian Estimation: Example

• The Bayesian estimate of P is then the value that minimizes a loss function. Several loss functions can be used, but we will focus on the quadratic loss function, consistent with mean squared error:

min_P̂ E[(P̂ − P)²]
∂E[(P̂ − P)²]/∂P̂ = 2 E[P̂ − P] = 0  =>  P̂ = E[P]

• Taking the expectation of the posterior distribution yields

E[P] = [Γ(α+β+1)/(Γ(α+X)Γ(β−X+1))] ∫₀¹ P · P^(α+X−1) (1 − P)^(β−X) dP
     = [Γ(α+β+1)/(Γ(α+X)Γ(β−X+1))] ∫₀¹ P^(α+X) (1 − P)^(β−X) dP

Bayesian Estimation: Example

• As before, we solve the integral by creating α* = α + X + 1 and β* = β − X + 1. The integral then becomes

∫₀¹ P^(α*−1) (1 − P)^(β*−1) dP = Γ(α*)Γ(β*)/Γ(α* + β*) = Γ(α+X+1)Γ(β−X+1)/Γ(α+β+2)

so that

E[P] = [Γ(α+β+1) Γ(α+X+1) Γ(β−X+1)] / [Γ(α+X) Γ(β−X+1) Γ(α+β+2)]

• This can be simplified using the fact Γ(α+1) = αΓ(α):

E[P] = [(α+X) Γ(α+X) Γ(α+β+1)] / [Γ(α+X) (α+β+1) Γ(α+β+1)] = (α + X)/(α + β + 1)

Bayesian Estimation: Example

• To make this estimation process operational, assume that we have a prior distribution with parameters α = β = 1.4968, which yields a beta distribution with a mean P of 0.5 and a variance of the estimate of 0.0625.

• Extending the results to n Bernoulli trials yields

π(P|X) = [Γ(α+β+n) / (Γ(α+Y)Γ(β+n−Y))] P^(α+Y−1) (1 − P)^(β+n−Y−1)

where Y is the sum of the individual X's, or the number of heads in the sample. The estimated value of P then becomes:

P̂ = (α + Y)/(α + β + n)

Bayesian Estimation: Example

• Suppose in the first sample Y = 15 and n = 50. This yields an estimated value of P of 0.31129. This value compares with the maximum likelihood estimate of 0.3000. Since the maximum likelihood estimator in this case is unbiased, the results imply that the Bayesian estimator is biased.
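A numerical check of these numbers (a sketch using scipy's beta distribution; the credible interval at the end is an added illustration, not part of the slides):

import numpy as np
from scipy.stats import beta

# Beta(1.4968, 1.4968) prior, Y = 15 heads in n = 50 Bernoulli trials.
a = b = 1.4968
Y, n = 15, 50

p_bayes = (a + Y) / (a + b + n)    # posterior-mean estimator
p_mle = Y / n
print("Bayes:", round(p_bayes, 5), " MLE:", p_mle)   # ~0.31129 vs 0.3

# The full posterior is Beta(a + Y, b + n - Y); e.g., a 95% credible interval:
post = beta(a + Y, b + n - Y)
print("95% credible interval:", post.ppf([0.025, 0.975]))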