
STAT 566 Fall 2013 Statistical Inference

Lecture Notes

Junfeng Wen
Department of Computing Science, University of Alberta
[email protected]

Last update: January 6, 2016

Contents

1 Lecture 1: Introduction
   1.1 Incomplete Repertory of Tasks
   1.2 Decision-theoretic Framework

2 Lecture 2: Evaluation of Statistical Procedures I
   2.1 How to compare δ?
   2.2 Comparing risk function I: Bayes risk
   2.3 Bayes theorem
   2.4 Bayes risk revisited

3 Lecture 3: Location Estimation, Bayes Rules for Parametric Models
   3.1 Prerequisites & Bayes risk revisited
   3.2 Estimation in parametric models

4 Lecture 4: Evaluation of Statistical Procedures II
   4.1 Comparing risk function II: minimax
   4.2 Connection between minimax and Bayes
   4.3 Admissibility
   4.4 Unbiasedness

5 Lecture 5: Building Statistical Procedure I
   5.1 Sufficient statistics
   5.2 Complete statistics
   5.3 Cramér-Rao bound

6 Lecture 6: Building Statistical Procedure II
   6.1 Substitution principle
   6.2 Consistency
   6.3 Asymptotic normality
   6.4 Maximum likelihood estimate

7 Lecture 7: Estimating the precision of estimates
   7.1 Bootstrap
   7.2 Delta method

8 Lecture 8: Confidence interval
   8.1 Bayesian confidence/probability intervals
   8.2 General confidence intervals

9 Lecture 9: Hypothesis testing
   9.1 Setup
   9.2 Testing evaluation
   9.3 p-value

10 Lecture 10: Multiple testing
   10.1 Union-intersection test
   10.2 Intersection-union test
   10.3 Controlling family-wise error rate
   10.4 Controlling false discovery rate

11 Lecture 11: Hypothesis testing, practical procedures
   11.1 Wald test
   11.2 Likelihood ratio test
   11.3 Rao score test via Lagrange multipliers
   11.4 Bayes factor

1 Lecture 1: Introduction

1.1 Incomplete Repertory of Tasks

• Estimation. E.g. estimating someone’s weight.

• Testing. E.g. testing whether a treatment will work.

• Classification. E.g. email spam filter.

• Ranking...

1.2 Decision-theoretic Framework

• Data $x \in \mathcal{X}$, an outcome of a random element $X$, a point in the sample space $\mathcal{X}$.

• Action space $\mathcal{A}$, the space of decisions.

– For classification, $\mathcal{A}$ is finite with at least two elements.

– For testing, two possible elements: accept/reject.

• Decision rule $\delta$ (procedure), any (possibly randomized) function $\delta: \mathcal{X} \mapsto \mathcal{A}$.

• Model $P$, from which $X$ is drawn, an element of some collection of distributions $\mathcal{P}$.

– Parametric model $\mathcal{P} = \{P_\theta\}$, with $\theta$ in some space $\Theta$ (say $\mathbb{R}^n$).

• Loss function $l(\delta(x), P)$, the loss incurred when action $a = \delta(x)$ is chosen and $X$ is from $P$. Usually $l \ge 0$.

2 Lecture 2: Evaluation of Statistical Procedures I

2.1 How to compare δ?

• If $a = \delta(x)$ is randomized, then first average the loss over all possible $a$:
$$\bar{l}(\delta(x), P) = E_a[l(\delta(x), P)].$$

• Compare based on risk:
$$r_\delta(P) = E_{X \sim P}[l(\delta(X), P)] = \int l(\delta(x), P)\, dP(x).$$

– If $\delta$ is randomized, then replace $l$ by $\bar{l}$ first.

– It depends on $P$.

Estimation of the mean of normal.
$X \sim \mathcal{N}(\theta, 1)$, to estimate $\theta$ with quadratic loss $l(\hat\theta, \theta) = (\hat\theta - \theta)^2$. Given an observation $X$, consider two estimators:
$$\hat\delta(X) = \hat\theta(X) = X; \qquad \bar\delta(X) = \bar\theta(X) = 0.$$
Their respective risks are given by
$$r_{\hat\theta}(P) = E[(X - \theta)^2] = \mathrm{Var}(X) = 1, \qquad r_{\bar\theta}(P) = E[(0 - \theta)^2] = E(\theta^2) = \theta^2.$$
Therefore, neither $\hat\theta$ nor $\bar\theta$ is dominant, because the risks depend on $\theta$, that is, on $P$, the distribution of $X$.
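These two risk functions are easy to check by simulation. Below is a minimal sketch (assuming Python with NumPy; the helper names are mine, not from the notes) that approximates $r_\delta(P)$ by Monte Carlo for a few values of $\theta$.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_risk(estimator, theta, n_rep=100_000):
    """Monte Carlo estimate of E[(estimator(X) - theta)^2] for X ~ N(theta, 1)."""
    x = rng.normal(loc=theta, size=n_rep)
    return np.mean((estimator(x) - theta) ** 2)

delta_identity = lambda x: x    # delta(X) = X, risk is 1 for every theta
delta_zero     = lambda x: 0.0  # delta(X) = 0, risk is theta^2

for theta in [0.0, 0.5, 2.0]:
    print(theta, mc_risk(delta_identity, theta), mc_risk(delta_zero, theta))
```

For $|\theta| < 1$ the constant rule has the smaller estimated risk, for $|\theta| > 1$ the identity rule does, matching the conclusion that neither dominates.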

2.2 Comparing risk function I: Bayes risk

• Prior distribution $\Pi$ of $P$ over the distribution space $\mathcal{P}$.

– In parametric cases, a prior $\Pi$ of $\theta$ over its space $\Theta$.

– Bayes inference: given $X$, we can update our belief on $P$:
$$\Pr(P|X) = \frac{\Pr(X|P)\Pr(P)}{\Pr(X)} \propto \Pr(X|P)\,\Pi(P).$$

• The Bayes risk is defined by
$$R^\Pi_\delta = E_{P \sim \Pi}[r_\delta(P)] = \int r_\delta(P)\, d\Pi(P).$$

• It only depends on the decision rule $\delta$ and the prior $\Pi$.

2.3 Bayes theorem

• Suppose that $f_{U,V}(u, v)$ is a joint density of random elements $U$ and $V$. The (marginal) density of $V$ is
$$f_V(v) = \int f_{U,V}(u, v)\, du.$$
The conditional density of $U$ given $V$ is
$$f_{U|V}(u|v) = \frac{f_{U,V}(u, v)}{f_V(v)} = \frac{f_{U,V}(u, v)}{\int f_{U,V}(u, v)\, du}.$$
The Bayes theorem states
$$f_{V|U}(v|u) = \frac{f_{U,V}(u, v)}{f_U(u)} = \frac{f_{U,V}(u, v)}{\int f_{U,V}(u, v)\, dv} = \frac{f_{U|V}(u|v)\, f_V(v)}{\int f_{U|V}(u|v)\, f_V(v)\, dv}.$$

2.4 Bayes risk revisited

• Let the posterior risk be
$$R^\Pi(\delta(X)|X) = E_{P \sim \Pi}[l(\delta(X), P)\,|\,X].$$

• The Bayes risk can be computed via the posterior distribution:
$$R^\Pi_\delta = E_{P \sim \Pi}[r_\delta(P)] = \int r_\delta(P)\, d\Pi(P) = \int r_\delta(p)\, f_P(p)\, dp$$
$$= E_{P \sim \Pi}\big[E_{X \sim P}[l(\delta(X), P)]\big] = \int \left(\int l(\delta(x), P)\, f_{X|P}(x|p)\, dx\right) f_P(p)\, dp$$
$$= E_{(X,P)}[l(\delta(X), P)] = \int\!\!\int l(\delta(x), P)\, f_{X,P}(x, p)\, dx\, dp$$
$$= E_X\big[E_{P \sim \Pi}[l(\delta(X), P)\,|\,X]\big] = \int \left(\int l(\delta(x), P)\, f_{P|X}(p|x)\, dp\right) f_X(x)\, dx$$
$$= E_X[R^\Pi(\delta(X)|X)] = \int R^\Pi(\delta(x)|x)\, f_X(x)\, dx.$$
This is favourable when the posterior distribution $f_{P|X}(p|x)$ is easily accessible.

3 Lecture 3: Location Estimation, Bayes Rules for Parametric Models

3.1 Prerequisites & Bayes risk revisited

• Assume that $X \sim Q$ and we are interested in estimating some characteristic quantity of the distribution $Q$, say $\theta(Q)$, where $\theta(\cdot)$ is a functional.

• Some characteristic quantities of the distribution $Q$:

– Mean: $\theta(Q) = \int x\, dQ(x)$. Does not always exist (e.g. Cauchy).

– Median: $\theta(Q)$ satisfies
$$\Pr(X \le \theta(Q)) \ge \tfrac{1}{2}, \qquad \Pr(X \ge \theta(Q)) \ge \tfrac{1}{2}.$$

– Quantile: for $\tau \in (0,1)$, $\theta_\tau(Q)$ satisfies
$$\Pr(X \le \theta_\tau(Q)) \ge \tau, \qquad \Pr(X \ge \theta_\tau(Q)) \ge 1 - \tau.$$
When $\tau = \tfrac{1}{2}$, it is the median.

• Evaluation of estimation quality: loss

– Quadratic loss: $l^{(2)}(a, Q) = (a - \theta(Q))^2$, then $r^{(2)}_\delta(Q) = E_{X \sim Q}[(\delta(X) - \theta(Q))^2]$.

– Absolute loss: $l^{(1)}(a, Q) = |a - \theta(Q)|$, then $r^{(1)}_\delta(Q) = E_{X \sim Q}[|\delta(X) - \theta(Q)|]$.

– 0-1 loss: $l^{(0)}(a, Q) = I(a \ne \theta(Q))$, then
$$r^{(0)}_\delta(Q) = E_{X \sim Q}[I(\delta(X) \ne \theta(Q))] = \Pr(\delta(X) \ne \theta(Q)) \cdot 1 + \Pr(\delta(X) = \theta(Q)) \cdot 0 = \Pr(\delta(X) \ne \theta(Q)).$$
The second equality holds because $X$ falls into one of two cases: those with $\delta(X) \ne \theta(Q)$ and those with $\delta(X) = \theta(Q)$.

• Bayes approach

– Get the posterior distribution given $X$.

∗ Know the posterior distribution of $Q$, then compute $\theta(Q)$.

∗ Or know the posterior distribution of $\theta$ directly.

– Look at the loss function in question and determine its corresponding characteristic value.

∗ For $l^{(2)}$, the solution is the mean of the posterior distribution.

∗ For $l^{(1)}$, the solution is the median of the posterior distribution.

∗ For $l^{(0)}$, the solution is the mode of the posterior distribution.

3.2 Estimation in parametric models

• Assume we have $n$ independent variables $X_i$ ($i = 1, \dots, n$) jointly from distribution $P$, which is determined by $Q$, the common distribution of each individual $X_i$. Further assume that $\theta(Q)$ is a one-to-one map.

• To estimate the quantity $\theta(P)$ given data, we need its posterior distribution. (No loss function for the moment.)

Posterior of normal.
$X_i \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known, to estimate $\mu$ given $X_i = x_i$. Assume a normal prior for $\mu$: $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$. That is,
$$\pi(\mu) = \frac{1}{\sigma_0\sqrt{2\pi}} \exp\left(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right), \qquad q(x|\mu) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$
The posterior of $\mu$ given data $x_1, \dots, x_n$ is
$$f(\mu) \propto \prod_{i=1}^n q(x_i|\mu)\,\pi(\mu) = \exp\left(-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2} - \frac{(\mu - \mu_0)^2}{2\sigma_0^2}\right) \propto \exp\left(-\frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 + \left(\frac{n\bar{x}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\mu\right).$$
That is,
$$\mu \,\big|\, x_1, \dots, x_n \;\sim\; \mathcal{N}\left(\frac{\frac{\bar{x}}{\sigma^2} + \frac{\mu_0}{n\sigma_0^2}}{\frac{1}{\sigma^2} + \frac{1}{n\sigma_0^2}},\; \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}}\right).$$
In general, if a random variable $X$ has a density of the form $K\exp(ax^2 + bx + c)$, then

– $a < 0$, otherwise the integral will not converge to 1;

– the density can be expressed as
$$K\exp(ax^2 + bx + c) = K\exp\left(c - \frac{b^2}{4a}\right) \exp\left[a\left(x - \left(-\frac{b}{2a}\right)\right)^2\right],$$
which is the density of a normal distribution with $\mu = -\frac{b}{2a}$ and $\sigma^2 = -\frac{1}{2a}$;

– $c$ is free and $K = K(a, b, c)$ is a normalizing positive constant.
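The conjugate update above is simple enough to code directly. A minimal sketch, assuming Python with NumPy (the function name `normal_posterior` is my own, not from the notes); it returns the posterior mean and variance of $\mu$ written in terms of the posterior precision $n/\sigma^2 + 1/\sigma_0^2$.

```python
import numpy as np

def normal_posterior(x, sigma2, mu0, sigma02):
    """Posterior of mu for x_i ~ N(mu, sigma2) with prior mu ~ N(mu0, sigma02).

    Returns (posterior mean, posterior variance), using
    precision = n/sigma2 + 1/sigma02 and
    mean = (n*xbar/sigma2 + mu0/sigma02) / precision.
    """
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    precision = n / sigma2 + 1.0 / sigma02
    mean = (n * xbar / sigma2 + mu0 / sigma02) / precision
    return mean, 1.0 / precision

# Example: data around mu = 2 with a vague prior; the posterior mean is close to xbar.
rng = np.random.default_rng(1)
print(normal_posterior(rng.normal(2.0, 1.0, size=50), 1.0, 0.0, 100.0))
```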

4 Lecture 4: Evaluation of Statistical Procedures II

4.1 Comparing risk function II: minimax

• The minimax risk ("worst case") is
$$\bar{R}_\delta = \sup_{P \in \mathcal{P}} r_\delta(P).$$

• The minimax rule is the rule that minimizes the minimax risk.

4.2 Connection between minimax and Bayes

• Suppose $\delta_\Pi$ is the Bayes rule for some prior $\Pi$, i.e., $R^\Pi_{\delta_\Pi} = \inf_\delta R^\Pi_\delta$, and suppose that for all $P$, $r_{\delta_\Pi}(P) \le R^\Pi_{\delta_\Pi}$. Then $\delta_\Pi$ is minimax (and $\Pi$ is called a least favourable prior).

– Proof: If $\delta_\Pi$ were not minimax, then there would exist $\delta$ such that
$$\sup_P r_\delta(P) < \sup_P r_{\delta_\Pi}(P) \le R^\Pi_{\delta_\Pi}.$$
As the average never exceeds the supremum, and the average of a constant is that constant, we would have a contradiction with the assumptions:
$$R^\Pi_\delta = E_{P \sim \Pi}[r_\delta(P)] \le \sup_P r_\delta(P) < \sup_P r_{\delta_\Pi}(P) \le R^\Pi_{\delta_\Pi}.$$

• If $\delta$ is the Bayes rule with respect to some prior $\Pi$, and it has constant risk, $r_\delta(P) = c$ for all $P$, then $\delta$ is minimax.

– In fact, $r_{\delta_\Pi}(P) = R^\Pi_{\delta_\Pi} = c$ in such cases.

4.3 Admissibility

• $\delta$ is admissible if there is no $\tilde\delta$ such that $r_{\tilde\delta}(P) \le r_\delta(P)$ for all $P$, with strict inequality $<$ for at least one $P$.

• Connection to the Bayes rule: if $\delta_\Pi$ is the unique Bayes rule with respect to a prior $\Pi$, then $\delta_\Pi$ is admissible.

– Proof: If not, there exists $\delta$ such that $r_\delta(P) \le r_{\delta_\Pi}(P)$ for all $P$, with strict inequality for some $P$, which implies
$$R^\Pi_\delta = E_{P \sim \Pi}[r_\delta(P)] \le E_{P \sim \Pi}[r_{\delta_\Pi}(P)] = R^\Pi_{\delta_\Pi}.$$
The middle inequality is not necessarily strict, despite the strict inequality for at least one $P$ (a difference at a single point may not influence the integral; but if $\mathcal{P}$ is discrete, or the risks are continuous, strict inequality will hold). Either way, $R^\Pi_\delta \le R^\Pi_{\delta_\Pi}$ means $\delta$ is another Bayes rule for $\Pi$, a contradiction with uniqueness.

• Connection to the minimax rule: if $\delta$ has constant risk and is admissible, then it is minimax.

– Proof: Let $\delta_c$ be an admissible rule with constant risk $c$, i.e., $r_{\delta_c}(P) = c$ for all $P \in \mathcal{P}$. Because $\delta_c$ is admissible, for any other rule $\delta$ there exists $P_0 \in \mathcal{P}$ such that
$$r_{\delta_c}(P_0) \le r_\delta(P_0).$$
Now we prove the claim by contradiction. Assume that $\delta_c$ is not minimax; then there exists $\delta$ such that
$$\sup_{P \in \mathcal{P}} r_\delta(P) < \sup_{P \in \mathcal{P}} r_{\delta_c}(P).$$
Since the supremum is larger than or equal to the risk at any specific $P$, we have
$$r_\delta(P_0) \le \sup_{P \in \mathcal{P}} r_\delta(P).$$
Combining these,
$$c = r_{\delta_c}(P_0) \le r_\delta(P_0) \le \sup_{P \in \mathcal{P}} r_\delta(P) < \sup_{P \in \mathcal{P}} r_{\delta_c}(P) = c,$$
a contradiction. Therefore, the assumption is false, and $\delta_c$ is minimax.

James-Stein estimator.

$X_i \sim \mathcal{N}(\theta_i, 1)$, $i = 1, \dots, p$, to estimate $\theta = (\theta_1, \dots, \theta_p)$ with quadratic loss $l(\hat\theta, \theta) = \|\hat\theta - \theta\|^2$. The natural estimator is $\hat\theta_i = X_i$. It is admissible for $p = 1, 2$, but not for $p \ge 3$; in that case, the James-Stein estimator
$$\hat\theta^{JS}_i = \left(1 - \frac{p - 2}{\sum_{j=1}^p X_j^2}\right)^{\!+} X_i$$
has smaller risk. However, the James-Stein estimator is not admissible either.
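The risk gap is easy to see in simulation. Below is a minimal sketch, assuming Python with NumPy (my own variable names), comparing the average quadratic loss of $\hat\theta_i = X_i$ with the positive-part James-Stein estimator for $p = 10$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_rep = 10, 20_000
theta = rng.normal(size=p)                      # a fixed true mean vector

X = rng.normal(loc=theta, size=(n_rep, p))      # n_rep draws of X ~ N(theta, I_p)
shrink = np.maximum(1.0 - (p - 2) / np.sum(X**2, axis=1, keepdims=True), 0.0)
js = shrink * X                                  # positive-part James-Stein estimate

risk_mle = np.mean(np.sum((X - theta) ** 2, axis=1))   # approximately p
risk_js  = np.mean(np.sum((js - theta) ** 2, axis=1))  # strictly smaller for p >= 3
print(risk_mle, risk_js)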

4.4 Unbiasedness

• $\delta$ is unbiased with respect to a loss $l$ if for every $P$
$$r_\delta(P) = E_{X \sim P}[l(\delta(X), P)] \le E_{X \sim P}[l(\delta(X), Q)], \quad \text{for all } Q.$$
That is, $E_{X \sim P}[l(\delta(X), Q)]$ is minimized at $Q^* = P$.

• When $\mathcal{P}$ is parametrized, $\delta$ is unbiased with respect to a loss $l$ if for every $\theta$
$$E_{X \sim P_\theta}[l(\delta(X), \theta)] \le E_{X \sim P_\theta}[l(\delta(X), \theta')], \quad \text{for all } \theta'.$$

Unbiasedness for quadratic loss.
If $l$ is the quadratic loss, then unbiasedness means
$$\theta = \operatorname{argmin}_{\theta'} E_{X \sim P_\theta}[(\delta(X) - \theta')^2] = \operatorname{argmin}_{\theta'} \big(\theta'^2 - 2\theta' E_{X \sim P_\theta}[\delta(X)]\big) = E_{X \sim P_\theta}[\delta(X)].$$

Bias-variance decomposition.
If $l$ is the quadratic loss, then the risk is
$$r_\delta(P) = E_X[(\delta(X) - \theta(P))^2] = E[(\delta(X) - E[\delta(X)] + E[\delta(X)] - \theta(P))^2]$$
$$= E[(\delta(X) - E[\delta(X)])^2] + (E[\delta(X)] - \theta(P))^2 = \mathrm{Var}(\delta(X)) + \mathrm{Bias}^2(\delta(X)).$$

5 Lecture 5: Building Statistical Procedure I

5.1 Sufficient statistics

• A statistic is a function $T: \mathcal{X} \mapsto \mathbb{R}$.

• A statistic $T$ is called sufficient for the model $\mathcal{P}$ if the conditional distribution of the data $X$ given the value $T(X) = T(x)$ does not depend on $P \in \mathcal{P}$.

Sufficient statistic for binomial distribution.
Suppose $X \in \{0,1\}^n$ with i.i.d. entries, where $P(X_i = 1) = p$ for all $i$. Then $T(X) = X^\top \mathbf{1}$ is sufficient:
$$P(X = x \,|\, X^\top \mathbf{1} = s) = \frac{P(X = x, X^\top \mathbf{1} = s)}{P(X^\top \mathbf{1} = s)} = \begin{cases} \dfrac{p^s(1-p)^{n-s}}{\binom{n}{s} p^s (1-p)^{n-s}} = \binom{n}{s}^{-1} & \text{if } x^\top \mathbf{1} = s \\ 0 & \text{if } x^\top \mathbf{1} \ne s \end{cases}$$
which does not depend on $p$.

• If $T(\cdot)$ is a sufficient statistic for $\mathcal{P}$ and $S$ is a one-to-one function, then $S(T(\cdot))$ is also a sufficient statistic for $\mathcal{P}$.

• A sufficient statistic which is a function of every other sufficient statistic is called minimal sufficient.

– It may not exist.

– In the binomial example, $T(X) = X$ is not minimal, but $T(X) = X^\top \mathbf{1}$ is.

• Let $\Pi$ be a prior distribution on $\mathcal{P}$. A statistic $T(\cdot)$ is called Bayes sufficient for $\Pi$ if the posterior distribution of $P$ given $X = x$ is the same as the posterior distribution of $P$ given $T(X) = T(x)$, for all $x$.

• (Kolmogorov) If $T(X)$ is sufficient for $\mathcal{P}$, it is Bayes sufficient for every $\Pi$.

– The converse holds under additional conditions, but not in general.

• (Rao-Blackwell) Constructing a decision rule from a sufficient statistic. Suppose that the loss function is convex in the action for fixed $P$:
$$l(\alpha_1 a_1 + \alpha_2 a_2, P) \le \alpha_1 l(a_1, P) + \alpha_2 l(a_2, P),$$
where $\alpha_1, \alpha_2 \ge 0$ and $\alpha_1 + \alpha_2 = 1$. If $T(X)$ is sufficient for $\mathcal{P}$ and $\delta$ is a decision rule, then the decision rule $\delta^*(X) = E[\delta(X) \,|\, T(X)]$ has uniformly smaller risk:
$$r_{\delta^*}(P) \le r_\delta(P), \quad \forall P.$$
Also, if $\delta$ is unbiased, so is $\delta^*$.

Rao-Blackwell for binomial distribution.
Suppose $X \in \{0,1\}^n$ with independent entries, where $P(X_i = 1) = p$ for all $i$. Then $T(X) = X^\top \mathbf{1}$ is sufficient for $p$. Consider $\delta(X) = X_1$ (an unbiased estimator of $p$). Then
$$\delta^*(X) = E[\delta(X) \,|\, T(X) = s] = E[X_1 \,|\, X^\top \mathbf{1} = s]$$
$$= 0 \cdot P(X_1 = 0 \,|\, X^\top \mathbf{1} = s) + 1 \cdot P(X_1 = 1 \,|\, X^\top \mathbf{1} = s) = P(X_1 = 1 \,|\, X^\top \mathbf{1} = s)$$
$$= \frac{P(X_1 = 1, X^\top \mathbf{1} = s)}{P(X^\top \mathbf{1} = s)} = \frac{p \cdot \binom{n-1}{s-1} p^{s-1}(1-p)^{(n-1)-(s-1)}}{\binom{n}{s} p^s (1-p)^{n-s}} = \frac{s}{n} = \frac{X^\top \mathbf{1}}{n}.$$
It is unbiased with respect to quadratic loss:
$$E_{X \sim P}\left(\frac{X^\top \mathbf{1}}{n}\right) = \frac{1}{n}\sum_{i=1}^n E(X_i) = p.$$
Thus, its risk is its variance (see the bias-variance decomposition):
$$r_{\delta^*}(P) = \mathrm{Var}\left(\frac{X^\top \mathbf{1}}{n}\right) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(X_i) = \frac{1}{n^2} \cdot n \cdot p(1-p) = \frac{p(1-p)}{n}.$$
It has uniformly smaller risk than $\delta(X)$ for any $p$:
$$r_{\delta^*}(P) = E_{X \sim P}\left[l\left(\frac{X^\top \mathbf{1}}{n}, p\right)\right] \le E_{X \sim P}[l(X_1, p)] = p(1-p)^2 + (1-p)(0-p)^2 = p(1-p).$$
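A quick simulation makes the Rao-Blackwell improvement visible: the conditioned rule $\bar{X}$ keeps the same mean as $\delta(X) = X_1$ but has far smaller variance. A minimal sketch, assuming Python with NumPy (names are mine, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_rep = 20, 0.3, 50_000

X = rng.binomial(1, p, size=(n_rep, n))    # each row is one sample X in {0,1}^n
delta = X[:, 0]                            # delta(X) = X_1
delta_star = X.mean(axis=1)                # delta*(X) = E[X_1 | sum_i X_i] = Xbar

# Both are unbiased for p, but the Rao-Blackwellized rule has risk p(1-p)/n.
print(delta.mean(), delta.var())            # approx p and p(1-p)
print(delta_star.mean(), delta_star.var())  # approx p and p(1-p)/n
```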

• (Neyman-Savage) Factorization criterion for sufficient statistics. Suppose that $X$ has a density (or mass function) $f(x|\theta)$. $T$ is sufficient for $\theta$ iff there are $g$ and $h$ such that
$$f(x|\theta) = g(T(x), \theta)\, h(x).$$

– $T$ is sufficient for $\theta$ if and only if the following is true:
$$T(x) = T(y) \;\Rightarrow\; f(x|\theta) = c(x, y)\, f(y|\theta).$$

– $T$ is minimal sufficient for $\theta$ if and only if the following is true:
$$T(x) = T(y) \;\Leftrightarrow\; f(x|\theta) = c(x, y)\, f(y|\theta).$$

Neyman-Savage factorization criterion for binomial distribution.
Suppose $X \in \{0,1\}^n$ with independent entries, where $P(X_i = 1) = p$ for all $i$. Then $T(X) = X^\top \mathbf{1}$ is sufficient for $p$, since if $T(x) = s$,
$$f(x|p) = p^s(1-p)^{n-s} = g(s, p)\, h(x), \qquad \text{where } g(s, p) = p^s(1-p)^{n-s},\; h(x) = 1.$$
$T$ is also minimal. Let
$$f(x|p) = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i}, \qquad f(y|p) = p^{\sum_i y_i}(1-p)^{n - \sum_i y_i}.$$
$T$ is minimal because
$$T(x) = T(y) = s \;\Leftrightarrow\; f(x|p) = c(x, y)\, f(y|p), \qquad \text{where } c(x, y) = 1.$$

Neyman-Savage factorization criterion for normal distribution.
Suppose $X_i \sim \mathcal{N}(\mu, \sigma^2)$, $i = 1, \dots, n$, where $\sigma^2$ is known, and we estimate $\mu$. Then $T(X) = X^\top \mathbf{1}$ is sufficient for $\mu$, since if $T(x) = s$,
$$f(x|\mu) = \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right)$$
$$= \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right) \cdot \exp\left(\frac{1}{2\sigma^2}\Big(2\mu \sum_{i=1}^n x_i - n\mu^2\Big)\right) = h(x)\, g(s, \mu),$$
where
$$g(s, \mu) = \exp\left(\frac{1}{2\sigma^2}(2\mu s - n\mu^2)\right), \qquad h(x) = \frac{1}{\sigma^n (2\pi)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right).$$
$T$ is also minimal. To see this, set $c(x, y) = \exp\left(-\frac{1}{2\sigma^2}\sum_i (x_i^2 - y_i^2)\right)$.

5.2 Complete statistics

• Assume a parametric model $\{P_\theta\}$ and the quadratic loss function.

• A statistic $S$ is complete if for every function $g$ that does not depend on $\theta$,
$$E_{X \sim P_\theta}[g(S(X))] = 0 \;\; \forall \theta \quad \Rightarrow \quad \Pr_{X \sim P_\theta}[g(S(X)) = 0] = 1 \;\; \forall \theta.$$
Roughly speaking, if the expectation is 0 for all $\theta$, then $g$ is identically zero.

Complete statistic for binomial distribution.
Suppose $X \in \{0,1\}^n$ with independent entries, where $P(X_i = 1) = p$ for all $i$. Then $T(X) = X^\top \mathbf{1}$ is complete for $p$: if
$$E[g(T(X))] = \sum_{k=0}^n g(k) \Pr(T(X) = k) = \sum_{k=0}^n g(k) \binom{n}{k} p^k (1-p)^{n-k}$$
equals zero for all $p \in [0,1]$, then $g(k) = 0$ for all $k$, because $E[g(T(X))]$ is a polynomial in $p$.

• (Lehmann-Scheffé) Any unbiased estimator based (only) on a complete, sufficient statistic is the minimum-variance unbiased estimator. That is, it has the smallest variance (= MSE for unbiased estimators), for all $\theta$, among all unbiased estimators of $\theta$.

5.3 Cramér-Rao bound

• In this part we only consider regular models, whose support ($\{x \,|\, f(x;\theta) > 0\}$) does not depend on $\theta$. We also assume that we may interchange integration and differentiation.

• The score function is
$$s(x;\theta) = \frac{\partial}{\partial\theta}\log f(x;\theta) = \frac{\frac{\partial}{\partial\theta} f(x;\theta)}{f(x;\theta)}.$$

– Note that
$$E_{X \sim P_\theta}[s(X;\theta)] = \int \frac{\frac{\partial}{\partial\theta} f(x;\theta)}{f(x;\theta)}\, f(x;\theta)\, dx = \int \frac{\partial}{\partial\theta} f(x;\theta)\, dx = \frac{\partial}{\partial\theta}\int f(x;\theta)\, dx = \frac{\partial}{\partial\theta} 1 = 0. \tag{5.1}$$

– When $X$ consists of independent r.v.s, then
$$s(x;\theta) = \frac{\partial}{\partial\theta}\log f(x;\theta) = \frac{\partial}{\partial\theta}\sum_{i=1}^n \log g(x_i;\theta) = \sum_{i=1}^n \frac{\partial}{\partial\theta}\log g(x_i;\theta).$$

• The Fisher information is defined by
$$I(\theta) = \mathrm{Var}_{X \sim P_\theta}[s(X;\theta)] = E_{X \sim P_\theta}[s^2(X;\theta)] = \int \left(\frac{\frac{\partial}{\partial\theta} f(x;\theta)}{f(x;\theta)}\right)^2 f(x;\theta)\, dx = \int \frac{\big(\frac{\partial}{\partial\theta} f(x;\theta)\big)^2}{f(x;\theta)}\, dx.$$

– Another way to compute $I(\theta)$ is via the second derivative. First note that
$$\frac{\partial^2}{\partial\theta^2}\log f = \frac{\partial}{\partial\theta}\frac{f'}{f} = \frac{f''f - f'f'}{f^2}.$$
Also,
$$E_{X \sim P_\theta}\left(\frac{f''}{f}\right) = \int \frac{f''}{f}\, f\, dx = \int f''\, dx = 0 \quad \text{as } \int f\, dx = 1.$$
Combining these, we have
$$E\left(-\frac{\partial^2}{\partial\theta^2}\log f\right) = E\left(\frac{f'f' - f''f}{f^2}\right) = E\left(\left(\frac{f'}{f}\right)^2\right) - E\left(\frac{f''}{f}\right) = E(s^2) - 0 = I(\theta).$$

– When $X$ consists of independent r.v.s, then
$$I(\theta) = \mathrm{Var}_{X \sim P_\theta}[s(X;\theta)] = \mathrm{Var}\left[\sum_{i=1}^n \frac{\partial}{\partial\theta}\log g(x_i;\theta)\right] = \sum_{i=1}^n \mathrm{Var}\left[\frac{\partial}{\partial\theta}\log g(x_i;\theta)\right] = \sum_{i=1}^n \int \frac{\big(\frac{\partial}{\partial\theta} g(x_i;\theta)\big)^2}{g(x_i;\theta)}\, dx_i.$$
If all $g_i$ are identical, then
$$I(\theta) = n \int \frac{\big(\frac{\partial}{\partial\theta} g(y;\theta)\big)^2}{g(y;\theta)}\, dy.$$

• The Cramér-Rao inequality provides a lower bound on the variance of any statistic $U(X)$. Consider the covariance of $s(X;\theta)$ and $U(X)$. By the Cauchy-Schwarz inequality,
$$[\mathrm{Cov}_{X \sim P_\theta}(s(X;\theta), U(X))]^2 \le \mathrm{Var}_{X \sim P_\theta}(s(X;\theta)) \cdot \mathrm{Var}_{X \sim P_\theta}(U(X)) = I(\theta) \cdot \mathrm{Var}_{X \sim P_\theta}(U(X)).$$
To compute $\mathrm{Cov}_{X \sim P_\theta}(s(X;\theta), U(X))$ (note Eq. (5.1)):
$$\mathrm{Cov}_{X \sim P_\theta}(s(X;\theta), U(X)) = E_{X \sim P_\theta}[s(X;\theta)U(X)] - E_{X \sim P_\theta}[s(X;\theta)]\, E_{X \sim P_\theta}[U(X)] = E_{X \sim P_\theta}[s(X;\theta)U(X)]$$
$$= \int \frac{\frac{\partial}{\partial\theta} f(x;\theta)}{f(x;\theta)}\, U(x)\, f(x;\theta)\, dx = \int \frac{\partial}{\partial\theta} f(x;\theta)\, U(x)\, dx = \frac{\partial}{\partial\theta}\int U(x)\, f(x;\theta)\, dx = \frac{\partial}{\partial\theta} E_{X \sim P_\theta}[U(X)].$$
Therefore, the Cramér-Rao lower bound gives
$$\mathrm{Var}_{X \sim P_\theta}(U(X)) \ge \frac{\left\{\frac{\partial}{\partial\theta} E_{X \sim P_\theta}[U(X)]\right\}^2}{I(\theta)}.$$

– When $U(X)$ is unbiased w.r.t. quadratic loss, $E_{X \sim P_\theta}(U(X)) = \theta$. Because $\frac{\partial}{\partial\theta} E_{X \sim P_\theta}[U(X)] = \frac{\partial}{\partial\theta}\theta = 1$, we have
$$\mathrm{Var}_{X \sim P_\theta}(U(X)) \ge \frac{1}{I(\theta)}.$$

Cramér-Rao lower bound for binomial distribution.
Suppose $X \in \{0,1\}^n$ with independent entries, where $P(X_i = 1) = p$ for all $i$, so $g(x_i; p) = p^{x_i}(1-p)^{1-x_i}$. To compute the Fisher information:
$$I(p) = n \cdot E_{X_i}\left[\left(\frac{\partial}{\partial p}\log g(X_i; p)\right)^2\right] = n \cdot E_{X_i}\left[\left(\frac{X_i}{p} - \frac{1 - X_i}{1 - p}\right)^2\right] = \frac{n}{p^2(1-p)^2} E_{X_i}[(X_i - p)^2] = \frac{n}{p^2(1-p)^2}\mathrm{Var}(X_i) = \frac{n}{p(1-p)}.$$
Therefore, any unbiased estimator of $p$ must have a variance (which equals its mean square error) at least $\frac{1}{I(p)} = \frac{p(1-p)}{n}$.
Now compute the variance (also the MSE) of the unbiased estimator $T(X) = \frac{X^\top \mathbf{1}}{n}$:
$$\mathrm{Var}(T(X)) = \frac{1}{n^2}\sum_{i=1}^n \mathrm{Var}(X_i) = \frac{p(1-p)}{n}.$$
Therefore, this estimator makes the Cramér-Rao lower bound tight.
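The Fisher information and the tightness of the bound can be verified numerically by averaging the squared score over simulated samples. A minimal sketch, assuming Python with NumPy (variable names are mine, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_rep = 50, 0.3, 200_000

X = rng.binomial(1, p, size=(n_rep, n))
# Score of the whole sample: sum_i (x_i/p - (1 - x_i)/(1 - p)).
score = (X / p - (1 - X) / (1 - p)).sum(axis=1)

I_mc = score.var()                 # Monte Carlo Fisher information of the sample
I_exact = n / (p * (1 - p))        # closed form n / (p(1-p))
var_pbar = X.mean(axis=1).var()    # variance of the unbiased estimator Xbar
print(I_mc, I_exact, var_pbar, 1 / I_exact)   # var_pbar is close to 1/I(p): the bound is tight
```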

• When is Cramér-Rao an equality? The inequality is the result of Cauchy-Schwarz; equality therefore holds if the score function has the form
$$s(x;\theta) = \frac{\partial}{\partial\theta}\log f(x;\theta) = c(\theta) + d(\theta)\, U(x).$$
Then
$$f(x;\theta) = \exp\big(\eta(\theta)\, U(x) - a(\theta) + g(x)\big).$$
That is, the exponential family attains equality in Cramér-Rao. For the exponential family, if we define $\eta = \eta(\theta)$ as a new parameter, then $a(\theta) = b(\eta)$. For the density, we have
$$\int f(x;\eta)\, dx = \int \exp(\eta U(x) - b(\eta) + g(x))\, dx = 1.$$
Differentiating in $\eta$ on both sides (assuming we can interchange integration and differentiation), we have
$$0 = \int \exp(\eta U(x) - b(\eta) + g(x))\,(U(x) - b'(\eta))\, dx = \int U(x)\exp(\eta U(x) - b(\eta) + g(x))\, dx - b'(\eta)\int \exp(\eta U(x) - b(\eta) + g(x))\, dx = E_X(U(X)) - b'(\eta).$$
Therefore, we have $E_X(U(X)) = b'(\eta)$. Similarly, we have $\mathrm{Var}_X(U(X)) = b''(\eta)$.

Cramér-Rao for the exponential distribution.
The exponential distribution is specified by
$$f(x;\lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{when } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
where $\lambda > 0$ is the parameter. Note that it can also be expressed as
$$f(x;\theta) = \begin{cases} \frac{1}{\theta} e^{-x/\theta} & \text{when } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
for $\theta > 0$. With this new parametrization, we can see that $U(X) = X$ is unbiased for $\theta$, with the least variance (MSE) one can have, because of the Cramér-Rao inequality.
Note that if we use the old parametrization with $\lambda$, then $U(X) = -X$ and
$$\lambda e^{-\lambda x} = e^{-\lambda x - (-\ln\lambda)},$$
so $b'(\lambda) = -\frac{1}{\lambda}$, which indeed equals $E_X(U(X)) = E_X(-X) = -E_X(X) = -\frac{1}{\lambda}$.
(Side note: having an estimator $\hat\theta$ of $\theta$ with some properties does not mean that $g(\hat\theta)$ is an estimator of $g(\theta)$ with the same properties, unless $g$ is very simple, say a linear function. For instance, an unbiased estimator of $\sigma^2$ may not be unbiased for $\sigma$.)

Cramér-Rao for the binomial distribution: revisited.
Suppose $X \in \{0,1\}^n$ with independent entries, where $P(X_i = 1) = p$ for all $i$, so $g(x_i; p) = p^{x_i}(1-p)^{1-x_i}$. The joint distribution is
$$f(x;p) = p^{\sum_i x_i}(1-p)^{n - \sum_i x_i} = \left(\frac{p}{1-p}\right)^{\sum_i x_i}(1-p)^n = \exp\left(\ln\frac{p}{1-p}\sum_i x_i + n\ln(1-p)\right).$$
Let $\eta = \ln\frac{p}{1-p}$; then
$$f(x;p) = \exp\left(\eta\sum_i x_i - n\ln(1 + e^\eta)\right).$$
Therefore, it is also a member of the exponential family. As a result, the Cramér-Rao bound is sharp (as we already showed).

6 Lecture 6: Building Statistical Procedure II

6.1 Substitution principle

• Let $x = (x_1, x_2, \dots, x_n)$, which is a realization of $X = (X_1, X_2, \dots, X_n)$, where $X_i$ is a r.v. with distribution $P$. We want to estimate $\theta(P)$, some characteristic quantity of $P$. Typically, the $X_i$ are independent, but it is not absolutely necessary; some permutational invariance (exchangeability) is enough.

• The empirical distribution of $x$ is the discrete distribution that assigns probability $\frac{1}{n}$ to every point $x_i$, denoted $P_x$:
$$P_x(E) = \frac{1}{n}\,\mathrm{card}\{i \,|\, x_i \in E\}.$$

• The substitution principle states that to estimate $\theta(P)$, replace $P$ by $P_x$.

Moment estimation.
Suppose we want to estimate the $k$th moment
$$\theta(P) = \int z^k\, dP(z), \quad k = 1, 2, \dots$$
The resulting substitution estimator is
$$\theta(P_x) = \int z^k\, dP_x(z) = \frac{1}{n}\sum_{i=1}^n x_i^k.$$

Variance estimation.
Suppose we want to estimate the variance
$$\theta(P) = \int\left(z - \int u\, dP(u)\right)^2 dP(z).$$
The resulting substitution estimator is
$$\theta(P_x) = \frac{1}{n}\sum_{i=1}^n\left(x_i - \frac{1}{n}\sum_{j=1}^n x_j\right)^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.$$

Linear regression.
Consider the linear regression model
$$Y = \alpha X + \beta + U, \tag{6.1}$$
where $X$ is the input variable and $Y$ is the output variable (jointly with $U$ from some distribution), $\alpha, \beta$ are the parameters of the model, and $U$ is an error term independent of $X$ (so they are also uncorrelated, $E(XU) = 0$) with $E(U) = 0$. Taking expectations on both sides, we have
$$E(Y) = \alpha E(X) + \beta. \tag{6.2}$$
Moreover, we can multiply Eq. (6.1) by $X$ and then take expectations:
$$E(XY) = \alpha E(X^2) + \beta E(X). \tag{6.3}$$
With Eq. (6.2) and Eq. (6.3), we can solve for $\alpha, \beta$ as
$$\alpha = \frac{E(XY) - E(X)E(Y)}{E(X^2) - (E(X))^2} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}, \qquad \beta = E(Y) - \alpha E(X).$$
By the substitution principle, all expectations (variances/covariances) can be computed from the sample $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ (see the sketch below):
$$\hat\alpha = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat\beta = \bar{y} - \hat\alpha\bar{x}.$$
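A minimal sketch of the substitution estimators $\hat\alpha, \hat\beta$, assuming Python with NumPy (the function name `substitution_regression` is mine, not from the notes):

```python
import numpy as np

def substitution_regression(x, y):
    """Plug-in estimates of alpha and beta in Y = alpha*X + beta + U."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    alpha = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta = y.mean() - alpha * x.mean()
    return alpha, beta

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.5 * x + 0.7 + rng.normal(scale=0.5, size=200)
print(substitution_regression(x, y))   # approximately (1.5, 0.7)
```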

Quantile estimation.
For $\tau \in (0,1)$, suppose we are going to estimate the quantile $q_\tau$ such that
$$P((-\infty, q_\tau]) \ge \tau, \qquad P([q_\tau, +\infty)) \ge 1 - \tau.$$
We can view the quantile in a different way. Define the "check function" as
$$\rho_\tau(z) = |z| + (2\tau - 1)z = \begin{cases} 2\tau z & \text{for } z > 0 \\ 2(\tau - 1)z & \text{for } z \le 0 \end{cases}$$
Then
$$q_\tau = \operatorname{argmin}_c E[\rho_\tau(Z - c)] = \operatorname{argmin}_c \int \rho_\tau(z - c)\, dP(z).$$
Therefore, given $\tau \in (0,1)$, to estimate $q_\tau$, find a minimizer $c^*$ of
$$\int \rho_\tau(z - c)\, dP_x(z) = \frac{1}{n}\sum_{i=1}^n \rho_\tau(x_i - c).$$
For instance, if $\tau = 0.5$, we are estimating the median, and then
$$\mathrm{card}\{i \,|\, x_i \le c^*\} \ge \frac{n}{2}, \qquad \mathrm{card}\{i \,|\, x_i \ge c^*\} \ge \frac{n}{2}.$$
That is, $c^*$ is just the sample median.

• The substitution principle can also be used to estimate non-parametric models. If we want to estimate the cumulative distribution function $F(z) = P((-\infty, z])$, by the substitution principle we have
$$F_n(z) = P_x((-\infty, z]) = \frac{1}{n}\,\mathrm{card}\{i \,|\, x_i \le z\},$$
which is essentially a step function.

– $E(F_n(z)) = F(z)$.

– $\mathrm{Var}(F_n(z)) = \frac{F(z)(1 - F(z))}{n}$.

– $F_n(z) \to F(z)$ as $n \to \infty$ (in probability, almost surely).

– $\sup_z |F_n(z) - F(z)| \to 0$ as $n \to \infty$ (in probability, almost surely).

6.2 Consistency

• If the estimator $\hat\theta_n$ converges to the target/estimated quantity $\theta$ as $n \to \infty$, where convergence is one of

– convergence in probability:
$$\Pr(|\hat\theta_n - \theta| \ge \varepsilon) \to 0 \quad \text{for every } \varepsilon > 0;$$

– almost sure convergence (with probability 1):
$$\Pr(|\hat\theta_n - \theta| \to 0) = 1;$$

– convergence in some mean sense:
$$E(|\hat\theta_n - \theta|^p) \to 0;$$

then we say $\hat\theta_n$ is consistent.

Consistent estimator for mean and variance.
Assume that the $X_i$ are iid r.v.s and the mean $\mu = E(X_i)$ exists. Then $\bar{X}_n$ is a consistent estimator for $\mu$: by a law of large numbers (there are different versions),
$$\bar{X}_n \xrightarrow{p} \mu \quad \text{as } n \to \infty.$$
Now further assume that the variance $\sigma^2 = \mathrm{Var}(X_i)$ exists. Consider the quantity
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n (X_i - \mu) = \sqrt{n}(\bar{X}_n - \mu).$$
A central limit theorem (again, many versions) states that
$$\frac{\bar{X}_n - E(\bar{X}_n)}{\sqrt{\mathrm{Var}(\bar{X}_n)}} = \frac{\bar{X}_n - \mu}{\sqrt{\sigma^2/n}} = \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma}$$
converges in distribution to the standard normal distribution $\mathcal{N}(0,1)$. It follows that $\sqrt{n}(\bar{X}_n - \mu)$ converges in distribution to $\mathcal{N}(0, \sigma^2)$.

6.3 Asymptotic normality

• If the estimator $\hat\theta_n$ of $\theta$ has the property that $\sqrt{n}(\hat\theta_n - \theta)$ converges in distribution to $\mathcal{N}(0, \sigma^2)$, then we call that estimator asymptotically normal with asymptotic variance $\sigma^2$.

– The smaller $\sigma^2$ is, the better (more accurate).

• To compare two asymptotically normal estimators $\hat\theta$ and $\tilde\theta$, with
$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} Z \sim \mathcal{N}(0, \sigma^2), \qquad \sqrt{n}(\tilde\theta - \theta) \xrightarrow{d} Z \sim \mathcal{N}(0, \tilde\sigma^2),$$
the asymptotic relative efficiency (ARE) of $\tilde\theta$ to $\hat\theta$ is defined as
$$\mathrm{ARE}(\tilde\theta, \hat\theta) = \frac{\sigma^2}{\tilde\sigma^2}.$$

ARE of sample mean versus sample median.
Assume that the $X_i$ are iid r.v.s whose mean and median are both $\mu$.

– Suppose that the variance of $X_i$ is $\sigma^2$; from the central limit theorem, we know that $\sigma^2$ is the asymptotic variance of the sample mean.

– Suppose that the common density $f$ of the $X_i$ exists and is positive at $\mu$. For the sample median, Kolmogorov proved that under these assumptions, it is asymptotically normal with asymptotic variance $\frac{1}{4(f(\mu))^2}$.

For instance, if the distribution of $X_i$ is normal, then the asymptotic variance of the sample mean is $\sigma^2$, while the asymptotic variance of the sample median is
$$\frac{1}{4(f(\mu))^2} = \frac{\pi}{2}\sigma^2.$$
Therefore,
$$\mathrm{ARE}(\hat\mu_{\mathrm{median}}, \hat\mu_{\mathrm{mean}}) = \frac{\sigma^2}{\frac{\pi}{2}\sigma^2} = \frac{2}{\pi} \approx 0.6366,$$
which means the sample mean $\hat\mu_{\mathrm{mean}}$ is more efficient. For any unimodal $f$ (having only one mode), the ratio is $\ge 1/3$, and there are $f$ with ratio $> 1$, i.e., for which the sample median is more efficient (the $t$ distribution with 3 or 4 degrees of freedom, for instance).

6.4 Maximum likelihood estimate

• Likelihood:
$$L(\theta) = f(x;\theta).$$
If we have independent r.v.s, then
$$L(\theta) = f(x;\theta) = \prod_{i=1}^n g(x_i;\theta).$$

• The maximum likelihood estimate is given by
$$\hat\theta = \operatorname{argmax}_\theta L(\theta).$$
Usually we take the logarithm when we have independent r.v.s.

• Suppose that $\hat\theta_n$ are maximum likelihood estimators of $\theta$, from an iid sample where the distribution of $X_i$ is specified by $\theta$. Then, typically,

– Maximum likelihood estimators are consistent (in probability): $\hat\theta_n \xrightarrow{p} \theta$.

– They are asymptotically normal, and asymptotically efficient:
$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} Z \sim \mathcal{N}\left(0, \frac{1}{I(\theta)}\right),$$
where $I(\theta)$ is the Fisher information for one observation from the family parametrized by $\theta$. (Note that the Fisher information for the whole sample $X_1, X_2, \dots, X_n$ is $nI(\theta)$.)

7 Lecture 7: Estimating the precision of estimates

7.1 Bootstrap

• We care about how accurate our estimates are.

Standard error: a canonical example.
Consider an example where $X_1, \dots, X_n$ are i.i.d. r.v.s with the same distribution $P$ with mean $\mu$ and variance $\sigma^2$. We estimate the mean $\mu$ by $\bar{X}$. It is known that $\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$, so the standard error of the sample mean is given by
$$\mathrm{se}_{X \sim P}(\bar{X}) = \frac{\sigma}{\sqrt{n}}.$$
However, we don't know $\sigma$, so we estimate it by the substitution principle:
$$\mathrm{se}_{X \sim P_x}(\bar{X}) = \frac{1}{\sqrt{n}}\sqrt{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2}.$$
(Sometimes $n - 1$ is preferable.)

Unlike the example above, the standard error (standard deviation) of an estimator $\hat\theta$ may not be computable in closed form. So we can estimate its standard error via the bootstrap (see the sketch below).

– Generate $B$ bootstrap samples of size $n$ (sampling from $P_x$, i.e. from the data, with replacement).

– We estimate $\mathrm{se}_{P_x}(\hat\theta)$ by the standard deviation of the bootstrap estimates $\hat\theta^*$:
$$\mathrm{se}_{P_x}(\hat\theta) \approx \sqrt{\frac{1}{B}\sum_{b=1}^B \left(\hat\theta^*_b - \frac{1}{B}\sum_{b'=1}^B \hat\theta^*_{b'}\right)^2}. \tag{7.1}$$

– In theory, there are $\binom{2n-1}{n}$ distinct bootstrap samples of size $n$ (place $n - 1$ boards in between $n$ balls). However, their probabilities are different. The probability of a bootstrap sample in which $x_i$ appears $k_i$ times, with $k_i \ge 0$ and $k_1 + k_2 + \dots + k_n = n$, is
$$\frac{n!}{n^n\, k_1!\, k_2! \cdots k_n!}.$$
The most probable sample is the one with all $k_i = 1$, i.e. the original one.
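Below is a minimal sketch of the bootstrap standard error of Eq. (7.1), assuming Python with NumPy (the function name `bootstrap_se` is mine, not from the notes). It works for any estimator passed in as a function, such as the sample median, whose standard error has no simple closed form.

```python
import numpy as np

def bootstrap_se(x, estimator, B=2000, rng=None):
    """Estimate the standard error of estimator(x) by resampling from P_x."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x)
    n = x.size
    boot = np.array([estimator(x[rng.integers(0, n, size=n)]) for _ in range(B)])
    return boot.std()   # the standard deviation in Eq. (7.1)

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100)
print(bootstrap_se(x, np.median, rng=rng))   # SE of the sample median
```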

• Bias correction via bootstrap.

– The bias of $\hat\theta$ is
$$b_{X \sim P}(\hat\theta) = E_{X \sim P}(\hat\theta) - \theta.$$
If it is known, then we can use
$$\tilde\theta = \hat\theta - b_{X \sim P}(\hat\theta)$$
as a "corrected" estimate: $E_{X \sim P}(\tilde\theta) = E_{X \sim P}(\hat\theta) - b_{X \sim P}(\hat\theta) = [\theta + b_{X \sim P}(\hat\theta)] - b_{X \sim P}(\hat\theta) = \theta$.

– When $b_{X \sim P}(\hat\theta)$ is unknown, we estimate it by
$$b_{X \sim P_x}(\hat\theta) = E_{X \sim P_x}(\hat\theta) - \hat\theta,$$
where $E_{X \sim P_x}(\hat\theta)$ can be estimated by the bootstrap average $\frac{1}{B}\sum_{b=1}^B \hat\theta^*_b$. That is, we can correct the bias by
$$\tilde\theta = 2\hat\theta - E_{X \sim P_x}(\hat\theta).$$

• Parametric bootstrap

– The non-parametric bootstrap substitutes $P_x$ for $P$.

– The parametric bootstrap assumes that the distribution $P$ comes from a model $\{P_\theta\}_{\theta \in \Theta}$ and substitutes $P_{\hat\theta}$ for $P$. In the Monte Carlo approximation, it means that we do not draw random samples from $P_x$, but from $P_{\hat\theta}$ instead.

7.2 Delta method

• Suppose we have an asymptotic normality theorem for $\hat\theta = \hat\theta_n$ (for example, the CLT with $\hat\theta = \bar{X}$):
$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{L} \mathcal{N}(0, \sigma^2),$$
then we have
$$\hat\theta_n \mathbin{\dot\sim} \mathcal{N}\left(\theta, \frac{\sigma^2}{n}\right),$$
where $\dot\sim$ means "approximately distributed as". $\sigma$ can be known or estimated, and then we can use $\sigma/\sqrt{n}$ as the standard error of $\hat\theta_n$.

• Sometimes we care about $g(\theta)$ instead of $\theta$ itself. Then we may estimate $g(\theta)$ by $g(\hat\theta)$ (this works for the MLE, for instance). If $g$ is differentiable (which implies continuous) at $\theta$ and $g'(\theta) \ne 0$, then
$$\sqrt{n}\,\frac{g(\hat\theta_n) - g(\theta)}{\sigma|g'(\theta)|} \xrightarrow{L} \mathcal{N}(0, 1),$$
and then
$$g(\hat\theta_n) \mathbin{\dot\sim} \mathcal{N}\left(g(\theta), \frac{\sigma^2 (g'(\theta))^2}{n}\right),$$
which implies that the standard error of $g(\hat\theta_n)$ is $\frac{\sigma|g'(\theta)|}{\sqrt{n}}$.

• $\theta$ is unknown, so $g'(\theta)$ is also unknown. If $\hat\theta_n$ is consistent ($\xrightarrow{p} \theta$) and $g'$ is continuous (at $\theta$), then we have (by Slutsky's theorem)
$$\sqrt{n}\,\frac{g(\hat\theta_n) - g(\theta)}{\sigma|g'(\hat\theta_n)|} \xrightarrow{L} \mathcal{N}(0, 1),$$
and then
$$g(\hat\theta_n) \mathbin{\dot\sim} \mathcal{N}\left(g(\theta), \frac{\sigma^2 (g'(\hat\theta_n))^2}{n}\right).$$

8 Lecture 8: Confidence interval

8.1 Bayesian confidence/probability intervals

• Bayesian approach: everything is in the posterior distribution.

• Percentile method.

– Take two quantiles, $q_\beta$ and $q_{1-\gamma}$, of the posterior, with $\beta, \gamma$ set such that
$$\Pr(q_\beta \le \theta \le q_{1-\gamma}) = 1 - \alpha.$$
Usually, $\beta = \gamma = \alpha/2$.

– HPD (highest posterior density). With posterior density $f_{\theta|x}(u)$, find $c$ such that the region $E = \{u \,|\, f_{\theta|x}(u) \ge c\}$ satisfies
$$\Pr_{\theta|x}(E) = \int_E f_{\theta|x}(u)\, du = 1 - \alpha \quad (\text{or} \ge 1 - \alpha \text{ if discrete}).$$
It is the shortest interval if $f_{\theta|x}(u)$ is unimodal.

8.2 General confidence intervals

• Main idea: find the distribution of the estimates.

Normal observations: unknown $\mu$, known $\sigma$.
Consider an example where $X_1, \dots, X_n$ are i.i.d. r.v.s with $\mathcal{N}(\mu, \sigma^2)$. We estimate the mean $\mu$ by $\bar{X}$. We know that $\bar{X} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$, so $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1)$. Then
$$1 - \alpha = \Pr\left[\left|\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right| \le z_{\alpha/2}\right] = \Pr\left[\bar{X} - \frac{\sigma}{\sqrt{n}} z_{\alpha/2} \le \mu \le \bar{X} + \frac{\sigma}{\sqrt{n}} z_{\alpha/2}\right],$$
where $z_{\alpha/2}$ is the upper $\alpha/2$-quantile (i.e. the $(1 - \alpha/2)$-quantile) of the standard normal $\mathcal{N}(0,1)$.

Generally, if $\hat\theta$ is (approximately) $\mathcal{N}(\theta, (\mathrm{se}(\hat\theta))^2)$, then
$$\Pr\left[\hat\theta - \mathrm{se}(\hat\theta)\, z_{\alpha/2} \le \theta \le \hat\theta + \mathrm{se}(\hat\theta)\, z_{\alpha/2}\right] = 1 - \alpha.$$

– Bootstrap confidence intervals (normal case). If $\mathrm{se}(\hat\theta)$ is unknown ($\sigma$ is unknown), then we can estimate it via the bootstrap, Eq. (7.1). This works if $\hat\theta$ is (approximately) normal.

– Bootstrap "percentile" confidence intervals (normal case). We can estimate the end points $\hat\theta \pm \mathrm{se}(\hat\theta)\, z_{\alpha/2}$ directly by the bootstrap estimates $\hat\theta^*_{\alpha/2}, \hat\theta^*_{1-\alpha/2}$. Recall that we have $B$ bootstrap sample estimates $\hat\theta^*$; $\hat\theta^*_{\alpha/2}$ corresponds to the $\alpha/2$ sample quantile of these $B$ estimates.

– Bootstrap pivotal confidence intervals. We can estimate the $\alpha/2$ and $1 - \alpha/2$ quantiles ($q_{\alpha/2}$ and $q_{1-\alpha/2}$) of $\hat\theta - \theta$ by $\hat\theta^*_{\alpha/2} - \hat\theta$ and $\hat\theta^*_{1-\alpha/2} - \hat\theta$. Then
$$1 - \alpha = \Pr[q_{\alpha/2} \le \hat\theta - \theta \le q_{1-\alpha/2}] = \Pr[\hat\theta - q_{1-\alpha/2} \le \theta \le \hat\theta - q_{\alpha/2}]$$
$$\approx \Pr[\hat\theta - (\hat\theta^*_{1-\alpha/2} - \hat\theta) \le \theta \le \hat\theta - (\hat\theta^*_{\alpha/2} - \hat\theta)] = \Pr[2\hat\theta - \hat\theta^*_{1-\alpha/2} \le \theta \le 2\hat\theta - \hat\theta^*_{\alpha/2}].$$
(A sketch of both bootstrap intervals follows this list.)
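A minimal sketch of the percentile and pivotal bootstrap intervals, assuming Python with NumPy (the function name `bootstrap_cis` is mine, not from the notes):

```python
import numpy as np

def bootstrap_cis(x, estimator, alpha=0.05, B=5000, rng=None):
    """Percentile and pivotal bootstrap confidence intervals for theta(P)."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x)
    n = x.size
    theta_hat = estimator(x)
    boot = np.array([estimator(x[rng.integers(0, n, size=n)]) for _ in range(B)])
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    percentile = (lo, hi)                               # (theta*_{a/2}, theta*_{1-a/2})
    pivotal = (2 * theta_hat - hi, 2 * theta_hat - lo)  # (2*theta_hat - theta*_{1-a/2}, 2*theta_hat - theta*_{a/2})
    return percentile, pivotal

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)
print(bootstrap_cis(x, np.mean, rng=rng))
```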

Normal observations: unknown $\mu$, unknown $\sigma$.
Consider an example where $X_1, \dots, X_n$ are i.i.d. r.v.s with $\mathcal{N}(\mu, \sigma^2)$. We estimate the mean $\mu$ by $\bar{X}$. Let
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.$$
We know that
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1), \qquad \chi^2 = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2(n-1),$$
$$t = \frac{Z}{\sqrt{\chi^2/(n-1)}} = \frac{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{(n-1)s^2}{\sigma^2(n-1)}}} = \sqrt{n}\,\frac{\bar{X} - \mu}{s} \sim \frac{\mathcal{N}(0,1)}{\sqrt{\frac{\chi^2(n-1)}{n-1}}} = t(n-1),$$
since $Z$ and $\chi^2$ are independent. Then
$$1 - \alpha = \Pr\left[|t| \le t_{\alpha/2}(n-1)\right] = \Pr\left[\bar{X} - \frac{s}{\sqrt{n}}\, t_{\alpha/2}(n-1) \le \mu \le \bar{X} + \frac{s}{\sqrt{n}}\, t_{\alpha/2}(n-1)\right],$$
where $t_{\alpha/2}(n-1)$ is the upper $\alpha/2$-quantile of $t(n-1)$, the $t$ distribution with $n-1$ degrees of freedom.

9 Lecture 9: Hypothesis testing

9.1 Setup

• Null hypothesis set $\mathcal{P}_0$; alternative hypothesis set $\mathcal{P}_A$ ($\Theta_0$ and $\Theta_A$ if parametric).

• $\mathcal{P}_0 \cap \mathcal{P}_A = \emptyset$; $\mathcal{P}_0 \cup \mathcal{P}_A = \mathcal{P}$.

• Rejection region $R \subseteq \mathcal{X}$: if the data $X$ falls into $R$, then reject the null hypothesis; accept the null hypothesis if $X \in \mathcal{X} \setminus R$.

• Errors

Table 1: Testing errors

                          Decision
                 Accept H_0         Reject H_0
  Truth  H_0     Correct            Type I Error
         H_A     Type II Error      Correct

9.2 Testing evaluation

• Power function, level, size

– The power function is defined as
$$\beta(P) = \Pr_{X \sim P}(X \in R),$$
where $R$ is determined by the testing method. Note that
$$\beta(P) = \begin{cases} \Pr(\text{Type I error}) & \text{if } P \in \mathcal{P}_0 \\ 1 - \Pr(\text{Type II error}) & \text{if } P \in \mathcal{P}_A \end{cases}$$
We say that a test is powerful if $\beta(P)$ is "large" for $P \in \mathcal{P}_A$.

– Given $0 \le \alpha \le 1$, a test is of (significance) level $\alpha$ if $\sup_{P \in \mathcal{P}_0} \beta(P) \le \alpha$.

– Given $0 \le \alpha \le 1$, a test is of size $\alpha$ if $\sup_{P \in \mathcal{P}_0} \beta(P) = \alpha$.

• Most powerful test. A test at level $\alpha$ that has power higher than or equal to that of all other tests at level $\alpha$, for all $P \in \mathcal{P}_A$, is called uniformly most powerful at level $\alpha$.

• Neyman-Pearson lemma. To test one simple hypothesis $P_0$ against one simple alternative hypothesis $P_A$, assume they can be represented by densities $f_0(x)$ and $f_A(x)$, respectively. On the basis of the observed $x$, the (uniformly) most powerful test exists and is
$$\text{reject } H_0 \text{ if } \frac{f_A(x)}{f_0(x)} \ge c,$$
where $c$ is set so that
$$P_0[x \in R] = P_0\left[\frac{f_A(x)}{f_0(x)} \ge c\right] = \alpha.$$

– Randomized version:
$$\text{reject } H_0 \text{ if } \frac{f_A(x)}{f_0(x)} > c; \quad \text{reject } H_0 \text{ with probability } d \text{ if } \frac{f_A(x)}{f_0(x)} = c; \quad \text{accept } H_0 \text{ if } \frac{f_A(x)}{f_0(x)} < c,$$
where $c$ and $d \in [0,1]$ are set so that
$$P_0[x \in R] = P_0\left[\frac{f_A(x)}{f_0(x)} > c\right] + P_0\left[\frac{f_A(x)}{f_0(x)} = c\right] \cdot d = \alpha.$$

Normal observations: most powerful test.
Consider an example where $X_1, \dots, X_n$ are i.i.d. r.v.s with $\mathcal{N}(\mu, \sigma^2)$, where $\sigma$ is known. We test $H_0: \mu = \mu_0$ against $H_A: \mu = \mu_A$. The most powerful test has rejection region
$$\frac{\frac{1}{\sigma^n(2\pi)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum (X_i - \mu_A)^2}}{\frac{1}{\sigma^n(2\pi)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum (X_i - \mu_0)^2}} = \exp\left(\frac{1}{2\sigma^2}\Big(\sum (X_i - \mu_0)^2 - \sum (X_i - \mu_A)^2\Big)\right) \ge c,$$
which is equivalent to
$$\bar{X} = \frac{1}{n}\sum X_i \ge k \;\text{ if } \mu_A > \mu_0; \qquad \bar{X} = \frac{1}{n}\sum X_i \le k \;\text{ if } \mu_A < \mu_0.$$
The constant $k$ is chosen so that $P_0(\bar{X} \ge k) = \alpha$. We know that under $H_0$ (so that we compute $P_0$ based on $\mu_0$), $\bar{X} \sim \mathcal{N}(\mu_0, \sigma^2/n)$, that is, $\sqrt{n}\,\frac{\bar{X} - \mu_0}{\sigma} \sim \mathcal{N}(0,1)$. Therefore, $k = \mu_0 + \frac{\sigma}{\sqrt{n}} z_\alpha$, with $z_\alpha$ the upper $\alpha$-quantile of $\mathcal{N}(0,1)$ (see the sketch below).
The result here can be extended to composite testing where $H_0: \mu \le \mu_0$ and $H_A: \mu > \mu_0$.
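The threshold $k$ and the power $\beta(\mu_A)$ of this test are easy to compute. A minimal sketch, assuming Python with NumPy and SciPy (the function name `power_of_z_test` is mine, not from the notes), for the case $\mu_A > \mu_0$:

```python
import numpy as np
from scipy.stats import norm

def power_of_z_test(mu0, muA, sigma, n, alpha=0.05):
    """Rejection threshold k and power of the test 'reject if Xbar >= k' (case muA > mu0)."""
    k = mu0 + sigma / np.sqrt(n) * norm.ppf(1 - alpha)        # so that P0(Xbar >= k) = alpha
    power = 1 - norm.cdf((k - muA) / (sigma / np.sqrt(n)))    # P_A(Xbar >= k)
    return k, power

print(power_of_z_test(mu0=0.0, muA=0.5, sigma=1.0, n=25))
```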

• Unbiased test. A test with power function $\beta(P)$ is unbiased if $\beta(P_A) \ge \beta(P_0)$ for every $P_A \in \mathcal{P}_A$, $P_0 \in \mathcal{P}_0$.

9.3 p-value

• Suppose we have nested rejection regions $R_{\alpha_1} \subseteq R_{\alpha_2}$ whenever $\alpha_1 \le \alpha_2$. Given the observed data $x$, the observed significance level (or p-value) is defined as
$$p(x) = \inf\{\alpha \,|\, x \in R_\alpha\}.$$

10 Lecture 10: Multiple testing

Suppose we have $K$ tests, $k = 1, 2, \dots, K$: testing $H_{0k}: P_{0k}$ against $H_{Ak}: P_{Ak}$ with rejection region $R_k$.

10.1 Union-intersection test

• Testing $H_0: \mathcal{P}_0 = \bigcap_k \mathcal{P}_{0k}$ against $H_A: \mathcal{P}_A = \left(\bigcap_k \mathcal{P}_{0k}\right)^c = \bigcup_k \mathcal{P}_{Ak}$.

• Rejection region is $R = \bigcup_k R_k$.

• Union bound: $P_0(X \in R) = P_0(X \in \bigcup_k R_k) \le \sum_k P_0(X \in R_k)$.

10.2 Intersection-union test

• Testing $H_0: \mathcal{P}_0 = \bigcup_k \mathcal{P}_{0k}$ against $H_A: \mathcal{P}_A = \left(\bigcup_k \mathcal{P}_{0k}\right)^c = \bigcap_k \mathcal{P}_{Ak}$.

• Rejection region is $R = \bigcap_k R_k$.

10.3 Controlling family-wise error rate

Family-wise error rate (FWER): the probability of committing at least one error of the first kind. We want to bound it as
$$\mathrm{FWER} = P_0\left(X \in \bigcup_{k=1}^K R_k\right) \le \alpha.$$

• Bonferroni method: reject all null hypotheses whose p-value $p_k$ is smaller than $\alpha/K$. By the union bound,
$$\mathrm{FWER} = P_0(X \in R) = P_0\left(X \in \bigcup_{k=1}^K R_k\right) \le \sum_{k=1}^K P_0(X \in R_k) \le \sum_{k=1}^K \frac{\alpha}{K} = \alpha.$$

• Holm method (a sketch of the procedure follows this list).

– Order the p-values as $p_{(1)} \le p_{(2)} \le \dots \le p_{(K)}$.

– If $\frac{\alpha}{K} \le p_{(1)}$, then accept all null hypotheses and stop; otherwise reject $H_{0(1)}$ and continue.

– If $\frac{\alpha}{K-1} \le p_{(2)}$, then accept all remaining null hypotheses and stop; otherwise reject $H_{0(2)}$ and continue.

– $\cdots$

– If $\frac{\alpha}{1} \le p_{(K)}$, then accept $H_{0(K)}$ and stop; otherwise reject $H_{0(K)}$ and stop.
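A minimal sketch of Holm's step-down procedure, assuming Python with NumPy (the function name `holm` is mine, not from the notes):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Return a boolean array: True where the null hypothesis is rejected (Holm's method)."""
    pvals = np.asarray(pvals, float)
    K = pvals.size
    order = np.argsort(pvals)
    reject = np.zeros(K, dtype=bool)
    for step, idx in enumerate(order):            # step = 0, 1, ..., K-1
        if pvals[idx] >= alpha / (K - step):      # compare p_(i) against alpha/(K - i + 1)
            break                                 # accept this and all remaining hypotheses
        reject[idx] = True
    return reject

print(holm([0.001, 0.02, 0.04, 0.30]))   # only the smallest p-value is rejected here
```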

10.4 Controlling false discovery rate

Rejecting a null hypothesis when it is true is a "false discovery" (Type I error).

• The false discovery proportion (FDP) is defined as
$$\mathrm{FDP} = \frac{\#\text{ of false discoveries}}{\#\text{ of all discoveries}},$$
where the counts are over the $K$ tests.

• The false discovery rate (FDR) is defined as the expectation of the FDP, i.e.,
$$\mathrm{FDR} = E(\mathrm{FDP}).$$
We want to control the FDR as $\mathrm{FDR} \le \alpha$.

• Benjamini and Hochberg method (see the sketch below).

– Order the p-values as $p_{(1)} \le p_{(2)} \le \dots \le p_{(K)}$.

– Let $l_i = \frac{i\alpha}{K C_K}$, where
$$C_K = \begin{cases} 1 & \text{if the tests are independent} \\ \sum_{i=1}^K \frac{1}{i} & \text{otherwise} \end{cases}$$
Let $r = \max\{i \,|\, p_{(i)} < l_i\}$.

– Set $t = p_{(r)}$ as the Benjamini-Hochberg rejection threshold. Reject all null hypotheses whose $p_k \le t$.
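A minimal sketch of the Benjamini-Hochberg procedure, assuming Python with NumPy (the function name `benjamini_hochberg` is mine, not from the notes):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05, independent=True):
    """Return a boolean array of rejections using the Benjamini-Hochberg threshold."""
    pvals = np.asarray(pvals, float)
    K = pvals.size
    c_K = 1.0 if independent else np.sum(1.0 / np.arange(1, K + 1))
    p_sorted = np.sort(pvals)
    levels = np.arange(1, K + 1) * alpha / (K * c_K)      # l_i = i*alpha / (K*C_K)
    below = np.nonzero(p_sorted < levels)[0]
    if below.size == 0:
        return np.zeros(K, dtype=bool)                    # no discoveries
    t = p_sorted[below.max()]                             # threshold t = p_(r)
    return pvals <= t

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74]))
```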

11 Lecture 11: Hypothesis testing, practical procedures

11.1 Wald test

• $\hat\theta$ is an estimator of $\theta$. To test $H_0: \theta = \theta_0$ against the alternative $H_A: \theta \ne \theta_0$, the quantity $\frac{\hat\theta - \theta}{\mathrm{se}(\hat\theta|\theta)}$ is a good indicator of discrepancy.

• Suppose $\hat\theta$ is (approximately) normal:
$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{L} \mathcal{N}(0, \sigma^2) \;\Longrightarrow\; \hat\theta \mathbin{\dot\sim} \mathcal{N}(\theta, (\mathrm{se}(\hat\theta|\theta))^2) \;\Longrightarrow\; \frac{\hat\theta - \theta}{\mathrm{se}(\hat\theta|\theta)} \mathbin{\dot\sim} \mathcal{N}(0, 1).$$

• If $\mathrm{se}(\hat\theta|\theta)$ is unknown (because $\theta$ is unknown), then we can estimate it by
$$\mathrm{se}(\hat\theta|\theta) \approx \mathrm{se}(\hat\theta|\hat\theta) = \sqrt{\widehat{\mathrm{Var}}(\hat\theta)}, \quad \text{or } \sigma^2 \approx \hat\sigma^2.$$
We reject $H_0$ if $\frac{\hat\theta - \theta_0}{\sqrt{\widehat{\mathrm{Var}}(\hat\theta)}}$ is too large or too small. Equivalently, reject $H_0$ if
$$\frac{(\hat\theta - \theta_0)^2}{\widehat{\mathrm{Var}}(\hat\theta)} \mathbin{\dot\sim} \chi^2(1)$$
is too large.

• In the multidimensional case, with (approximate) normality,
$$\sqrt{n}(\hat\theta - \theta) \xrightarrow{L} \mathcal{N}(0, V),$$
where $V_{p \times p}$ is the variance matrix, the Wald test becomes: reject $H_0$ if
$$n(\hat\theta - \theta_0)^\top V^{-1}(\hat\theta - \theta_0) \mathbin{\dot\sim} \chi^2(p)$$
is too large. If $V$ is unknown, estimate it as $\hat{V} = V(\hat\theta)$ or $\hat{V} = V(\theta_0)$.

• If $\hat\theta$ is an MLE of $\theta$, then
$$V(\theta) = I^{-1}(\theta),$$
where $I(\theta)$ is the Fisher information matrix for ONE observation.

– In the one-dimensional case, $\mathrm{Var}(\hat\theta) = \frac{1}{nI(\theta)}$.

11.2 Likelihood ratio test

Consider a parametric model and the hypotheses $H_0: \theta \in \Theta_0$ and $H_A: \theta \in \Theta_A$.

• From the Neyman-Pearson lemma, the optimal test (for simple hypotheses) is based on
$$\frac{f_A(x)}{f_0(x)} = \frac{L(\theta_A)}{L(\theta_0)} \ge c,$$
or, equivalently,
$$\log L(\theta_A) - \log L(\theta_0) = l(\theta_A) - l(\theta_0) \ge c'.$$

• To extend this to composite hypotheses, the likelihood ratio test statistic is defined as follows: reject $H_0$ if
$$\frac{\sup_{\theta \in \Theta_A} L(\theta)}{\sup_{\theta \in \Theta_0} L(\theta)} \ge c,$$
or, alternatively, reject $H_0$ if
$$\frac{\sup_{\theta \in \Theta} L(\theta)}{\sup_{\theta \in \Theta_0} L(\theta)} \ge c.$$

• Let $\hat\theta$ and $\hat\theta_0$ be the unconstrained and constrained MLEs, respectively. Then the test in logarithmic form is: reject $H_0$ if
$$2\big(l(\hat\theta) - l(\hat\theta_0)\big) \ge c,$$
where the factor 2 ensures that the statistic has the approximate distribution $\chi^2(p)$, with $p$ the number of restrictions imposed by the null hypothesis.

11.3 Rao score test via Lagrange multipliers

If the null hypothesis is interpreted as a restriction on the parameters, $H_0: g(\theta) = 0$, and the alternative is again $H_A: g(\theta) \ne 0$, then, following the idea of Neyman-Pearson, we can check the magnitude of the Lagrange multiplier as an indicator of how much the constraint is violated.

• Consider maximizing $l(\theta) - \lambda g(\theta)$. Setting the derivative (in $\theta$) to zero, we have
$$\lambda(\theta) = \frac{l'(\theta)}{g'(\theta)} = \frac{f'(x;\theta)}{f(x;\theta)} \cdot \frac{1}{g'(\theta)}.$$
We reject the null if $\left|\frac{\lambda(\theta)}{\mathrm{se}(\lambda(\theta))}\right|$ or $\frac{\lambda^2(\theta)}{\mathrm{Var}(\lambda(\theta))}$ is too large.

• When $g(\theta) = \theta - \theta_0$, $g'(\theta) = 1$ and $\mathrm{Var}(\lambda(\theta_0)) = nI(\theta_0)$. Then
$$\frac{\lambda(\theta_0)}{\sqrt{nI(\theta_0)}} \mathbin{\dot\sim} \mathcal{N}(0,1), \qquad \frac{\lambda^2(\theta_0)}{nI(\theta_0)} \mathbin{\dot\sim} \chi^2(1).$$
Quantiles can be applied to find the rejection region.

Score test for Binomial.
$X \sim \mathrm{Bin}(n, p)$. Compute the score function and Fisher information:
$$\lambda(p) = s(p) = \frac{n(\hat{p} - p)}{p(1-p)}, \qquad nI(p) = \frac{n}{p(1-p)}.$$
Therefore,
$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} \mathbin{\dot\sim} \mathcal{N}(0,1),$$
which is equivalent to the Wald test with $\mathrm{se}(\hat{p}|p)$ estimated as $\mathrm{se}(\hat{p}|p_0)$ instead of $\mathrm{se}(\hat{p}|\hat{p})$ (see the sketch below).
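A minimal sketch of this binomial score test, assuming Python with NumPy and SciPy (the function name `score_test_binomial` is mine, not from the notes); it returns the $Z$ statistic above and a two-sided p-value.

```python
import numpy as np
from scipy.stats import norm

def score_test_binomial(x, n, p0):
    """Two-sided Rao score test of H0: p = p0 for X ~ Bin(n, p); returns (Z, p-value)."""
    p_hat = x / n
    z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)   # standard error evaluated at p0, not at p_hat
    return z, 2 * (1 - norm.cdf(abs(z)))

print(score_test_binomial(x=36, n=100, p0=0.5))     # Z is about -2.8, p-value about 0.005
```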

Figure 1: Illustration of Wald, LRT and Rao tests.

11.4 Bayes factor

To interpret Neyman-Pearson in a Bayesian formulation, consider averaging instead of maximization: reject $H_0$ if
$$\frac{\int_{\Theta_A} L(\theta)\,\pi_A(\theta)\, d\theta}{\int_{\Theta_0} L(\theta)\,\pi_0(\theta)\, d\theta} \ge c,$$
where $\pi_A$ and $\pi_0$ are priors over $\Theta_A$ and $\Theta_0$.
