theory short 09

Upload: rheza-mp

Post on 16-Oct-2015

25 views

Category:

Documents


2 download

DESCRIPTION

statistical theory

TRANSCRIPT

  • Statistical Theory

    Prof. Gesine Reinert

    November 23, 2009

  • Aim: To review and extend the main ideas in Statistical Inference, bothfrom a frequentist viewpoint and from a Bayesian viewpoint. This courseserves not only as background to other courses, but also it will provide abasis for developing novel inference methods when faced with a new situationwhich includes uncertainty. Inference here includes estimating parametersand testing hypotheses.

    Overview

    Part 1: Frequentist Statistics

    Chapter 1: Likelihood, sufficiency and ancillarity. The Factoriza-tion Theorem. Exponential family models.

    Chapter 2: Point estimation. When is an estimator a good estima-tor? Covering bias and variance, information, efficiency. Methodsof estimation: Maximum likelihood estimation, nuisance parame-ters and profile likelihood; method of moments estimation. Biasand variance approximations via the delta method.

    Chapter 3: Hypothesis testing. Pure significance tests, signifi-cance level. Simple hypotheses, Neyman-Pearson Lemma. Testsfor composite hypotheses. Sample size calculation. Uniformlymost powerful tests, Wald tests, score tests, generalised likelihoodratio tests. Multiple tests, combining independent tests.

    Chapter 4: Interval estimation. Confidence sets and their con-nection with hypothesis tests. Approximate confidence intervals.Prediction sets.

    Chapter 5: Asymptotic theory. Consistency. Asymptotic nor-mality of maximum likelihood estimates, score tests. Chi-squareapproximation for generalised likelihood ratio tests. Likelihoodconfidence regions. Pseudo-likelihood tests.

    Part 2: Bayesian Statistics Chapter 6: Background. Interpretations of probability; the Bayesian

    paradigm: prior distribution, posterior distribution, predictivedistribution, credible intervals. Nuisance parameters are easy.

    1

  • Chapter 7: Bayesian models. Sufficiency, exchangeability. DeFinettis Theorem and its intepretation in Bayesian statistics.

    Chapter 8: Prior distributions. Conjugate priors. Noninformativepriors; Jeffreys priors, maximum entropy priors posterior sum-maries. If there is time: Bayesian robustness.

    Chapter 9: Posterior distributions. Interval estimates, asymp-totics (very short).

    Part 3: Decision-theoretic approach: Chapter 10: Bayesian inference as a decision problem. Deci-

    sion theoretic framework: point estimation, loss function, deci-sion rules. Bayes estimators, Bayes risk. Bayesian testing, Bayesfactor. Lindleys paradox. Least favourable Bayesian answers.Comparison with classical hypothesis testing.

    Chapter 11: Hierarchical and empirical Bayes methods. Hierar-chical Bayes, empirical Bayes, James-Stein estimators, Bayesiancomputation.

    Part 4: Principles of inference. The likelihood principle. The condi-tionality principle. The stopping rule principle.

    Books

    1. Bernardo, J.M. and Smith, A.F.M. (2000) Bayesian Theory. Wiley.

    2. Casella, G. and Berger, R.L. (2002) Statistical Inference. Second Edi-tion. Thomson Learning.

    3. Cox, D.R. and Hinkley, D.V. (1974) Theoretical Statistics. Chapmanand Hall.

    4. Garthwaite, P.H., Joliffe, I.T. and Jones, B. (2002) Statistical Inference.Second Edition. Oxford University Press.

    5. Leonard, T. and Hsu, J.S.J. (2001) Bayesian Methods. CambridgeUniversity Press.

    2

  • 6. Lindgren, B.W. (1993) Statistical Theory. 4th edition. Chapman andHall.

    7. OHagan, A. (1994) Kendalls Advanced Theory of Statistics. Vol 2B,Bayesian Inference. Edward Arnold.

    8. Young, G.A. and Smith, R.L. (2005) Essential of Statistical Inference.Cambridge University Press.

    Lecture take place Mondays 11-12 and Wednesdays 9-10. There will befour problem sheets. Examples classes are held Thursdays 12-1 in weeks 3,4, 6, and 8.

    While the examples classes will cover problems from the problem sheets,there may not be enough time to cover all the problems. You will benefitmost from the examples classes if you (attempt to) solve the problems on thesheet ahead of the examples classes.

    You are invited to hand in your work on the respective problem sheetson Tuesdays at 5 pm in weeks 3, 4, 6, and 8. Your marker is Eleni Frangou;there will be a folder at the departmental pigeon holes.

    Additional material may be published at http://stats.ox.ac.uk/~reinert/stattheory/stattheory.htm.

    The lecture notes may cover more material than the lectures.

    3

  • Part I

    Frequentist Statistics

    4

  • Chapter 1

    Likelihood, sufficiency andancillarity

    We start with data x1, x2, . . . , xn, which we would like to use to draw inferenceabout a parameter .Model: We assume that x1, x2, . . . , xn are realisations of some random vari-ables X1, X2, . . . , Xn, from a distribution which depends on the parameter.Often we use the model that X1, X2, . . . , Xn independent, identically dis-tributed (i.i.d.) from some fX(x, ) (probability density or probability massfunction). We then say x1, x2, . . . , xn is a random sample of size n fromfX(x, ) (or, shorter, from f(x, )).

    1.1 Likelihood

    If X1, X2, . . . , Xn i.i.d. f(x, ), then the joint density at x = (x1, . . . , xn)is

    f(x, ) =ni=1

    f(xi, ).

    Inference about given the data is based on the Likelihood functionL(,x) = f(x, ); often abbreviated by L(). In the i.i.d. model, L(,x) =n

    i=1 f(xi, ). Often it is more convenient to use the log likelihood `(,x) =logL(,x) (or, shorter, `()). Here and in future, the log denotes the naturallogarithm; elog x = x.

    5

  • Example: Normal distribution. Assume that x1, . . . , xn is a randomsample from N (, 2), where both and 2 are unknown parameters, R, 2 > 0. With = (, 2), the likelihood is

    L() =n

    i=1(2pi2)1/2 exp

    { 122

    (xi )2}

    = (2pi2)n/2 exp{ 1

    22

    ni=1(xi )2

    }and the log-likelihood is

    `() = n2

    log(2pi) n log 122

    ni=1

    (xi )2.

    Example: Poisson distribution. Assume that x1, . . . , xn is a randomsample from Poisson(), with unknown > 0; then the likelihood is

    L() =ni=1

    (e

    xi

    xi!

    )= en

    ni=1 xi

    ni=1

    (xi!)1

    and the log-likelihood is

    `() = n + log()ni=1

    xi ni=1

    log(xi!).

    1.2 Sufficiency

    Any function of X is a statistic. We often write T = t(X), where t is afunction. Some examples are the sample mean, the sample median, and theactual data. Usually we would think of a statistic as being some summaryof the data, so smaller in dimension than the original data.A statistic is sufficient for the parameter if it contains all information about that is available from the data: L(X|T ), the conditional distribution of Xgiven T , does not depend on .

    Factorisation Theorem (Casella + Berger, p.250) A statistic T = t(X)is sufficient for if and only if there exists functions g(t, ) and h(x) suchthat for all x and

    f(x, ) = g(t(x), )h(x).

    6

  • Example: Bernoulli distribution. Assume that X1, . . . , Xn are i.i.d.Bernoulli trials, Be(), so f(x, ) = x(1 )1x; let T = ni=1 Xi denotenumber of successes. Recall that T Bin(n, ), and so

    P (T = t) =

    (n

    t

    )t(1 )nt, t = 0, 1, . . . , n.

    Then

    P (X1 = x1, . . . , Xn = xn|T = t) = 0 forni=1

    xi 6= t,

    and forn

    i=1 xi = t,

    P (X1 = x1, . . . , Xn = xn|T = t) = P (X1 = x1, . . . , Xn = xn)P (T = t)

    =

    ni=1

    (xi(1 )(1xi))(

    nt

    )t(1 )nt

    =t(1 )nt(nt

    )t(1 )nt =

    (n

    t

    )1.

    This expression is independent of , so T is sufficient for .

    Alternatively, the Factorisation Theorem gives

    f(x, ) =ni=1

    (xi(1 )(1xi)) = ni=1 xi(1 )nni=1 xi

    = g(t(x), )h(x)

    with t =n

    i=1 xi; g(t, ) = t(1)nt and h(x) = 1, so T = t(X) is sufficient

    for .

    Example: Normal distribution. Assume that X1, . . . , Xn are i.i.d. N (, 2); put x = 1

    n

    ni=1 xi and s

    2 = 1n1

    ni=1(xi x)2, then

    f(x, ) = (2pi2)n/2 exp

    { 1

    22

    ni=1

    (xi )2}

    = exp

    {n(x )

    2

    22

    }(2pi2)

    n2 exp

    {(n 1)s

    2

    22

    }.

    7

  • If 2 is known: = , t(x) = x, and g(t, ) = exp{n(x)2

    22

    }, so X is

    sufficient.If 2 is unknown: = (, 2), and f(x, ) = g(x, s2, ), so (X,S2) is sufficient.

    Example: Poisson distribution. Assume that x1, . . . , xn are a randomsample from Poisson(), with unknown > 0. Then

    L() = enni=1 xi

    ni=1

    (xi!)1.

    and (exercise)

    t(x) =

    g(t, ) =

    h(x) =

    Example: order statistics. Let X1, . . . , Xn be i.i.d.; the order statis-tics are the ordered observations X(1) X(2) X(n). Then T =(X(1), X(2), , X(n)) is sufficient.

    1.2.1 Exponential families

    Any probability density function f(x|) which is written in the form

    f(x, ) = exp

    {ki=1

    cii()hi(x) + c() + d(x),

    }, x X ,

    where c() is chosen such thatf(x, ) dx = 1, is said to be in the k-

    parameter exponential family. The family is called regular if X does notdepend on ; otherwise it is called non-regular.

    Examples include the binomial distribution, Poisson distributions, normaldistributions, gamma (including exponential) distributions, and many more.

    Example: Binomial (n, ). For x = 0, 1, ..., n,

    f(x; ) =

    (n

    x

    )x(1 )nx

    = exp

    {x log

    (

    1 )

    + log

    ((n

    x

    ))+ n log(1 )

    }.

    8

  • Choose k = 1 and

    c1 = 1

    1() = log

    (

    1 )

    h1(x) = x

    c() = n log(1 )d(x) = log

    ((n

    x

    ))X = {0, . . . , n}.

    Fact: In k-parameter exponential family models,

    t(x) = (n,nj=1

    h1(xj), . . . ,nj=1

    hk(xj))

    is sufficient.

    1.2.2 Minimal sufficiency

    A statistic T which is sufficient for is minimal sufficient for is if can beexpressed as a function of any other sufficient statistic. To find a minimalsufficient statistic: Suppose f(x,)

    f(y,)is constant in if and only if

    t(x) = t(y),

    then T = t(X) is minimal sufficient (see Casella + Berger p.255).In order to avoid issues when the density could be zero, it is the case that

    if for any possible values for x and y, we have that the equation

    f(x, ) = (x,y)f(y, ) for all

    implies that t(x) = t(y), where is a function which does not depend on ,then T = t(X) is minimal sufficent for .

    Example: Poisson distribution. Suppose that X1, . . . , Xn are i.i.d.Po(), then f(x, ) = en

    ni=1 xi

    ni=1(xi!)

    1 and

    f(x, )

    f(y, )=

    ni=1 xi

    ni=1 yi

    ni=1

    yi!

    xi!,

    9

  • which is constant in if and only if

    ni=1

    xi =ni=1

    yi;

    so T =n

    i=1 Xi is minimal sufficient (as is X). Note: T =n

    i=1 X(i) is afunction of the order statistic.

    1.3 Ancillary statistic

    If a(X) is a statistics whose distribution does not depend on , it is calledan ancillary statistic.

    Example: Normal distribution. Let X1, . . . , Xn be i.i.d. N (, 1).Then T = X2 X1 N (0, 2) has a distribution which does not depend on; it is ancillary.

    When a minimal sufficient statistic T is of larger dimension than , thenthere will often be a component of T whose distribution is independent of .

    Example: some uniform distribution (Exercise). Let X1, . . . , Xn bei.i.d. U [ 1

    2, + 1

    2] then

    (X(1), X(n)

    )is minimal sufficient for , as is

    (S,A) =

    (1

    2(X(1) +X(n)), X(n) X(1)

    ),

    and the distribution of A is independent of , so A is an ancillary statistic.Indeed, A measures the accuracy of S; for example, if A = 1 then S = withcertainty.

    10

  • Chapter 2

    Point Estimation

    Recall that we assume our data x1, x2, . . . , xn to be realisations of randomvariables X1, X2, . . . , Xn from f(x, ). Denote the expectation with respectto f(x, ) by E, and the variance by Var.

    When we estimate by a function t(x1, . . . , xn) of the data, this is calleda point estimate; T = t(X1, . . . , Xn) = t(X) is called an estimator (random).For example, the sample mean is an estimator of the mean.

    2.1 Properties of estimators

    T is unbiased for if E(T ) = for all ; otherwise T is biased. The bias ofT is

    Bias(T ) = Bias(T ) = E(T ) .

    Example: Sample mean, sample variance. Suppose that X1, . . . , Xnare i.i.d. with unknown mean ; unknown variance 2. Consider the estimateof given by the sample mean

    T = X =1

    n

    ni=1

    Xi.

    Then

    E,2(T ) =1

    n

    ni=1

    = ,

    11

  • so the sample mean is unbiased. Recall that

    V ar,2(T ) = V ar,2(X) = E,2{(X )2)} = 2

    n.

    If we estimate 2 by

    S2 =1

    n 1ni=1

    (Xi X)2,

    then

    E,2(S2)

    =1

    n 1ni=1

    E,2{(Xi + X)2}

    =1

    n 1ni=1

    {E,2{(Xi )2}+ 2E,2(Xi )(X)

    +E,2{(X )2}}

    =1

    n 1ni=1

    2 2 nn 1E,2{(X )

    2}+ nn 1E,2{(X )

    2}

    = 2(

    n

    n 1 2

    n 1 +1

    n 1)

    = 2,

    so S2 is an unbiased estimator of 2. Note: the estimator 2 = 1n

    ni=1(Xi

    X)2 is not unbiased.

    Another criterion for estimation is a small mean square error (MSE);the MSE of an estimator T is defined as

    MSE(T ) = MSE(T ) = E{(T )2} = V ar(T ) + (Bias(T ))2.

    Note: MSE(T ) is a function of .

    Example: 2 has smaller MSE than S2 (see Casella and Berger, p.304)but is biased.

    If one has two estimators at hand, one being slightly biased but havinga smaller MSE than the second one, which is, say, unbiased, then one may

    12

  • well prefer the slightly biased estimator. Exception: If the estimate is to becombined linearly with other estimates from independent data.

    The efficiency of an estimator T is defined as

    Efficiency(T ) =Var(T0)

    Var(T ),

    where T0 has minimum possible variance.

    Theorem: Cramer-Rao Inequality, Cramer-Rao lower bound:Under regularity conditions on f(x, ), it holds that for any unbiased T ,

    Var(T ) (In())1,where

    In() = E

    [(`(,X)

    )2]is the expected Fisher information of the sample.

    Thus, for any unbiased estimator T ,

    Efficiency(T ) =1

    In()Var(T ).

    Assume that T is unbiased. T is called efficient (or a minimum varianceunbiased estimator) if it has the minimum possible variance. An unbiasedestimator T is efficient if Var(T ) = (In())

    1.

    Often T = T (X1, . . . , Xn) is efficient as n: then it is called asymp-totically efficient.

    The regularity conditions are conditions on the partial derivatives off(x, ) with respect to ; and the domain may not depend on ; for exampleU [0, ] violates the regularity conditions.

    Under more regularity: the first three partial derivatives of f(x, ) withrespect to are integrable with respect to x; and again the domain may notdepend on ; then

    In() = E

    [

    2`(,X)

    2

    ].

    13

  • Notation: We shall often omit the subscript in In(), when it is clearwhether we refer to a sample of size 1 or of size n. For a random sample,

    In() = nI1().

    Calculation of the expected Fisher information:

    In() = E

    [(`(,X)

    )2]

    =

    f(x, )

    [( log f(x, )

    )2]dx

    =

    f(x, )

    [1

    f(x, )

    (f(x, )

    )]2dx

    =

    1

    f(x, )

    [(f(x, )

    )2]dx.

    Example: Normal distribution, known variance For a random sam-ple X from N (, 2), where 2 is known, and = ,

    `() = n2

    log(2pi) n log 122

    ni=1

    (xi )2;

    `

    =

    1

    2

    ni=1

    (xi ) = n2

    (x )

    and

    In() = E

    [(`(,X

    )2]

    =n2

    4E(X )2 = n

    2.

    Note: Var(X) =2

    n, so X is an efficient estimator for . Also note that

    2`

    2= n

    2;

    14

  • a quicker method yielding the expected Fisher information.

    In future we shall often omit the subscript in the expectation and inthe variance.

    Example: Exponential family models in canonical formRecall that one-parameter (i.e., scalar ) exponential family density has theform

    f(x; ) = exp {()h(x) + c() + d(x)} , x X .

    Choosing and x to make () = and h(x) = x is called the canonicalform;

    f(x; ) = exp{x+ c() + d(x)}.For the canonical form

    EX = () = c(), and VarX = 2() = c().Exercise: Prove the mean and variance results by calculating the moment-

    generating function Eexp(tX) = exp{c()c(t+)}. Recall that you obtainmean and variance by differentiating the moment-generating function (howexactly?)

    Example: Binomial (n, p). Above we derived the exponential familyform with

    c1 = 1

    1(p) = log

    (p

    1 p)

    h1(x) = x

    c(p) = n log(1 p)d(x) = log

    ((n

    x

    ))X = {0, . . . , n}.

    To write the density in canonical form we put

    = log

    (p

    1 p)

    15

  • (this transformation is called the logit transformation); then

    p =e

    1 + e

    and

    () =

    h(x) = x

    c() = n log (1 + e)d(x) = log

    ((n

    x

    ))X = {0, . . . , n}

    gives the canonical form. We calculate the mean

    c() = n e

    1 + e= () = np

    and the variance

    c() = n{

    e

    1 + e e

    2

    (1 + e)2

    }= 2() = np(1 p).

    Now suppose X1, . . . , Xn are i.i.d., from the canonical density. Then

    `() =

    xi + nc() +

    d(xi),

    `() =

    xi + nc() = n(x+ c()).

    Since `() = nc(), we have that In() = E(`()) = nc().

    Example: Binomial (n, p) and

    = log

    (p

    1 p)

    then (exercise)I1() =

    16

  • 2.2 Maximum Likelihood Estimation

    Now may be a vector. A maximum likelihood estimate, denoted (x), is avalue of at which the likelihood L(,x) is maximal. The estimator (X)is called MLE (also, (x) is sometimes called mle). An mle is a parametervalue at which the observed sample is most likely.

    Often it is easier to maximise log likelihood: if derivatives exist, thenset first (partial) derivative(s) with respect to to zero, check that second(partial) derivative(s) with respect to less than zero.

    An mle is a function of a sufficient statistic:

    L(,x) = f(x, ) = g(t(x), )h(x)

    by the factorisation theorem, and maximizing in depends on x only throught(x).

    An mle is usually efficient as n.

    Invariance property: An mle of a function () is () (Casella + Bergerp.294). That is, if we define the likelihood induced by as

    L(, x) = sup:()=

    L(, x),

    then one can calculate that for = (),

    L(, x) = L(, x).

    Examples: Uniforms, normal

    1. X1, . . . , Xn i.i.d. U [0, ]:

    L() = n1[x(n),)(),

    where x(n) = max1in xi; so = X(n)

    2. X1, . . . , Xn i.i.d. U [ 12 , + 12 ], then any [x(n) 12 , x(1) + 12 ]maximises the likelihood (Exercise)

    3. X1, . . . , Xn i.i.d. N (, 2), then (Exercise) = X, 2 = 1nn

    i=1(XiX)2. So 2 is biased, but Bias(2) 0 as n.

    17

  • Iterative computation of MLEs

    Sometimes the likelihood equations are difficult to solve. Suppose (1) isan initial approximation for . Using Taylor expansion gives

    0 = `() `((1)) + ( (1))`((1)),

    so that

    (1) `((1))

    `((1)).

    Iterate this procedure (this is called the Newton-Raphson method) to get

    (k+1) = (k) (`((k)))1`((k)), k = 2, 3, . . . ;

    continue the iteration until |(k+1) (k)| < for some small .As E

    {`((1))

    }= In(

    (1)) we could instead iterate

    (k+1) = (k) + I1n ((k))`((k)), k = 2, 3, . . .

    until |(k+1) (k)| < for some small . This is Fishers modification of theNewton-Raphson method.

    Repeat with different starting values to reduce the risk of finding just alocal maximum.

    Example: Suppose that we observe x from the distributionBinomial(n, ). Then

    `() = x ln() + (n x) ln(1 ) + log(n

    x

    )`() =

    x

    n x

    1 =x n(1 )

    `() = x2 n x

    (1 )2I1() =

    n

    (1 )

    18

  • Assume that n = 5, x = 2, = 0.01 (in practice rather = 105); andstart with an initial guess (1) = 0.55. Then Newton-Raphson gives

    `((1)) 3.03(2) (1) (`((1)))1`((1)) 0.40857

    `((2)) 0.1774(3) (2) (`((2)))1`((2)) 0.39994.

    Now |(3) (2)| < 0.01 so we stop.Using instead Fisher scoring gives

    I11 ()`() =

    x nn

    =x

    n

    and so

    + I11 ()`() =

    x

    n

    for all . To compare: analytically, = xn

    = 0.4.

    2.3 Profile likelihood

    Often = (, ), where contains the parameters of interest and containsthe other unknown parameters: nuisance parameters. Let be the MLEfor for a given value of . Then the profile likelihood for is

    LP () = L(, )

    (in L(, ) replace by ); the profile log-likelihood is `P () = log[LP ()].For point estimation, maximizing LP () with respect to gives the same

    estimator as maximizing L(, ) with respect to both and (but possiblydifferent variances)

    Example: Normal distribution. Suppose that X1, . . . , Xn are i.i.d.from N (, 2) with and 2 unknown. Given , 2 = (1/n)

    (xi )2,

    and given 2, 2 = x. Hence the profile likelihood for is

    LP () = (2pi2)n/2 exp

    { 1

    22

    ni=1

    (xi )2}

    =

    [2pie

    n

    (xi )2

    ]n/2,

    19

  • which gives = x; and the profile likelihood for 2 is

    LP (2) = (2pi2)n/2 exp

    { 1

    22

    (xi x)2

    },

    gives (Exercise)2 =

    2.4 Method of Moments (M.O.M)

    The idea is to match population moments to sample moments in order toobtain estimators. Suppose that X1, . . . , Xn are i.i.d. f(x; 1, . . . , p).Denote by

    k = k() = E(Xk)

    the kth moment and by

    Mk =1

    n

    (Xi)

    k

    the kth sample moment. In general, k = k(1, . . . , p). Solve the equation

    k() = Mk

    for k = 1, 2, . . . , until there are sufficient equations to solve for 1, . . . , p(usually p equations for the p unknowns). The solutions 1, . . . , p are themethod of moments estimators.

    They are often not as efficient as MLEs, but may be easier to calculate.They could be used as initial estimates in an iterative calculation of MLEs.

    Example: Normal distribution. Suppose that X1, . . . , Xn are i.i.d.N (, 2); and 2 are unknown. Then

    1 = ; M1 = X

    Solve = X

    so = X.

    Furthermore

    2 = 2 + 2; M2 =

    1

    n

    ni=1

    X2i

    20

  • so solve

    2 + 2 =1

    n

    ni=1

    X2i ,

    which gives

    2 = M2 M21 =1

    n

    ni=1

    (Xi X)2

    (which is not unbiased for 2).

    Example: Gamma distribution. Suppose that X1, . . . , Xn are i.i.d.(, );

    f(x;, ) =1

    ()x1ex for x 0.

    Then 1 = EX = / and

    2 = EX2 = /2 + (/)2.

    Solving

    M1 = /, M2 = /2 + (/)2

    for and gives

    = X2/

    [n1ni=1

    (Xi X)2], and = X/

    [n1ni=1

    (Xi X)2].

    2.5 Bias and variance approximations: the

    delta method

    Sometimes T is a function of one or more averages whose means and variancescan be calculated exactly; then we may be able to use the following simpleapproximations for mean and variance of T :

    Suppose T = g(S) where ES = and VarS = V . Taylor expansion gives

    T = g(S) g() + (S )g().

    21

  • Taking the mean and variance of the r.h.s.:

    ET g(), VarT [g()]2V.

    If S is an average so that the central limit theorem (CLT) applies to it,i.e., S N(, V ), then

    T N(g(), [g()]2V )

    for large n.If V = v(), then it is possible to choose g so that T has approximately

    constant variance in : solve [g()]2v() = constant.

    Example: Exponential distribution. X1, . . . , Xn i.i.d. exp( 1),mean . Then S = X has mean and variance 2/n. If T = logX theng() = log(), g() = 1, and so VarT n1, which is independent of :this is called a variance stabilization.

    If the Taylor expansion is carried to the second-derivative term, we obtain

    ET g() + 12V g().

    In practice we use numerical estimates for and V if unknown.

    When S, are vectors (V a matrix), with T still a scalar: Let(g()

    )i

    =g/i and let g

    () be the matrix of second derivatives, then Taylor expan-sion gives

    VarT [g()]TV g()and

    ET g() + 12

    trace[g()V ].

    2.5.1 Exponential family models in canonical form andasymptotic normality of the MLE

    Recall that a one-parameter (i.e., scalar ) exponential family density incanonical form can be written as

    f(x; ) = exp{x+ c() + d(x)},

    22

  • and EX = () = c(), as well as VarX = 2() = c(). SupposeX1, . . . , Xn are i.i.d., from the canonical density. Then

    `() =

    xi + nc() = n(x+ c()).

    Since () = c(),`() = 0 x = (),

    and we have already calculated that In() = E(`()) = nc(). If isinvertible, then

    = 1(x).

    The CLT applies to X so, for large n,

    X N ((),c()/n).With the delta-method, S N (a, b) implies that

    g(S) N (g(a), b[g(a)]2)for continuous g, and small b. For S = X, with g() = 1() we haveg() = ((1())1, thus

    N (, I1n ()) :the M.L.E. is asymptotically normal.

    Note: The approximate variance equals the Cramer-Rao lower bound:quite generally the MLE is asymptotically efficient.

    Example: Binomial(n, p). With = log(

    p1p

    )we have () = n e

    1+e,

    and we calculate

    1(t) = log( t

    n

    1 tn

    ).

    Note that here we have a sample, x, of size 1. This gives

    = log

    ( xn

    1 xn

    ),

    as expected from the invariance of mles. We hence know that is approxi-mately normally distributed.

    23

  • 2.6 Excursions

    2.6.1 Minimum Variance Unbiased Estimation

    There is a pretty theory about how to construct minimum variance unbiasedestimators (MVUE) based on sufficient statistics. The key underlying resultis the Rao-Blackwell Theorem (Casella+Berger p.316). We do not have timeto go into detail during lectures, but you may like to read up on it.

    2.6.2 A more general method of moments

    Consider statistics of the form 1n

    ni=1 h(Xi). Find the expected value as a

    function of 1

    n

    ni=1

    Eh(Xi) = r().

    Now obtain an estimate for by solving r() = 1n

    ni=1 h(Xi) for .

    2.6.3 The delta method and logistic regresstion

    For logistic regression, the outcome of an experiment is 0 or 1, and theoutcome may depend on some explanatory variables. We are interested in

    P (Yi = 1|x) = pi(x|).The outcome for each experiment is in [0, 1]; in order to apply some normalregression model we use the logit transform,

    logit(p) = log

    (p

    1 p)

    which is now spread over the whole real line. The ratio p1p is also called the

    odds. A (Generalised linear) model then relates the logit to the regressors ina linear fashion;

    logit(pi(x|)) = log(

    pi(x|)1 pi(x|)

    )= xT.

    The coefficients describe how the odds for pi change with change in theexplanatory variables. The model can now be treated like an ordinary linear

    24

  • regression, X is the design matrix, is the vector of coefficients. Transform-ing back,

    P (Yi = 1|x) = exp(xT)/(

    1 + exp(xT)).

    The invariance property gives that the MLE of pi(x|), for any x, is pi(x|),where is the MLE obtained in the ordinary linear regression from a sampleof responses y1, . . . , yn with associated covariate vectors x1, . . . , xn. We knowthat is approximately normally distributed, and we would like to inferasymptotic normality of pi(x|).

    (i) If is scalar: Calculate that

    pi(xi|) =

    exp(xi)

    /(1 + exp(xi))

    = xiexi (1 + exp(xi))

    1 (1 + exp(xi))2 xiexiexi= xipi(xi|) xi(pi(xi|))2= xipi(xi|)(1 pi(xi|))

    and the likelihood is

    L() =ni=1

    pi(xi|) =ni=1

    exp(xi)/

    (1 + exp(xi)) .

    Hence the log likelihood has derivative

    `() =ni=1

    1

    pi(xi|)xipi(xi|)(1 pi(xi|))

    =ni=1

    xi(1 pi(xi|))

    so that

    `() = ni=1

    x2ipi(xi|))(1 pi(xi|)).

    Thus N (, I1n())

    where In() =x2ipii(1 pii) with pii = pi(xi|).

    25

  • So now we know the parameters of the normal distribution which approx-imates the distribution of . The delta method with g() = ex/(1 + ex),gives

    g() = xg()(1 g())

    and hence we conclude that pi = pi(x|) N (pi, pi2(1 pi)2x2I1()).(ii) If is vector: Similarly it is possible to calculate that N (, I1n ())

    where [In()]kl = E (2`/kl). The vector version of the delta methodthen gives

    pi(x|) N (pi, pi2(1 pi)2xT I1()x)with pi = pi(x|) and In() = XTRX. Here X is the design matrix, and

    R = Diag (pii(1 pii), i = 1, . . . , n)

    where pii = pi(xi|). Note that this normal approximation is likely to be poorfor pi near zero or one.

    26

  • Chapter 3

    Hypothesis Testing

    3.1 Pure significance tests

    We have data x = (x1, . . . , xn) from f(x, ), and a hypothesis H0 whichrestricts f(x, ). We would like to know: Are the data consistent with H0?

    H0 is called the null hypothesis. It is called simple if it completely specifiesthe density of x; it is called composite otherwise.

    A pure significance test is a means of examining whether the data areconsistent with H0 where the only distribution of the data that is explicitlyformulated is that under H0. Suppose that for a test statistic T = t(X),the larger t(x), the more inconsistent the data with H0. For simple H0, thep-value of x is then

    p = P (T t(x)|H0).Small p indicate more inconsistency with H0.

    For composite H0: If S is sufficient for then the distribution of Xconditional on S is independent of ; when S = s, the p-value of x is

    p = P (T t(x)|H0;S = s).

    Example: Dispersion of Poisson distribution. Let H0: X1, . . . , Xni.i.d. Poisson(), with unknown . Under H0, Var(Xi) = E(Xi) = andso we would expect T = t(X) = S2/X to be close to 1. The statistic T isalso called the dispersion index.

    27

  • We suspect that the Xis may be over-dispersed, that is, Var(Xi) > EXi:discrepancy with H0 would then correspond to large T . Recall that X is suf-ficient for the Poisson distribution; the p-value under the Poisson hypothesisis then p = P (S2/X t(x)|X = x;H0), which makes p independent of theunknown . Given X = x and H0 we have that

    S2/X 2n1/(n 1)(see Chapter 5 later) and so the p-value of the test satisfies

    p P (2n1/(n 1) t(x)).

    Possible alternatives toH0 guide the choice and interpretation of T . Whatis a best test?

    3.2 Simple null and alternative hypotheses:

    The Neyman-Pearson Lemma

    The general setting here is as follows: we have a random sample X1, . . . , Xnfrom f(x; ), and two hypotheses:

    a null hypothesis H0 : 0

    an alternative hypothesis H1 : 1

    where 1 = \ 0; denotes the whole parameter space. We want tochoose a rejection region or critical region R such that

    reject H0 X R.Now suppose that H0 : = 0, and H1 : = 1 are both simple. The

    Type I error is: reject H0 when it is true;

    = P (reject H0|H0),this is also known as size of the test. The Type II error is: accept H0 whenit is false;

    = P (accept H0|H1).

    28

  • The power of the test is 1 = P (accept H1|H1).Usually we fix (e.g., = 0.05, 0.01, etc.), and we look for a test which

    minimizes : this is called a most powerful or best test of size .

    Intuitively: we reject H0 in favour of H1 if likelihood of 1 is much largerthan likelihood of 0, given the data.

    Neyman-Pearson Lemma: (see, e.g., Casella and Berger, p.366). Themost powerful test at level of H0 versus H1 has rejection region

    R =

    {x :

    L(1; x)

    L(0; x) k

    }where the constant k is chosen so that

    P (X R|H0) = .This test is called the the likelihood ratio (LR) test.

    Often we simplify the condition L(1; x)/L(0; x) k tot(x) c,

    for some constant c and some statistic t(x); determine c from the equation

    P (T c|H0) = ,where T = t(X); then the test is reject H0 if and only if T c. For datax the p-value is p = P (T t(x)|H0).

    Example: Normal means, one-sided. Suppose that we have a ran-dom sample X1, . . . , Xn N(, 2), wioth 2 known; let 0 and 1 be given,and assume that 1 > 0. We would like to test H0 : = 0 against H1 : = 1. Using the Neyman-Pearson Lemma, the best test is a likelihood ratiotest. To calculate the test, we note that

    L(1; x)

    L(0; x) k `(1; x) `(0; x) log k

    [

    (xi 1)2 (xi 0)2] 22 log k

    (1 0)x k x c (since 1 > 0),

    29

  • where k, c are constants, independent of x. Hence we choose t(x) = x, andfor size test we choose c so that P (X c|H0) = ; equivalently, such that

    P

    (X 0/n c 0/n

    H0) = .Hence we want

    (c 0)/(/n) = z1,

    (where (z1) = 1 with being standard normal c.d.f.), i.e.c = 0 + z1/

    n.

    So the most powerful test of H0 versus H1 at level becomes reject H0 ifand only if X 0 + z1/

    n.

    Recall the notation for standard normal quantiles: If Z N (0, 1) thenP(Z z) = and P(Z z()) = ,

    and note that z() = z1. Thus

    P(Z z1) = 1 (1 ) = .

    Example: Bernoulli, probability of success, one-sided. Assumethat X1, . . . , Xn are i.i.d. Bernoulli() then L() =

    r(1 )nr where r =xi. We would like to test H0 : = 0 against H1 : = 1, where 0 and 1

    are known, and 1 > 0. Now 1/0 > 1, (1 1)/(1 0) < 1, and

    L(1; x)

    L(0; x)=

    (10

    )r (1 11 0

    )nrand so L(1; x)/L(0; x) k r r.

    So the best test rejects H0 for large r. For any given critical value rc,

    =n

    j=rc

    (n

    j

    )j0(1 0)nj

    gives the p-value if we set rc = r(x) =xi, the observed value.

    30

  • Note: The distribution is discrete, so we may not be able to achieve alevel test exactly (unless we use additional randomization). For example,if R Binomial(10, 0.5), then P (R 9) = 0.011, and P (R 8) = 0.055,so there is no c such that P (R c) = 0.05. A solution is to randomize: IfR 9 reject the null hypothesis, if R 7 accept the null hypothesis, and ifR = 8 flip a (biased) coin to achieve the exact level of 0.05.

    3.3 Composite alternative hypotheses

    Suppose that scalar, H0 : = 0 is simple, and we test against a compositealternative hypotheses; this could be one-sided:

    H1 : < 0 or H+1 : > 0;

    or a two-sided alternative H1 : 6= 0. The power function of a test dependson the true parameter , and is defined as

    power() = P (X R|);the probability of rejecting H0 as a function of the true value of the param-eter ; it depends on , the size of the test. Its main uses are comparingalternative tests, and choosing sample size.

    3.3.1 Uniformly most powerful tests

    A test of size is uniformly most powerful (UMP) if its power function issuch that

    power() power()for all 1, where power() is the power function of any other size- test.

    Consider testingH0 againstH+1 . For exponential family problems, usually

    for any 1 > 0 the rejection region of the LR test is independent of 1. Atthe same time, the test is most powerful for every single 1 which is largerthan 0. Hence the test derived for one such value of 1 is UMP for H0 versusH+1 .

    Example: normal mean, composite one-sided alternative. Sup-pose that X1, . . . , Xn N(, 2) are i.i.d., with 2 known. We want to test

    31

  • H0 : = 0 against H+1 : > 0. First pick an arbitrary 1 > 0. We have

    seen that the most powerful test of = 0 versus = 1 has a rejectionregion of the form

    X 0 + z1/n

    for a test of size . This rejection region is independent of 1, hence the testwhich rejects H0 when X 0 + z1/

    n is UMP for H0 versus H

    +1 . The

    power of the test is

    power() = P(X 0 + z1/

    n )

    = P(X 0 + z1/

    n X N(, 2/n))

    = P(X /n 0

    /n

    + z1 X N(, 2/n))

    = P(Z z1 0

    /n

    Z N(0, 1))= 1 (z1 ( 0)n/).

    The power increases from 0 up to at = 0 and then to 1 as increases.The power increases as increases.

    Sample size calculation in the Normal exampleSuppose want to be near-certain to reject H0 when = 0 + , say, and havesize 0.05. Suppose we want to fix n to force power() = 0.99 at = 0 + :

    0.99 = 1 (1.645 n/)so that 0.01 = (1.645 n/). Solving this equation (use tables) gives2.326 = 1.645 n/, i.e.

    n = 2(1.645 + 2.326)2/2

    is the required sample size.

    UMP tests are not always available. If not, options include a

    1. Wald test

    2. locally most powerful test (score test)

    3. generalised likelihood ratio test.

    32

  • 3.3.2 Wald tests

    The Wald test is directly based on the asymptotic normality of the m.l.e. = n, often N

    (, I1n ()

    )if is the true parameter. Also it is often

    true that asymptotically, we may replace by in the Fisher information,

    N (, I1n ()).So we can construct a test based on

    W =

    In()( 0) N (0, 1).

    If is scalar, squaring givesW 2 21,

    so equivalently we could use a chi-square test.For higher-dimensional we can base a test on the quadratic form

    ( 0)T In()( 0)which is approximately chi-square distributed in large samples.

    If we would like to test H0 : g() = 0, where g is a (scalar) differentiablefunction, then the delta method gives as test statistic

    W = g(){G()(In())1G()T}1g(),

    where G() = g()

    T.

    An advantage of the Wald test is that we do not need to evaluate thelikelihood under the null hypothesis, which can be awkward if the null hy-pothesis contains a number of restrictions on a multidimensional parameter.All we need is (an approximation) of , the maximum-likelihood-estimator.But there is also a disadvantage:

    Example: Non-invariance of the Wald test

    Suppose that is scalar and approximately N (, In()1)-distributed,then for testing H0 : = 0 the Wald statistic becomes

    In(),

    33

  • which would be approximately standard normal. If instead we tested H0 :3 = 0, then the delta method with g() = 3, so that g() = 32, gives

    V ar(g()) 94(In())1

    and as Wald statistic

    3

    In().

    3.3.3 Locally most powerful test (Score test)

    We consider first the problem to test H0 : = 0 against H1 : = 0 + . forsome small > 0. We have seen that the most powerful test has a rejectionregion of the form

    `(0 + ) `(0) k.Taylor expansion gives

    `(0 + ) `(0) + `(0)

    i.e.

    `(0 + ) `(0) `(0)

    .

    So a locally most powerful (LMP) test has as rejection region

    R =

    {x :

    `(0)

    k

    }.

    This is also called the score test : `/ is known as the score function.Under certain regularity conditions,

    E

    [`

    ]= 0, Var

    [`

    ]= In().

    As ` is usually a sum of independent components, so is `(0)/, and theCLT (Central Limit Theorem) can be applied.

    Example: Cauchy parameter. Suppose that X1, . . . , Xn is a randomsample from Cauchy (), having density

    f(x; ) =[pi(1 + (x )2)]1 for < x

  • Test H0 : = 0 against H+1 : > 0. Then

    `(0; x)

    = 2

    { xi 01 + (xi 0)2

    }.

    Fact: Under H0, the expression `(0; X)/ has mean 0, variance In(0) =n/2. The CLT applies, `(0; X)/ N (0, n/2) under H0, so for the LMPtest,

    P (N (0, n/2) k) = P(N (0, 1) k

    2

    n

    ) .

    This gives k z1n/2, and as rejection region with approximate size

    R =

    {x : 2

    ( xi 01 + (xi 0)2

    )>

    n

    2z1

    }.

    The score test has the advantage that we only need the likelihood underthe null hypothesis. It is also not generally invariant under reparametrisation.

    The multidimensional version of the score test is as follows: Let U =`/ be the score function, then the score statistic is

    UT In(0)1U.

    Compare with a chi-square distribution.

    3.3.4 Generalised likelihood ratio (LR) test

    For testing H0 : = 0 against H+1 : > 0, the generalised likelihood ratio

    test uses as rejection region

    R =

    {x :

    max0 L(; x)L(0; x)

    k}.

    If L has one mode, at the m.l.e. , then the likelihood ratio in the definitionof R is either 1, if 0, or L(; x)/L(0; x), if > 0 (and similarly forH1 , with fairly obvious changes of signs and directions of inequalities).

    The generalised LRT is invariant to a change in parametrisation.

    35

  • 3.4 Two-sided tests

    Test H0 : = 0 against H1 : 6= 0. If the one-sided tests of size havesymmetric rejection regions

    R+ = {x : t > c} and R = {x : t < c},

    then a two-sided test (of size 2) is to take the rejection region to

    R = {x : |t| > c};

    this test has as p-value p = P (|t(X)| t|H0).

    The two-sided (generalised) LR test uses

    T = 2 log

    [max L(; X)

    L(0; X)

    ]= 2 log

    [L(; X)

    L(0; X)

    ]

    and rejects H0 for large T .

    Fact: T 21 under H0 (to be seen in Chapter 5).

    Where possible, the exact distribution of T or of a statistic equivalentto T should be used.

    If is a vector: there is no such thing as a one-sided alternative hypoth-esis. For the alternative 6= 0 we use a LR test based on

    T = 2 log

    [L(; X)

    L(0; X)

    ].

    Under H0, T 2p where p = dimension of (see Chapter 5).

    For the score test we use as statistic

    `(0)T [In(0)]1`(0),

    where In() is the expected Fisher information matrix:

    [In()]jk = E[2`/jk].

    36

  • If the CLT applies to the score function, then this quadratic form is againapproximately 2p under H0 (see Chapter 5).

    Example: Pearsons Chi-square statistic. We have a random sampleof size n, with p categories; P(Xj = i) = pii, for i = 1, . . . , p, j = 1, . . . , n. Aspii = 1, we take = (pi1, . . . , pip1). The likelihood function is then

    pinii

    where ni = # observations in category i (soni = n). We think of

    n1, . . . , np as realisations of random counts N1, . . . , Np. The m.l.e. is =n1(n1, . . . , np1). Test H0 : = 0, where 0 = (pi1,0, . . . , pip1,0), againstH1 : 6= 0.

    The score vector is vector of partial derivatives of

    `() =

    p1i=1

    ni log pii + np log

    (1

    p1k=1

    pik

    )with respect to pi1, . . . , pip1:

    `

    pii=nipii np

    1p1k=1 pik .The matrix of second derivatives has entries

    2`

    piipik= niik

    pi2i np

    (1p1i=1 pii)2 ,where ik = 1 if i = k, and ik = 0 if i 6= k. Minus the expectation of this,using E0(Ni) = npii, gives

    In() = nDiag(pi11 , . . . , pi

    1p1) + n11

    Tpi1p ,

    where 1 is a (p 1)-dimensional vector of ones.Compute

    `(0)T [In(0)]1`(0) =pi=1

    (ni npii,0)2npii,0

    ;

    this statistic is called the chi-squared statistic, T say. The CLT for the scorevector gives that T 2p1 under H0.

    37

  • Note: the form of the chi-squared statistic is(Oi Ei)2/Ei

    where Oi and Ei refer to observed and expected frequencies in category i:This is known as Pearsons chi-square statistic.

    3.5 Composite null hypotheses

    Let = (, ), where is a nuisance parameter. We want a test whichdoes not depend on the unknown value of . Extending two of the previousmethods:

    3.5.1 Generalised likelihood ratio test: Composite nullhypothesis

    Suppose that we want to test H0 : 0 against H1 : 1 = \0. The(generalised) LR test uses the likelihood ratio statistic

    T =max

    L(; X)

    max0

    L(; X)

    and rejects H0 for large values of T .

    Now = (, ). Assuming that is scalar, test H0 : = 0 againstH+1 : > 0. The LR statistic T is

    T =max0,

    L(, )

    max

    L(0, )=

    max0

    LP ()

    LP (0),

    where LP () is the profile likelihood for . For H0 against H1 : 6= 0,

    T =max,

    L(, )

    max

    L(0, )=L(, )

    LP (0).

    Often (see Chapter 5):2 log T 2p

    38

  • where p is the dimension of .An important requirement for this approach is that the dimension of

    does not depend on n.

    Example: Normal distribution and Student t-test. Suppose thatX is a random sample of size n, from N(, 2), where both and areunknown; we would like to test H0 : = 0. Ignoring an irrelevant additiveconstant,

    `() = n log n(x )2 + (n 1)s222

    .

    Maximizing this w.r.t. with fixed gives

    `P () = n2

    log

    ((n 1)s2 + n(x )2

    n

    ).

    If our alternative is H+1 : > 0 then we maximize `P () over 0:if x 0 then the maximum is at = 0; if x > 0 then the maximum is at = x. So log T = 0 when x 0 and is

    n2

    log

    ((n 1)s2

    n

    )+n

    2log

    ((n 1)s2 + n(x 0)2

    n

    )=

    n

    2log

    (1 +

    n(x 0)2(n 1)s2

    )when x > 0. Thus the LR rejection region is of the form

    R = {x : t(x) c},where

    t(x) =

    n(x 0)

    s.

    This statistic is called Student-t statistic. Under H0, t(X) tn1, and for asize test set c = tn1,1; the p-value is p = P(tn1 t(x)). Here we usethe quantile notation P(tn1 tn1,1) = .

    The two-sided test of H0 against H1 : 6= 0 is easier, as unconstrainedmaxima are used. The size test has rejection region

    R = {x : |t(x)| tn1,1/2}.

    39

  • 3.5.2 Score test: Composite null hypothesis

    Now = (, ) with scalar, test H0 : = 0 against H+1 : > 0 or

    H1 : < 0. The score test statistic is

    T =`(0, 0; X)

    ,

    where 0 is the MLE for when H0 is true. Large positive values of Tindicate H+1 , and large negative values indicate H

    1 . Thus the rejection

    regions are of the form T k+ when testing against H+1 , and T k whentesting against H1 .

    Recall the derivation of the score test,

    `(0 + ) `(0) `(0)

    = T.

    If > 0, i.e. for H+1 , we reject if T is large; if < 0, i.e. for H1 , we reject if

    T is small.

    Sometimes the exact null distribution of T is available; more often weuse that T normal (by CLT, see Chapter 5), zero mean. To find theapproximate variance:

    1. compute In(0, )

    2. invert to I1n

    3. take the diagonal element corresponding to

    4. invert this element

    5. replace by the null hypothesis MLE 0.

    Denote the result by v, then Z = T/v N (0, 1) under H0.

    A considerable advantage is that the unconstrained MLE is not re-quired.

    Example: linear or non-linear model? We can extend the linearmodel Yj = (x

    Tj ) + j, where 1, . . . , n i.i.d. N (0, 2), to a non-linear model

    Yj = (xTj )

    + j

    40

  • with the same s. Test H0 : = 1: usual linear model, against, say,H1 : < 1. Here our nuisance parameters are

    T = (T , 2).

    Write j = xTj , and denote the usual linear model fitted values by j0 =

    xTj 0, where the estimates are obtained under H0. As Yj N (j, 2), wehave up to an irrelevant additive constant,

    `(, , ) = n log 122

    (yj j )2,

    and so`

    =

    1

    2

    (yj j )j log j,

    yielding that the null MLEs are the usual LSEs (least-square estimates),which are

    0 = (XTX)1XTY, 2 = n1

    (Yj xTj 0)2.

    So the score test statistic becomes

    T =1

    2

    (Yj j0)(j0 log j0).

    We reject H0 for large negative values of T .

    Compute the approximate null variance (see below):

    In(0, , ) =1

    2

    u2j ujxTj 0ujxj xjxTj 00 0 2n

    where uj = j log j. The (1, 1) element of the inverse of In has reciprocal(

    uTu uTX(XTX)1XTu) /2,where uT = (u1, . . . , un). Substitute j0 for j and

    2 for 2 to get v. Forthe approximate p-value calculate z = t/

    v and set p = (z).

    Calculation trick: To compute the (1, 1) element of the inverse of In above:if

    A =

    (a xT

    x B

    )41

  • where a is a scalar, x is an (n 1) 1 vector and B is an (n 1) (n 1)matrix, then (A1)11 = 1/(a xTB1x).

    Recall also:

    =

    e ln = ln e ln = ln .

    For the (1, 1)-entry of the information matrix, we calculate

    2`

    2=

    1

    2

    {(j log j)j log j + (yj j )j (log j)2

    },

    and as Yj N (j, 2) we have

    E

    {

    2`

    2

    }=

    1

    2

    j log j

    j log j =

    1

    2

    u2j ,

    as required. The off-diagonal terms in the information matrix can be calcu-lated in a similar way, using that

    = xTj .

    3.6 Multiple tests

    When many tests applied to the same data, there is a tendency for somep-values to be small: Suppose P1, . . . , Pm are the random P -values for mindependent tests at level (before seeing the data); for each i, suppose thatP (Pi ) = if the null hypothesis is true. But then the probability that atleast one of the null hypothesis is rejected if m independent tests are carriedout is

    1 P ( none rejected) = 1 (1 )m.

    Example: If = 0.05 and m = 10, then

    P ( at least one rejected |H0 true ) = 0.4012.

    Thus with high probability at least one significant result will be foundeven when all the null hypotheses are true.

    42

  • Bonferroni: The Bonferroni inequality gives that

    P (minPi |H0) mi=1

    P (Pi |H0) m.

    A cautious approach for an overall level is therefore to declare the mostsignificant of m test results as significant at level p only if min pi p/m.

    Example: If = 0.05 and m = 10, then reject only if the p-value is lessthan 0.005.

    3.7 Combining independent tests

    Suppose we have k independent experiments/studies for the same null hy-pothesis. If only the p-values are reported, and if we have continuous distribu-tion, we may use that under H0 each p-value is U [0, 1] uniformly distributed(Exercise). This gives that

    2ki=1

    logPi 22k

    (exactly) under H0, so

    pcomb = P(22k 2

    log pi).

    If each test is based on a statistic T such that Ti N (0, vi), then thebest combination statistic is

    Z =

    (Ti/vi)/

    v1i .

    If H0 is a hypothesis about a common parameter , then the best com-bination of evidence is

    `P,i(),

    and the combined test would be derived from this (e.g., an LR or score test).

    43

  • AdviceEven though a test may initially be focussed on departures in one direc-

    tion, it is usually a good idea not to totally disregard departures in the otherdirection, even if they are unexpected.

    Warning:Not rejecting the null hypothesis does not mean that the null hypothesis

    is true! Rather it means that there is not enough evidence to reject the nullhypothesis; the data are consistent with the null hypothesis.

    The p-value is not the probability that the null hypothesis is true.

    3.8 Nonparametric tests

    Sometimes we do not have a parametric model available, and the null hy-pothesis is phrased in terms of arbitrary distributions, for example concerningonly the median of the underlying distribution. Such tests are called non-parametric or distribution-free; treating these would go beyond the scope ofthese lectures.

    44

  • Chapter 4

    Interval estimation

    The goal for interval estimation is to specify the accurary of an estimate. A1 confidence set for a parameter is a set C(X) in the parameter space, depending only on X, such that

    P( C(X)) = 1 .

    Note: it is not that is random, but the set C(X) is.

    For a scalar we would usually like to find an interval

    C(X) = [l(X), u(X)]

    so that P( [l(X), u(X)]) = 1 . Then [l(X), u(X)] is an interval esti-

    mator or confidence interval for ; and the observed interval [l(x), u(x)] isan interval estimate. If l is or if u is +, then we have a one-sidedestimator/estimate. If l is , we have an upper confidence interval, if u is+, we have an lower confidence interval.

    Example: Normal, unknown mean and variance. Let X1, . . . , Xnbe a random sample from N (, 2), where both and 2 are unknown. Then(X )/(S/n) tn1 and so

    1 = P,2(X S/n

    tn1,1/2)= P,2(X tn1,1/2S/

    n X + tn1,1/2S/

    n),

    and so the (familiar) interval with end points

    X tn1,1/2S/n

    is a 1 confidence interval for .

    45

  • 4.1 Construction of confidence sets

    4.1.1 Pivotal quantities

    A pivotal quantity (or pivot) is a random variable t(X, ) whose distributionis independent of all parameters, and so it has the same distribution for all .

    Example: (X )/(S/n) in the example above has tn1-distribution ifthe random sample comes from N (, 2).

    We use pivotal quantities to construct confidence sets, as follows. Suppose is a scalar. Choose a, b such that

    P(a t(X, ) b) = 1 .

    Manipulate this equation to give P(l(X) u(X)) = 1 (if t is a

    monotonic function of ); then [l(X), u(X)] is a 1 confidence interval for.

    Example: Exponential random sample. Let X1, . . . , Xn be a ran-dom sample from an exponential distribution with unknown mean . Thenwe know that nX/ Gamma(n, 1). If the -quantile of Gamma(n, 1) isdenoted by gn, then

    1 = P(nX/ gn,) = P( nX/gn,).

    Hence [0, nX/gn,] is a 1 confidence interval for . Alternatively, we saythat nX/gn, is the upper 1 confidence limit for .

    4.1.2 Confidence sets derived from point estimators

    Suppose (X) is an estimator for a scalar , from a known distribution. Thenwe can take our confidence interval as

    [ a1, + b1]

    where a1 and b1 are chosen suitably.If N(, v), perhaps approximately, then for a symmetric interval

    choosea1 = b1 = z1/2

    v.

    46

  • Note: [ a1, + b1] is not immediately a confidence interval for if v depends on : in that case replace v() by v(), which is a furtherapproximation.

    4.1.3 Approximate confidence intervals

    Sometimes we do not have an exact distribution available, but normal ap-proximation is known to hold.

    Example: asymptotic normality of m.l.e. . We have seen that,under regularity, N (, I1()). If is scalar, then (under regularity)

    z1/2/In()

    is an approximate 1 confidence interval for .Sometimes we can improve the accuracy by applying (monotone) trans-

    formation of the estimator, using the delta method, and inverting the trans-formation to get the final result.

    As a guide line for transformations, in general a normal approximationshould be used on a scale where a quantity ranges over (,).

    Example: Bivariate normal distribution. Let (Xi, Yi), i = 1, . . . , n,be a random sample from a bivariate normal distribution, with unknownmean vector and covariance matrix. The parameter of interest is , thebivariate normal correlation. The MLE for is the sample correlation

    R =

    ni=1(Xi X)(Yi Y )n

    i=1(Xi X)2n

    i=1(Yi Y )2,

    whose range is [1, 1]. For large n,R N(, (1 2)2/n),

    using the expected Fisher information matrix to obtain an approximate vari-ance (see the section on asymptotic theory).

    But the distribution of R is very skewed, the approximation is poor un-less n is very large. For a variable whose range is (,), we use thetranformation

    Z =1

    2log[(1 +R)/(1R)];

    47

  • this transformation is called the Fisher z transformation. By the deltamethod,

    Z N(, 1/n)where = 1

    2log[(1 + )/(1 )]. So a 1 confidence interval for can

    be calculated as follows: for compute the interval limits Z z1/2/n,

    then transform these to the scale using the inverse transformation =(e2 1)/(e2 + 1).

    4.1.4 Confidence intervals derived from hypothesis tests

    Define C(X) to be the set of values of 0 for which H0 would not be rejectedin size- tests of H0 : = 0. Here the form of the 1 confidence setobtained depends on the alternative hypotheses.Example: to produce an interval with finite upper and lower limits use H1 : 6= 0; to find an upper confidence limit use H1 : < 0.

    Example: Normal, known variance, unknown mean. LetX1, . . . , Xnbe i.i.d. N(, 2), where 2 known. For H0 : = 0 versus H1 : 6= 0 theusual test has an acceptance region of the formX 0/n

    z1/2.So the values of 0 for which H0 is accepted are those in the interval

    [X z1/2/n,X + z1/2/

    n];

    this interval is a 100(1 )% confidence interval for .

    For H0 : = 0 versus H1 : < 0 the UMP test accepts H0 if

    X 0 z1/n

    i.e., if0 X + z1/

    n.

    So an upper 1 confidence limit for is X + z1/n.

    48

  • 4.2 Hypothesis test from confidence regions

    Conversely, we can also construct tests based on confidence interval:For H0 : = 0 against H1 : 6= 0, if C(X) is 100(1 )% two-sided

    confidence region for , then for a size test reject H0 if 0 6= C(X): Theconfidence region is the acceptance region for the test.

    If is a scalar: For H0 : = 0 against H1 : < 0, if C(X) is 100(1)%

    upper confidence region for , then for a size test reject H0 if 0 6= C(X).Example: Normal, known variance. Let X1, . . . , Xn N(, 2) be

    i.i.d., where 2 is known. For H0 : = 0 versus H1 : 6= 0 the usual100(1 )% confidence region is

    [X z1/2/n,X + z1/2/

    n],

    so reject H0 if X 0/n > z1/2.

    To test H0 : = 0 versus H1 : < 0: an upper 100(1)% confidence

    region is X + z1/2/n, so reject H0 if

    0 > X + z1/n

    i.e. ifX < 0 z1/

    n.

    We can also construct approximate hypothesis test based on approximateconfidence intervals. For example, we use the asymptotic normality of m.l.e.to derive a Wald test.

    4.3 Prediction Sets

    What is a set of plausible values for a future data value? A 1 predictionset for an unobserved random variable Xn+1 based on the observed dataX = (X1, . . . , Xn) is a random set P (X) for which

    P(Xn+1 P (X)) = 1 .

    49

  • Sometimes such a set can be derived by finding a prediction pivot t(X, Xn+1)whose distribution does not depend on . If a setR is such that P(t(X, Xn+1) R) = 1 , then a 1 prediction set is

    P (X) = {Xn+1 : t(X, Xn+1) R}.

    Example: Normal, unknown mean and variance. LetX1, . . . , Xn N(, 2) be i.i.d., where both and 2 are unknown. A possible predictionpivot is

    t(X, Xn+1) =Xn+1 XS

    1 + 1n

    .

    As X N(, 2n

    ) and Xn+1 N(, 2) is independent of X, it follows thatXn+1 X N(0, 2(1 + 1/n)), and so t(X, Xn+1) has tn1 distribution.Hence a 1 prediction interval is

    {Xn+1 : |t(X, Xn+1)| tn1,1/2}

    =

    {Xn+1 : X S

    1 +

    1

    ntn1,1/2 Xn+1 X + S

    1 +

    1

    ntn1,1/2

    }.

    50

  • Chapter 5

    Asymptotic Theory

    What happens as n?

    Let = (1, . . . , p) be the parameter of interest, let `() be the log-likelihood. Then `() is a vector, with jth component `/j, and In() isthe Fisher information matrix, whose (j, k) entry is E (2`/jk).

    5.1 Consistency

    A sequence of estimators Tn for , where Tn = tn(X1, . . . , Xn), is said to beconsistent if, for any > 0,

    P(|Tn | > ) 0 as n.

    In that case we also say that Tn converges to in probability.

    Example: the sample mean. Let Xn be and i.i.d. sample of size n,with finite variance, mean then, by the weak law of large numbers, Xn isconsistent for .

    Recall: The weak law of large numbers states: Let X1, X2, . . . be a se-quence of independent random variables with E(Xi) = and V ar(Xi) =

    2,and let Xn =

    1n

    ni=1Xi. Then, for any > 0,

    P (|Xn | > ) 0 as n.

    51

  • A sufficient condition for consistency is thatBias(Tn) 0 and Var(Tn)0 as n. (Use the Chebyshev inequality to show this fact).

    Subject to regularity conditions, MLEs are consistent.

    5.2 Distribution of MLEs

    Assume that X1, . . . , Xn are i.i.d. where scalar, and = (X) is the m.l.e.;assume that exists and is unique. In regular problems, is solution to thelikelihood equation `() = 0. Then Taylor expansion gives

    0 = `() `() + ( )`()and so

    `()In()

    ( ) `()In()

    . (5.1)

    For the left hand side of (5.1):

    `()/In() =

    Yi/(n)

    whereYi =

    2/2{log f(Xi; )}and = E(Yi). The weak law of large numbers gives that

    `()/In() 1in probability, as n. So

    `()In()

    .

    For the right hand side of (5.1),

    `() =

    /{log f(Xi; )}is the sum of i.i.d. random variables. By the CLT, `() is approximatelynormal with mean E[`()] = 0 and variance Var(`()) = In(), and hence`() N(0, In()) or

    `()/I() N(0, [In()]1). (5.2)

    52

  • Combining: N(0, [In()]1).

    Result:

    N(, [In()]1) (5.3)

    is the approximate distribution of the MLE.

    The above argument generalises immediately to being a vector: if has p components, say, then is approximately multivariate normal in p-dimensions with mean vector and covariance matrix [In()]

    1.In practice we often use In() in place of In().A corresponding normal approximation applies to any monotone trans-

    formation of by the delta method, as seen before.

    Back to our tests:1. Wald test2. Score test (LMP test)3. Generalised LR test.

    A normal approximation for the Wald test follows immediately from (5.3).

    5.3 Normal approximation for the LMP/score

    test

    Test H0 : = 0 against H+1 : > 0 (where is a scalar:) We reject H0

    if `() is large (in contrast, for H0 versus H1 : < 0, small values of `()

    would indicate H1 ). The score test statistic is `()/

    In(). From (5.2) we

    obtain immediately that

    `()/In() N (0, 1).

    To find an (approximate) rejection region for the test: use the normal ap-proximation at = 0, since the rejection region is calculated under theassumption that H0 is true.

    53

  • 5.4 Chi-square approximation for the gener-

    alised likelihood ratio test

    Test H0 : = 0 against H1 : 6= 0, where is scalar. Reject H0 ifL(; X)/L(; X) is large; equivalently, reject for large

    2 logLR = 2[`() `()].

    We use Taylor expansion around :

    `() `() ( )`() 12( )2`().

    Setting `() = 0, we obtain

    `() `() 12( )2`().

    By the consistency of , we may approximate

    `() In()to get

    2[`() `()] ( )2In() =(

    I1n ()

    )2.

    From (5.2), the asymptotic normality of , and as 21 variable is the squareof a N(0, 1) variable, we obtain that

    2 logLR = 2[`() `()] 21.We can calculate a rejection region for the test of H0 versus H1 under thisapproximation.

    For = (1, . . . , p), testing H0 : = 0 versus H1 : 6= 0, the dimensionof the normal limit for is p, hence the degrees of freedom of the relatedchi-squared variables are also p:

    `()T [In()]1`() 2pand

    2 logLR = 2[`() `()] 2p.

    54

  • 5.5 Profile likelihood

    Now = (, ), and is the MLE of when fixed. Recall that the profile

    log-likelihood is given by `P () = `(, ).

    5.5.1 One-sided score test

    We test H0 : = 0 against H+1 : > 0; we reject H0 based on large values

    of the score function T = `(, ). Again T has approximate mean zero.For the approximate variance of T , we expand

    T `(, ) + ( )`,(, ).From (5.1),

    I1n `.We write this as(

    )(I, I,I, I,

    )1(``

    ).

    Here ` = `/, ` = `/, `

    , =

    2`/, I, = E[`,] etc. Nowsubstitute I1,` and put

    `, I,.Calculate

    V (T ) I, + (I1,)2I2,I, 2I1,I,I,to get

    T ` I1,I,` N(0, 1/I,n ),where I,n = (I, I2,I1,)1 is the top left element of I1n . Estimatethe Fisher information by substituting the null hypothesis values. Finallycalculate the practical standardized form of T as

    Z =T

    Var(T ) `(, )[I,n (, )]1/2 N(0, 1).

    Similar results for vector-valued and vector-valued hold, with obviousmodifications, provided that the dimension of is fixed (i.e., independent ofthe sample size n).

    55

  • 5.5.2 Two-sided likelihood ratio tests

    Assume that and are scalars. We use similar arguments as above, in-cluding Taylor expansion, for

    2 logLR = 2[`(, ) `(, )

    ]to obtain

    2 logLR ( )2/I,n 21, (5.4)

    where I,n = (I,I2,I1,)1 is the top left element of I1n . The chi-squaredapproximation above follows from normal.

    (Details can be found in the additional material at the end of this section.)

    In general, if is p-dimensional, then 2 logLR 2p.

    Note: This result applies to the comparison of nested models, i.e., whereone model is a special case of the other, but it does not apply to the com-parison of non-nested models.

    5.6 Connections with deviance

    In GLMs, the deviance is usually 2 logLR for two nested models, one thesaturated model with a separate parameter for every response and the otherthe GLM (linear regression, log-linear model, etc.) For normal linear modelsthe deviance equals the RSS. The general chi-squared result above need notapply to the deviance, because has dimension np where p is the dimensionof the GLM.

    But the result does apply to deviance differences: Compare the GLM fitwith p parameters (comprising = (, )) to a special case with only q (< p)parameters (i.e., with omitted), then 2 logLR for that comparison is thedeviance difference, and in the null case (special case correct) 2pq.

    56

  • 5.7 Confidence regions

    We can construct confidence regions based on the asymptotic normal distri-butions of the score statistic and the MLE, or on the chi-square approxima-tion to the likelihood ratio statistic, or to the profile likelihood ratio statistic.These are equivalent in the limit n, but they may display slightly dif-ferent behaviour for finite samples.

    Example: Wald-type interval. Based on the asymptotic normalityof a p-dimensional , an approximate 1 confidence region is

    { : ( )T In()( ) 2p,1}.

    As an alternative to using In() we could use Jn(), the observed informationor observed precision evaluated at , where [Jn()]jk = 2`/jk.

    An advantage of the first type of region is that all values of insidethe confidence region have higher likelihood than all values of outside theregion.

    Example: normal sample, known variance. Let X1, . . . , Xn N(, 2) be i.i.d, with 2 known. The log LR difference is

    `(; x) `(; x) = 122

    [(xi x)2

    (xi )2

    ]=

    n(x )222

    ,

    so an approximate confidence interval is given by the values of satisfying

    n(X )222

    1221,1 or

    X /n z1/2,

    which gives the same interval as in Chapter 4. In this case the approximate2 result is, in fact, exact.

    5.8 Additional material: Derivation of (5.4)

    Assume and scalars, then(

    )(I, I,I, I,

    )1(``

    ).

    57

  • Similarly we have that

    I1,`.

    As` I,( ) + I,( ),

    we obtain

    + I,I1,( ).

    Taylor expansion gives

    2 logLR = 2[`(, ) `(, )

    ]= 2

    [`(, ) `(, )

    ] 2

    [`(, ) `(, )

    ] ( , )In( , )T (0, )In(0, )T .

    Substituting for gives

    2 logLR ( )2/I,n 21,

    where I,n = (I, I2,I1,)1 is the top left element of I1n . This is whatwe wanted to show.

    58

  • Part II

    Bayesian Statistics

    59

  • Chapter 6

    Bayesian Statistics:Background

    Frequency interpretation of probability

    In the frequency interpretation of probability, the probability of an eventis limiting proportion of times the event occurs in an infinite sequence ofindependent repetitions of the experiment. This interpretation assumes thatan experiment can be repeated!

    Problems with this interpretation:Independence is defined in terms of probabilities; if probabilities are de-

    fined in terms of independent events, this leads to a circular definition.How can we check whether experiments were independent, without doing

    more experiments?In practice we have only ever a finite number of experiments.

    Subjective probability

    Let P (A) denote your personal probability of an event A; this is a nu-merical measure of the strength of your degree of belief that A will occur, inthe light of available information.

    Your personal probabilities may be associated with a much wider classof events than those to which the frequency interpretation pertains. Forexample:

    60

  • - non-repeatable experiments (e.g. that England will win the World Cupnext time)

    - propositions about nature (e.g. that this surgical procedure results inincreased life expectancy) .

    All subjective probabilities are conditional, and may be revised in thelight of additional information. Subjective probabilities are assessments inthe light of incomplete information; they may even refer to events in the past.

    Axiomatic development

    Coherence:Coherence states that a system of beliefs should avoid internal inconsis-

    tencies. Basically, a quantitative, coherent belief system must behave as ifit was governed by a subjective probability distribution. In particular thisassumes that all events of interest can be compared.

    Note: different individuals may assign different probabilities to the sameevent, even if they have identical background information.

    (See Chapters 2 and 3 in Bernardo and Smith for fuller treatment offoundational issues.)

    Bayes Theorem

    Let B1, B2, . . . , Bk be a set of mutually exclusive and exhaustive events.For any event A with P (A) > 0,

    P (Bi|A) = P (Bi A)P (A)

    =P (A|Bi)P (Bi)kj=1 P (A|Bj)P (Bj)

    .

    Equivalently we write

    P (Bi|A) P (A|Bi)P (Bi).

    TerminologyP (Bi) is the prior probability of Bi;P (A|Bi) is the likelihood of A given Bi;

    61

  • P (Bi|A) is the posterior probability of Bi;P (A) is the predictive probability of A implied by the likelihoods and the

    prior probabilities.

    Example: Two events. Assume we have two events B1, B2, then

    P (B1|A)P (B2|A) =

    P (B1)

    P (B2) P (A|B1)P (A|B2) .

    If the data is relatively more probable under B1 than under B2, our belief inB1 compared to B2 is increased, and conversely.

    If B2 = Bc1:

    P (B1|A)P (Bc1|A)

    =P (B1)

    P (Bc1) P (A|B1)P (A|Bc1)

    .

    It follows that: posterior odds = prior odds likelihood ratio.

    Parametric modelsA Bayesian statistical model consists of1.) A parametric statistical model f(x|) for the data x, where a

    parameter; x may be multidimensional. - Note that we write f(x|) insteadof f(x, ) do emphasise the conditional character of the model. 2.) A priordistribution pi() on the parameter.

    Note: The parameter is now treated as random!

    The posterior distribution of given x is

    pi(|x) = f(x|)pi()f(x|)pi()d

    Shorter, we write: pi(|x) f(x|)pi() or posterior prior likeli-hood .

    Nuisance parameters

    Let = (, ), where is a nuisance parameter. Then pi(|x) = pi((, )|x).We calculate the marginal posterior of :

    pi(|x) =pi(, |x)d

    62

  • and continue inference with this marginal posterior. Thus we just integrateout the nuisance parameter.

    Prediction

    The (prior) predictive distribution of x on the basis pi is

    p(x) =

    f(x|)pi()d.

    Suppose data x1 is available, and we want to predict additional data:

    p(x2|x1) = p(x2, x1)p(x1)

    =

    f(x2, x1|)pi()df(x1|)pi()d

    =

    f(x2|, x1) f(x1|)pi()

    f(x1|)pi()dd

    =

    f(x2|) f(x1|)pi()

    f(x1|)pi()dd

    =

    f(x2|)pi(|x1)d.

    Note that x2 and x1 are assumed conditionally independent given . Theyare not, in general, unconditionally independent.

    Example

    (Bayes) A billard ball W is rolled from left to right on a line of length 1with a uniform probability of stopping anywhere on the line. It stops at p.A second ball O is then rolled n times under the same assumptions, and Xdenotes the number of times that the ball O stopped on the left of W . GivenX, what can be said about p?

    Our prior is (p) = 1 for 0 p 1; our model is

    P (X = x|p) =(n

    x

    )px(1 p)nx

    for x = 0, . . . , n

    63

  • We calculate the predictive distribution, for x = 0, . . . , n

    P (X = x) =

    10

    (n

    x

    )px(1 p)nxdp

    =

    (n

    x

    )B(x+ 1, n x+ 1)

    =

    (n

    x

    )x!(n x)!(n+ 1)!

    =1

    n+ 1,

    where B is the beta function,

    B(, ) =

    10

    p1(1 p)1dp = ()()( + )

    .

    We calculate the posterior distribution:

    pi(p|x) 1(n

    x

    )px(1 p)nx

    px(1 p)nx,

    so

    pi(p|x) = px(1 p)nx

    B(x+ 1, n x+ 1);

    this is the Beta(x+ 1, n x+ 1)-distribution.In particular the posterior mean is

    E(p|x) = 1

    0

    ppx(1 p)nx

    B(x+ 1, n x+ 1)dp

    =B(x+ 2, n x+ 1)B(x+ 1, n x+ 1)

    =x+ 1

    n+ 2.

    (For comparison: the mle is xn.)

    Further, P (O stops to the left of W on the next roll |x) is Bernoulli-distributed with probability of success E(p|x) = x+1

    n+2.

    64

  • Example: exponential model, exponential priorLet X1, . . . , Xn be a random sample with density f(x|) = ex for

    x 0, and assume pi() = e for 0; and some known . Then

    f(x1, . . . , xn|) = neni=1 xi

    and hence the posterior distribution is

    pi(|x) neni=1 xie

    ne(ni=1 xi+),

    which we recognize as Gamma(n+ 1, +n

    i=1 xi).

    Example: normal model, normal prior

    Let X1, . . . , Xn be a random sample from N (, 2), where 2 is known,and assume that the prior is normal, pi() N (, 2), where , 2 is known.Then

    f(x1, . . . , xn|) = (2pi2)n2 exp{1

    2

    ni=1

    (xi )22

    },

    so we calculate for the posterior that

    pi(|x) exp{1

    2

    (ni=1

    (xi )22

    +( )2 2

    )}=: e

    12M .

    We can calculate (Exercise)

    M = a

    ( b

    a

    )2 b

    2

    a+ c,

    a =n

    2+

    1

    2,

    b =1

    2

    xi +

    2,

    c =1

    2

    x2i +

    2

    2.

    65

  • So it follows that the posterior is normal,

    pi(|x) N(b

    a,

    1

    a

    )Exercise: the predictive distribution for x is N (, 2 + 2).Note: The posterior mean for is

    1 =12

    xi +

    2

    n2

    + 12

    .

    If 2 is very large compared to 2, then the posterior mean is approximately x.If 2/n is very large compared to 2, then the posterior mean is approximately.

    The posterior variance for is

    =1

    n2

    + 12

    < min

    {2

    n, 2},

    which is smaller than the original variances.

    Credible intervalsA (1 ) (posterior) credible interval is an interval of values within

    which 1 of the posterior probability lies. In the above example:

    P

    (z/2 < 1

    < z/2

    )= 1

    is a (1 ) (posterior) credible interval for .The equality is correct conditionally on x, but the rhs does not depend

    on x, so the equality is also unconditionally correct.If 2 then 2

    n 0, and 1 x 0, and

    P

    (x z/2

    n< < x+ z/2

    n

    )= 1 ,

    which gives the usual 100(1)% confidence interval in frequentist statistics.Note: The interpretation of credible intervals is different to confidence

    intervals: In frequentist statistics, xz/2 n applies before x is observed; therandomness relates to the distribution of x, whereas in Bayesian statistics,the credible interval is conditional on the observed x; the randomness relatesto the distribution of .

    66

  • Chapter 7

    Bayesian Models

    Sufficiency

    A sufficient statistic captures all the useful information in the data.Definition: A statistic t = t(x1, . . . , xn) is parametric sufficient for if

    pi(|x) = pi(|t(x)).

    Note: then we also have

    p(xnew|x) = p(xnew|t(x)).

    Factorization Theorem: t(x) is parametric sufficient if and only if

    pi(|x) = h(t(x), )pi()h(t(x), )pi()d

    for some function h.Recall: For classical suffiency, the factorization theorem gave as necessary

    and sufficient condition that

    f(x, ) = h(t(x), )g(x).

    Theorem: Classical sufficiency is equivalent to parametric sufficiency.

    67

  • To see this: assume classical sufficiency, then

    pi(|x) = f(x|)pi()f(x|)pi()d

    =h(t(x), )g(x)pi()h(t(x), )g(x)pi()d

    =h(t(x), )pi()h(t(x), )pi()d

    depends on x only through t(x), so pi(|x) = pi(|t(x)).Conversely, assume parametric sufficiency, then

    f(x|)pi()f(x)

    = pi(|x) = pi(|t(x)) = f(t(x)|)pi()f(t(x))

    and so

    f(x|) = f(t(x)|)f(t(x))

    f(x)

    which implies classical sufficiency.

    Example: A k-parameter exponential family is given by

    f(x|) = exp{

    ki=1

    cii()hi(x) + c() + d(x)

    }, x X

    where

    ec() =exp

    {ki=1

    cii()hi(x) + d(x)

    }dx

  • Exchangeability

    X1, . . . , Xn are (finitely) exchangeable if

    P (X1 E1, . . . , Xn En) = P (X(i) E1, . . . , X(n) En)

    for any permutation of {1, 2, . . . , n}, and any (measureable) sets E1, . . . , En.An infinite sequence X1, X2, . . . is exchangeable if every finite sequence is(finitely) exchangeable

    Intuitively: a random sequence is exchangeable if the random quantitiesdo not arise, for example, in a time ordered way.

    Every independent sequence is exchangeable, but NOT every exchange-able sequence is independent.

    Example: A simple random sample from a finite population (samplingwithout replacement) is exchangeable, but not independent.

    Theorem (de Finetti). If X1, X2, . . . is exchangeable, with probabilitymeasure P , then there exists a prior measure Q on the set of all distributions(on the real line) such that, for any n, the joint distribution function ofX1, . . . Xn has the form n

    i=1

    F (xi)dQ(F ),

    where F1, . . . , Fn are distributions, and

    Q(E) = limn

    P (Fn E),

    where Fn(x) =1n

    ni=1 1(Xi x) is the empirical c.d.f..

    Thus each exchangeable sequence arises from a 2-stage randomization:(a) pick F according to Q;(b) conditional on F , the observations are i.i.d. .

    De Finettis Theorem tells us that subjective beliefs which are consistentwith (infinite) exchangeability must be of the form

    (a) There are beliefs (a priori) on the parameter F; representing yourexpectations for the behaviour of X1, X2, . . .;

    69

  • (b) conditional on F the observations are i.i.d. .In Bayesian statistics, we can think of the prior distribution on the pa-

    rameter as such an F . If a sample is i.i.d. given , then the sample isexchangeable.

    Example. Suppose X1, X2, . . . are exchangeable 0-1 variables. Thenthe distribution of Xi is uniquely defined by p = P (Xi = 1); the set of allprobability distributions on {0, 1} is equivalent to the interval [0, 1]. Themeasure Q puts a probability on [0, 1]; de Finetti gives

    p(x1, . . . , xn) =

    pni=1 xi(1 p)n

    ni=1 xidQ(p).

    Not all finitely exchangeable sequences can be imbedded in an infiniteexchangeable sequence.

    Exercise: X1, X2 such that

    P (Xi = 1, X2 = 0) = P (X1 = 0, X2 = 1) =1

    2

    cannot be embedded in an exchangeable sequence X1, X2, X3.

    For further reading on de Finettis Theorem, see Steffen Lauritzens grad-uate lecture at www.stats.ox.ac.uk/~steffen/teaching/grad/definetti.pdf.

    70

  • Chapter 8

    Prior Distributions

    Let be a parameter space. How do we assign prior probabilities on ?Recall: we need a coherent belief system.

    In a discrete parameter space we assign subjective probabilities to eachelement of the parameter space, which is in principle straightforward.

    In a continuous parameter space:

    1. Histogram approach:Say, is an interval of R. We can discretize , assess the total mass

    assigned to each subinterval, and smoothen the histogram.

    2. Relative likelihood approach:Again, say, is an interval of R. We assess the relative likelihood that

    will take specific values. This relative likelihood is proportional to the priordensity. If is unbounded, normalization can be an issue.

    3. Particular functional forms: conjugate priors. A family F ofprior distributions for is closed under sampling from a model f(x|) if forevery prior distribution pi() F , the posterior pi(|x) is also in F .

    When this happens, the common parametric form of the prior and pos-terior are called a conjugate prior family for the problem. Then we also saythat the family F of prior distributions is conjugate to this class of models{f(x|), }.

    Often we abuse notation and call an element in the family of conjugatepriors a conjugate prior itself.

    71

  • Example: We have seen already: if X N (, 2) with 2 known; andif N (, 2), then the posterior for is also normal. Thus the family ofnormal distributions forms a conjugate prior family for this normal model.

    Example: Regular k-parameter exponential family. If

    f(x|) = exp{ki=1

    cii()hi(x) + c() + d(x)}, x X

    then the family of priors of the form

    pi(|) = (K())1exp{ki=1

    ciii() + 0c()},

    where = (0, . . . , k) is such that

    K() =

    e0c()exp{ki=1

    ciii()}d 1 and 0 1 >1, in which case it is the Beta(1 + 1, 0 1 + 1)-distribution. Thus thefamily of Beta distributions forms a conjugate prior family for the Bernoullidistribution.

    72

  • Example: Poisson distribution: Gamma prior. Here again k = 1,

    d(x) = log((x!)), c() = h1(x) = x, 1() = log ,

    and an element of the family of conjugate priors is given by

    pi(|) = 1K(0, 1)

    1e0 .

    The density will have a finite integral if and only if 1 > 1 and 0 > 0, inwhich case it is the Gamma(1 + 1, 0) distribution.

    Example: Normal, unknown variance: normal-gamma prior. Forthe normal distribution with mean , we let the precision be = 2, then = (, )

    d(x) = 12

    log(2pi), c(, ) = 2

    2+ log(

    )

    (h1(x), h2(x)) = (x, x2), (1(, ), 2(, )) = (,1

    2)

    and an element of the family of conjugate priors is given by

    pi(, |0, 1, 2) 02 exp

    {1

    220 + 1 1

    22

    } 0+12 1exp

    {1

    2

    (2

    21

    0

    )

    }

    12 exp

    {0

    2

    ( 1

    0

    )2}.

    The density can be interpreted as follows: Use a Gamma((0 + 1)/2, (2 21 /0)/2) prior for . Conditional on , we have a normal N (1/0, 1/(0))for . This is called a normal-gamma distribution for (, ); it will have afinite integral if and only if 2 >

    21 /0 and 0 > 0.

    Fact: In regular k-parameter exponential family models, the fam-ily of conjugate priors is closed under sampling. Moreover, if pi(|0, . . . , k)is in the above conjugate prior family, then

    pi(|x1, . . . , xn, 0, . . . , k) = pi(|0 + n, 1 +H1(x), . . . , k +Hk(x)),

    73

  • where

    Hi(x) =nj=1

    hi(xj).

    Recall that (n,H1(x), . . . , Hk(x)) is a sufficient statistic. Note that indeedthe posterior is of the same parametric form as the prior.

    The predictive density for future observations y = y1, . . . , ym is

    p(y|x1, . . . , xn, 0, . . . , k)= p(y|0 + n, 1 +H1(x), . . . , k +Hk(x))=

    K(0 + n+m, 1 +H1(x, y), . . . , k +Hk(x, y))

    K(0 + n, 1 +H1(x), . . . , k +Hk(x))em`1 d(y`),

    where

    Hi(x, y) =nj=1

    hi(xj) +m`=1

    hi(y`).

    This form is particularly helpful for inference: The effect of the datax1, . . . , xn is that the labelling parameters of the posterior are changed fromthose of the prior, (0, . . . , k) by simply adding the sufficient statistics

    (t0, . . . , tk) = (n,ni=1

    h1(xj), . . . ,ni=1

    hk(xj))

    to give the parameters (0 + t0, . . . , k + tk) for the posterior.

    Mixtures of priors from this conjugate prior family also lead to a simpleanalysis (Exercise).

    Noninformative priors

    Often we would like a prior that favours no particular values of the pa-rameter over others. If finite with || = n, then we just put mass 1

    nat

    each parameter value. If is infinite, there are several ways in which onemay seek a noninformative prior.

    Improper priors are priors which do not integrate to 1. They are inter-preted in the sense that posterior prior likelihood. Often they arise as

    74

  • natural limits of proper priors. They have to be handled carefully in ordernot to create paradoxes!

    Example: Binomial, Haldanes prior. Let X Bin(n, p), with nfixed, and let pi be a prior on p. Haldanes prior is

    pi(p) 1p(1 p) ;

    we have that 1

    0pi(p) dp =. This prior gives as marginal density

    p(x) 1

    0

    (p(1 p))1(n

    x

    )px(1 p)nxdp,

    which is not defined for x = 0 or x = n. For all other values of x we obtainthe Beta(x, n x)-distribution. The posterior mean is

    x

    x+ n x =x

    n.

    For x = 0, n: we could think of

    pi pi, Beta(, ),where , > 0 are small. Then the posterior is Beta( + x, + n x).Now let , 0: then pi, converges to pi. Note that the posterior meanconverges to x/n for all values of x.

    Noninformative priors for location parameters

    Let ,X be subsets of Euclidean space. Suppose that f(x|) is of theform f(x ): this is called a location density, is called location parameter.

    For a noninformative prior for : suppose that we observe y = x + c,where c fixed. If = + c then y has density f(y|) = f(y ), and so anoninformative priors should be the same (as we assume that we have thesame parameter space).

    Call pi the prior for the (x, )-problem, and pi the prior for the (y, )-problem. Then we want that for any (measureable) set A

    A

    pi()d =

    A

    pi()d =Ac

    pi()d =

    A

    pi( c)d,

    75

  • yieldingpi() = pi( c)

    for all . Hence pi() = pi(0) constant; usually we choose pi() = 1 for all .This is an improper prior.

    Noninformative priors for scale parameters

    A (one-dimensional) scale density is a density of the form

    f(x|) = 1f(x

    )for > 0; is called scale parameter. Consider y = cx, for c > 0; put = c, then y has density f(y|) = 1

    f(x

    ). Noninformative priors for and

    should be the same (assuming the same parameter space), so we want thatfor any (measureable) set A

    A

    pi()d =

    c1A

    pi()d =

    A

    pi(c1)c1d,

    yieldingpi() = c1pi(c1)

    for all > 0; pi(c) = c1pi(1); hence pi() 1. Usually we choose pi() = 1

    .

    This is an improper prior.

    Jeffreys Priors

    Example: Binomial model. Let X Bin(n, p).A plausible noninfor-mative prior would be p U [0, 1], but then p has higher density near 1than near 0. Thus ignorance about p seems to lead to knowledge aboutp, which is paradoxical. We would like the prior to be invariant under

    reparametrization.

    Recall: under regularity conditions, the Fisher information equals

    I() E(2logf(x|)

    2

    ).

    Under the same regularity assumptions, we define the Jeffreys prior as

    pi() I() 12 .

    76

  • Here I() = I1(). This prior may or may not be improper.

    Reparametrization: Let h be monotone and differentiable. The chain rulegives

    I() = I(h())

    (h

    )2.

    For the Jeffreys prior pi(), we have

    pi(h()) = pi()

    h1 I() 12 h

    1 = I(h()) 12 ;thus the prior is indeed invariant under reparametrization.

    The Jeffreys prior favours values of for which I() is large. Henceminimizes the effect of the prior distribution relative to the information inthe data.

    Exercise: The above non-informative priors for scale and location corre-spond to Jeffreys priors.

    Example: Binomial model, Jeffreys prior. Let X Bin(n, p);where n is known, so that f(x|p) = (n

    x

    )px(1 p)nx. Then

    2logf(x|p)2p

    = xp2 n x

    (1 p)2 .

    Take expectation and multiply by minus 1: I(p) = np

    + n(1p) =

    np(1p) Thus

    the Jeffreys prior ispi(p) (p(1 p)) 12 ,

    which we recognize as the Beta(1/2, 1/2)-distribution. This is a proper prior.

    If is multivariate, the Jeffreys prior is

    pi() [detI()] 12 ,which is still invariant under reparametrization.

    Example: Normal model, Jeffreys prior. Let X N (, 2), where = (, ). We abbreviate

    (x, , ) = log (x )2

    22;

    77

  • then

    I() = E(

    2

    2(x, , )

    2

    (x, , )

    2

    (x, , )

    2

    2(x, , )

    )

    = E( 12

    2(x)3

    2(x)3

    12 3(x)2

    4

    )=

    (12

    00 2

    2

    ).

    So

    pi() (

    1

    2 22

    ) 12

    12