
  • 1/61

    18.650 – Fundamentals of Statistics

    2. Foundations of Inference

  • 2/61

    Goals

    In this unit, we introduce a mathematical formalization of statistical modeling, to make principled sense of the trinity of statistical inference. We will make sense of the following statements:

    1. Estimation:

    ”p̂ = R̄n is an estimator for the proportion p of couples that turn their head to the right”

    (side question: is 64.5% also an estimator for p?)

    2. Confidence intervals:

    ”[0.56, 0.73] is a 95% confidence interval for p”

    3. Hypothesis testing:

    ”We find statistical evidence that more couples turn their head to the right when kissing”

  • 3/61

    The rationale behind statistical modeling

    ▶ Let X1, . . . , Xn be n independent copies of X.

    ▶ The goal of statistics is to learn the distribution of X.

    ▶ If X ∈ {0, 1}, easy! It’s Ber(p) and we only have to learn the parameter p of the Bernoulli distribution.

    ▶ It can be more complicated. For example, here is a (partial) dataset with the number of siblings (including self) that was collected from college students a few years back: 2, 3, 2, 4, 1, 3, 1, 1, 1, 1, 1, 2, 2, 3, 2, 2, 2, 3, 2, 1, 3, 1, 2, 3, . . .

    ▶ We could make no assumption and try to learn the pmf:

        x           1    2    3    4    5    6    ≥ 7
        IP(X = x)   p1   p2   p3   p4   p5   p6   ∑i≥7 pi

      That’s 7 parameters to learn.

    ▶ Or we could assume that X − 1 ∼ Poiss(λ). That’s 1 parameter to learn! (Both routes are sketched below.)
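    A minimal Python sketch contrasting the two routes on the partial sample above; the code and names are ours (illustrative), not from the course:

        from collections import Counter

        # Partial siblings sample shown above (24 students).
        x = [2, 3, 2, 4, 1, 3, 1, 1, 1, 1, 1, 2, 2, 3, 2, 2, 2, 3, 2, 1, 3, 1, 2, 3]
        n = len(x)

        # Route 1 (no assumption): estimate the pmf cell by cell -- 7 cells.
        counts = Counter(min(v, 7) for v in x)        # cell 7 stands for "x >= 7"
        pmf_hat = {k: counts.get(k, 0) / n for k in range(1, 8)}

        # Route 2 (model assumption): X - 1 ~ Poiss(lambda) -- one parameter.
        # The MLE of lambda is the sample mean of X - 1.
        lambda_hat = sum(v - 1 for v in x) / n

        print(pmf_hat)      # {1: 0.333..., 2: 0.375, 3: 0.25, 4: 0.0417, 5: 0.0, 6: 0.0, 7: 0.0}
        print(lambda_hat)   # 1.0 on this partial sample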

  • 4/61

    Statistical model

    Formal definition

    Let the observed outcome of a statistical experiment be a sample X1, . . . , Xn of n i.i.d. random variables in some measurable space E (usually E ⊆ IR) and denote by IP their common distribution. A statistical model associated to that statistical experiment is a pair

    (E, (IPθ)θ∈Θ) ,

    where:

    ▶ E is called the sample space;

    ▶ (IPθ)θ∈Θ is a family of probability measures on E;

    ▶ Θ is any set, called the parameter set.

  • 5/61

    Parametric, nonparametric and semiparametric models

    ▶ Usually, we will assume that the statistical model is well specified, i.e., defined such that IP = IPθ for some θ ∈ Θ.

    ▶ This particular θ is called the true parameter, and is unknown: the aim of the statistical experiment is to estimate θ, or to check its properties when they have a special meaning (θ > 2? θ ≠ 1/2? . . . ).

    ▶ We often assume that Θ ⊆ IRd for some d ≥ 1: the model is then called parametric.

    ▶ Sometimes Θ can be infinite dimensional, in which case the model is called nonparametric.

    ▶ If Θ = Θ1 × Θ2, where Θ1 is finite dimensional and Θ2 is infinite dimensional: semiparametric model. In these models we only care to estimate the finite-dimensional parameter; the infinite-dimensional one is called a nuisance parameter. We will not cover such models in this class.

  • 6/61

    Examples of parametric models

    1. For n Bernoulli trials: ({0, 1}, (Ber(p))p∈(0,1)).

    2. If X1, . . . , Xn iid∼ Poiss(λ) for some unknown λ > 0: (IN, (Poiss(λ))λ>0).

    3. If X1, . . . , Xn iid∼ N (µ, σ²) for some unknown µ ∈ IR and σ² > 0: (IR, (N (µ, σ²))(µ,σ²)∈IR×(0,∞)).

    4. If X1, . . . , Xn iid∼ Nd(µ, Id) for some unknown µ ∈ IRd: (IRd, (Nd(µ, Id))µ∈IRd).

  • 7/61

    Examples of nonparametric models

    1. If X1, . . . , Xn ∈ IR are i.i.d with unknown unimodal1 pdf f :

    E = Θ = IPθ

    2. If X1, . . . , Xn ∈ IR are i.i.d with unknown pdf 1-Lipschitz pdff , i.e.

    |f(x)− f(y)| ≤ |x− y|for all x, y ∈ IR.

    3. If X1, . . . , Xn ∈ [0, 1] are i.i.d with unknown invertible cdf F .

    1Increases on (−∞, a) and then decreases on (a,∞) for some a > 0.

  • 8/61

    Further examples

    Sometimes we do not have simple notation to write (IPθ)θ∈Θ, e.g., (Ber(p))p∈(0,1), and we have to be more explicit:

    1. Linear regression model: If (X1, Y1), . . . , (Xn, Yn) ∈ IRd × IR are i.i.d. from the linear regression model Yi = β⊤Xi + εi, εi iid∼ N (0, 1), for an unknown β ∈ IRd and Xi ∼ Nd(0, Id) independent of εi:

       E = ___ , Θ = ___ , IPθ = ___

    2. Cox proportional hazard model: If (X1, Y1), . . . , (Xn, Yn) ∈ IRd × IR and the conditional distribution of Y given X = x has cdf F of the form

       F(t) = 1 − exp(−∫0^t h(u) e^{β⊤x} du),

       where h is an unknown non-negative nuisance function and β ∈ IRd is the parameter of interest.

  • 9/61

    Identifiability

    The parameter θ is called identifiable iff the map θ ∈ Θ ↦ IPθ is injective, i.e.,

      θ ≠ θ′ ⇒ IPθ ≠ IPθ′

    or equivalently:

      IPθ = IPθ′ ⇒ θ = θ′ .

    Examples

    1. In all four previous examples, the parameter is identifiable.

    2. If Xi = 1I{Yi ≥ 0} (indicator function), where Y1, . . . , Yn iid∼ N (µ, σ²), for some unknown µ ∈ IR and σ² > 0, are unobserved: µ and σ² are not identifiable (but θ = µ/σ is).

  • 10/61

    Exercises

    a) Which of the following is a statistical model?

    1. ({1}, (Ber(p))p∈(0,1))

    2. ({0, 1}, (Ber(p))p∈(0.2,0.4))

    3. Both 1 and 2

    4. None of the above

    (answer: 2)

    b) Let X1, . . . , Xn iid∼ U([0, a]) for some unknown a > 0. Which one of the following is the associated statistical model?

    1. ([0, a], (U([0, a]))a>0)

    2. (IR+, (U([0, a]))a>0)

    3. (IR, (U([0, a]))a>0)

    4. None of the above

    (answer: 2)

  • 11/61

    Exercises

    c) Let Xi = Yi², where Y1, . . . , Yn iid∼ U([0, a]), for some unknown a, are unobserved. Is a identifiable?

    1. Yes

    2. No

    (answer: yes)

    d) Let Xi = 1I{Yi ≥ a/2}, where Y1, . . . , Yn iid∼ U([0, a]), for some unknown a, are unobserved. Is a identifiable?

    1. Yes

    2. No

    (answer: no)

  • 12/61

    Estimation

  • 13/61

    Parameter estimation

    ▶ Statistic: any measurable2 function of the sample, e.g., X̄n, maxi Xi, X1 + log(1 + |Xn|), the sample variance, etc.

    ▶ Estimator of θ: any statistic whose expression does not depend on θ.

    ▶ An estimator θ̂n of θ is weakly (resp. strongly) consistent if

      θ̂n −→ θ in IPθ-probability (resp. IPθ-a.s.) as n → ∞.

    ▶ An estimator θ̂n of θ is asymptotically normal if

      √n (θ̂n − θ) −→ N (0, σ²) in distribution as n → ∞.

      The quantity σ² is then called the asymptotic variance of θ̂n.

    2Rule of thumb: if you can compute it exactly once given data, it is measurable. You may have some issues with things that are implicitly defined, such as sup or inf, but not in this class.

  • 14/61

    Bias of an estimator

    ▶ Bias of an estimator θ̂n of θ:

      bias(θ̂n) = IE[θ̂n] − θ.

    ▶ If bias(θ̂n) = 0, we say that θ̂n is unbiased.

    ▶ Example: assume that X1, . . . , Xn iid∼ Ber(p) and consider the following estimators for p:

      ▶ p̂n = X̄n: bias(p̂n) = p − p = 0

      ▶ p̂n = X1: bias(p̂n) = p − p = 0

      ▶ p̂n = (X1 + X2)/2: bias(p̂n) = p − p = 0

      ▶ p̂n = √(1I(X1 = 1, X2 = 1)): bias(p̂n) = IP(X1 = 1, X2 = 1) − p = p² − p

  • 15/61

    Variance of an estimator

    An estimator is a random variable, so we can compute its variance. In the previous examples:

    ▶ p̂n = X̄n: var(p̂n) = p(1 − p)/n

    ▶ p̂n = X1: var(p̂n) = p(1 − p)

    ▶ p̂n = (X1 + X2)/2: var(p̂n) = p(1 − p)/2

    ▶ p̂n = √(1I(X1 = 1, X2 = 1)): var(p̂n) = p²(1 − p²)

    (These values are checked by simulation below.)
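    A Monte Carlo sketch (ours, not from the slides) that approximates the bias and variance of all four estimators, here for p = 0.3 and n = 50, so the closed-form values above can be verified:

        import random

        def simulate(p=0.3, n=50, reps=50_000, seed=0):
            rng = random.Random(seed)
            draws = {"X_bar_n": [], "X_1": [], "(X_1+X_2)/2": [], "sqrt(indicator)": []}
            for _ in range(reps):
                x = [1 if rng.random() < p else 0 for _ in range(n)]
                draws["X_bar_n"].append(sum(x) / n)
                draws["X_1"].append(x[0])
                draws["(X_1+X_2)/2"].append((x[0] + x[1]) / 2)
                draws["sqrt(indicator)"].append(float(x[0] == 1 and x[1] == 1))
            for name, vals in draws.items():
                m = sum(vals) / reps
                v = sum((t - m) ** 2 for t in vals) / reps
                print(f"{name:16s} bias ~ {m - p:+.4f}   var ~ {v:.4f}")

        simulate()
        # Exact values for p = 0.3, n = 50: biases 0, 0, 0 and p^2 - p = -0.21;
        # variances p(1-p)/n = 0.0042, p(1-p) = 0.21, p(1-p)/2 = 0.105, p^2(1-p^2) = 0.0819.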

  • 16/61

    Quadratic risk

    ▶ We want estimators to have low bias and low variance at the same time.

    ▶ The risk (or quadratic risk) of an estimator θ̂n ∈ IR is

      R(θ̂n) = IE[|θ̂n − θ|²].

    ▶ Low quadratic risk means that both bias and variance are small:

      quadratic risk = bias² + variance
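    To see why, add and subtract IE[θ̂n] inside the square; in LaTeX:

        \begin{aligned}
        R(\hat\theta_n)
          &= \mathbb{E}\big[(\hat\theta_n - \theta)^2\big]
           = \mathbb{E}\big[(\hat\theta_n - \mathbb{E}[\hat\theta_n]
               + \mathbb{E}[\hat\theta_n] - \theta)^2\big] \\
          &= \underbrace{\mathbb{E}\big[(\hat\theta_n - \mathbb{E}[\hat\theta_n])^2\big]}_{\operatorname{var}(\hat\theta_n)}
           + \underbrace{\big(\mathbb{E}[\hat\theta_n] - \theta\big)^2}_{\operatorname{bias}(\hat\theta_n)^2},
        \end{aligned}

    where the cross term vanishes because IE[θ̂n − IE[θ̂n]] = 0.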

  • 17/61

    Exercises

    Let X1, X2, . . . , Xn be a random sample from U([a, a + 1]).

    a) Find IE[X̄n]. (answer: a + 1/2)

    b) Is X̄n − 1/2 an unbiased estimator for a? (answer: yes)

    c) Find the variance of X̄n − 1/2. (answer: var(X̄n − 1/2) = var(X̄n) = var(X1)/n = 1/(12n))

    d) Find the quadratic risk of X̄n − 1/2. (answer: the bias is 0, so the answer is the same as in c))

  • 18/61

    Confidence intervals

  • 19/61

    Confidence intervals

    Let (E, (IPθ)θ∈Θ) be a statistical model based on observations X1, . . . , Xn, and assume Θ ⊆ IR. Let α ∈ (0, 1).

    ▶ Confidence interval (C.I.) of level 1 − α for θ: any random (depending on X1, . . . , Xn) interval I whose boundaries do not depend on θ and such that3:

      IPθ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.

    ▶ C.I. of asymptotic level 1 − α for θ: any random interval I whose boundaries do not depend on θ and such that:

      lim n→∞ IPθ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ.

    3I ∋ θ means that I contains θ. This notation emphasizes the randomness of I, but we can equivalently write θ ∈ I.

  • 20/61

    A confidence interval for the kiss example

    ▶ Recall that we observe R1, . . . , Rn iid∼ Ber(p) for some unknown p ∈ (0, 1).

    ▶ Statistical model: ({0, 1}, (Ber(p))p∈(0,1)).

    ▶ Recall that our estimator for p is p̂ = R̄n.

    ▶ From the CLT:

      √n (R̄n − p)/√(p(1 − p)) −→ N (0, 1) in distribution as n → ∞.

      This means (precisely) the following:

    ▶ Let Φ(x) be the cdf of N (0, 1), and Φn(x) the cdf of √n (R̄n − p)/√(p(1 − p)).

    ▶ Then Φn(x) ≈ Φ(x) (CLT) when n becomes large. Hence, for all x > 0,

      IP[|R̄n − p| ≥ x] ≈ 2 (1 − Φ(x√n/√(p(1 − p)))).

  • 21/61

    Confidence interval?

    ▶ For a fixed α ∈ (0, 1), if qα/2 is the (1 − α/2)-quantile of N (0, 1), then with probability ≈ 1 − α (if n is large enough!),

      R̄n ∈ [p − qα/2 √(p(1 − p))/√n , p + qα/2 √(p(1 − p))/√n].

    ▶ More precisely,

      lim n→∞ IP([R̄n − qα/2 √(p(1 − p))/√n , R̄n + qα/2 √(p(1 − p))/√n] ∋ p) = 1 − α.

    ▶ But this is not a confidence interval, because its boundaries depend on the unknown p.

    ▶ To fix this, there are 3 solutions.

  • 22/61

    Solution 1: Conservative bound

    ▶ Note that no matter the (unknown) value of p,

      p(1 − p) ≤ 1/4.

    ▶ Hence, asymptotically with probability at least 1 − α,

      R̄n ∈ [p − qα/2/(2√n) , p + qα/2/(2√n)].

    ▶ We get the asymptotic confidence interval:

      Iconserv = [R̄n − qα/2/(2√n) , R̄n + qα/2/(2√n)]

    ▶ Indeed,

      lim n→∞ IP(Iconserv ∋ p) ≥ 1 − α.

  • 23/61

    Solution 2: Solving the (quadratic) equation for p

    ▶ We have the system of two inequalities in p:

      R̄n − qα/2 √(p(1 − p))/√n ≤ p ≤ R̄n + qα/2 √(p(1 − p))/√n.

    ▶ Each is a quadratic inequality in p of the form

      (p − R̄n)² ≤ qα/2² p(1 − p)/n.

      We need to find the roots p1 < p2 of

      (1 + qα/2²/n) p² − (2R̄n + qα/2²/n) p + R̄n² = 0.

    ▶ This leads to a new confidence interval Isolve = [p1, p2] such that:

      lim n→∞ IP(Isolve ∋ p) = 1 − α.

      (It is complicated to write generically, so let us wait until we have values for n, α and R̄n to plug in.)

  • 24/61

    Solution 3: plug-in

    ▶ Recall that by the LLN, p̂ = R̄n −→ p (in IP-probability and a.s.) as n → ∞.

    ▶ So by Slutsky, we also have

      √n (R̄n − p)/√(p̂(1 − p̂)) = √n (R̄n − p)/√(p(1 − p)) · √(p(1 − p))/√(p̂(1 − p̂)) −→ N (0, 1) in distribution as n → ∞.

    ▶ This leads to a new confidence interval:

      Iplug-in = [R̄n − qα/2 √(p̂(1 − p̂))/√n , R̄n + qα/2 √(p̂(1 − p̂))/√n]

      such that

      lim n→∞ IP(Iplug-in ∋ p) = 1 − α.

  • 25/61

    95% asymptotic CI for the kiss example

    Recall that in the kiss example we had n = 124 and R̄n = 0.645. Assume α = 5%.

    For Isolve, we have to find the roots of:

      1.03 p² − 1.32 p + 0.41 = 0:   p1 = 0.53, p2 = 0.75.

    We get the following confidence intervals of asymptotic level 95%:

    ▶ Iconserv = [0.56 , 0.73]

    ▶ Isolve = [0.53 , 0.75]

    ▶ Iplug-in = [0.56 , 0.73]

    There are many4 other possibilities in software, even ones that use the exact distribution of nR̄n ∼ Bin(n, p):

    ▶ R default = [0.55 , 0.73]

    4See R. Newcombe (1998). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods.
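    A Python sketch (ours) reproducing the three intervals above. One caveat: solving the quadratic with unrounded coefficients gives roots ≈ [0.56, 0.72]; the [0.53, 0.75] on the slide comes from first rounding the coefficients to two decimals.

        import math

        n, r_bar, alpha = 124, 0.645, 0.05
        q = 1.96  # (1 - alpha/2)-quantile of N(0, 1)

        # Solution 1: conservative bound, using p(1 - p) <= 1/4.
        half = q / (2 * math.sqrt(n))
        conserv = (r_bar - half, r_bar + half)

        # Solution 2: roots of (1 + q^2/n) p^2 - (2 r_bar + q^2/n) p + r_bar^2 = 0.
        a = 1 + q**2 / n
        b = -(2 * r_bar + q**2 / n)
        c = r_bar**2
        disc = math.sqrt(b**2 - 4 * a * c)
        solve = ((-b - disc) / (2 * a), (-b + disc) / (2 * a))

        # Solution 3: plug in p_hat = r_bar for the unknown variance.
        half = q * math.sqrt(r_bar * (1 - r_bar) / n)
        plug_in = (r_bar - half, r_bar + half)

        print(conserv)   # ~ (0.557, 0.733)
        print(solve)     # ~ (0.558, 0.724), vs. (0.53, 0.75) from rounded coefficients
        print(plug_in)   # ~ (0.561, 0.729)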

  • 26/61

    Exercises

    a) Let I, J be some 95% and 98% asymptotic confidence intervals (respectively) for p. Which one of the following statements is correct?

    1. We always have I ⊂ J.

    2. We always have J ⊂ I.

    3. None of the above.

    b) Find a 98% asymptotic confidence interval for p.

  • 27/61

    Exercises

    c) Consider a new experiment in which there are 150 participants, 75 turned left and 75 turned right. Which of the following is the correct answer?

    1. [0, 0.5] is a 50% asymptotic confidence interval for p

    2. [0.5, 1] is a 50% asymptotic confidence interval for p

    3. [0.466, 0.533] is a 50% asymptotic confidence interval for p

    4. [0.48, 0.52] is a 50% asymptotic confidence interval for p

    5. both (1) and (2)

    6. (1), (2) and (3)

    7. (1), (2), (3) and (4)

  • 28/61

    Exercises

    d) If [0.34, 0.57] is a 95% confidence interval for an unknown proportion p, then the probability that p ∈ [0.34, 0.57] is

    1. 0.025

    2. 0.05

    3. 0.95

    4. None of the above

    (answer: 4)

    e) If [0.34, 0.57] is a 95% confidence interval for an unknown proportion p, is it also a 98% confidence interval?

    1. Yes

    2. No

    (answer: no)

    f) If [0.34, 0.57] is a 95% confidence interval for an unknown proportion p, is it also a 90% confidence interval?

    1. Yes

    2. No

    (answer: yes)

  • 29/61

    Another example: The T

  • 30/61

    Statistical problem

    ▶ You observe the times (in minutes) between arrivals of the T at Kendall: T1, . . . , Tn.

    ▶ You assume that these times are:

      ▶ mutually independent;

      ▶ exponential random variables with common parameter λ > 0.

    ▶ You want to estimate the value of λ, based on the observed arrival times.

  • 31/61

    Discussion of the modeling assumptions

    ▶ Mutual independence of T1, . . . , Tn: plausible but not completely justified (often the case with independence).

    ▶ T1, . . . , Tn are exponential r.v.’s: lack of memory of the exponential distribution:

      IP[T1 > t + s | T1 > t] = IP[T1 > s], ∀s, t ≥ 0.

      Also, Ti > 0 almost surely!

    ▶ The exponential distributions of T1, . . . , Tn have the same parameter: on average, all the same inter-arrival time. True only for a limited period (rush hour ≠ 11pm).

  • 32/61

    Estimator

    ▶ Density of T1:

      f(t) = λ e^{−λt}, ∀t ≥ 0.

    ▶ IE[T1] = 1/λ.

    ▶ Hence, a natural estimate of 1/λ is

      T̄n := (1/n) ∑i=1..n Ti.

    ▶ A natural estimator of λ is

      λ̂ := 1/T̄n.

  • 33/61

    First properties

    ▶ By the LLN,

      T̄n −→ 1/λ (a.s. and in probability) as n → ∞.

    ▶ Hence,

      λ̂ −→ λ (a.s. and in probability) as n → ∞.

    ▶ By the CLT,

      √n (T̄n − 1/λ) −→ N (0, λ⁻²) in distribution as n → ∞.

    ▶ How does the CLT transfer to λ̂? How do we find an asymptotic confidence interval for λ?

  • 34/61

    The Delta method

    Let (Zn)n≥1 be a sequence of random variables such that

      √n (Zn − θ) −→ N (0, σ²) in distribution as n → ∞,

    for some θ ∈ IR and σ² > 0 (i.e., Zn is asymptotically normal).

    Let g : IR → IR be continuously differentiable at the point θ. Then:

    ▶ g(Zn) −→ g(θ) in probability as n → ∞;

    ▶ (g(Zn))n≥1 is also asymptotically normal; more precisely,

      √n (g(Zn) − g(θ)) −→ N (0, g′(θ)² σ²) in distribution as n → ∞.
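    For the T example, take Zn = T̄n, θ = 1/λ, σ² = 1/λ², and g(x) = 1/x, so that g(T̄n) = λ̂; in LaTeX:

        g'(x) = -\frac{1}{x^2}, \qquad
        g'\!\left(\tfrac{1}{\lambda}\right)^2 \sigma^2
          = \lambda^4 \cdot \lambda^{-2} = \lambda^2,
        \qquad \text{so} \qquad
        \sqrt{n}\,(\hat\lambda - \lambda)
          \xrightarrow[n \to \infty]{(d)} \mathcal{N}(0, \lambda^2).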

  • 35/61

    Consequence of the Delta method

    ▶ √n (λ̂ − λ) −→ N (0, λ²) in distribution as n → ∞.

    ▶ Hence, for α ∈ (0, 1) and when n is large enough,

      |λ̂ − λ| ≤ qα/2 λ/√n

      with probability approximately 1 − α.

    ▶ Can [λ̂ − qα/2 λ/√n , λ̂ + qα/2 λ/√n] be used as an asymptotic confidence interval for λ?

      No! It depends on λ . . .

  • 36/61

    Three “solutions”

    1. The conservative bound: we have no a priori way to bound λ.

    2. We can solve for λ:

       |λ̂ − λ| ≤ qα/2 λ/√n ⟺ λ (1 − qα/2/√n) ≤ λ̂ ≤ λ (1 + qα/2/√n)

                              ⟺ λ̂ (1 + qα/2/√n)⁻¹ ≤ λ ≤ λ̂ (1 − qα/2/√n)⁻¹.

       It yields

       Isolve = [λ̂ (1 + qα/2/√n)⁻¹ , λ̂ (1 − qα/2/√n)⁻¹].

    3. Plug-in yields

       Iplug-in = [λ̂ (1 − qα/2/√n) , λ̂ (1 + qα/2/√n)].

  • 37/61

    95% asymptotic CI for the Kendall T example

    Assume that n = 64, T̄n = 6.23, and α = 5%.

    We get the following confidence intervals of asymptotic level 95%:

    ▶ Isolve = [0.13 , 0.21]

    ▶ Iplug-in = [0.12 , 0.20]

    (A numerical check follows below.)
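    A short Python check (ours) of these two intervals:

        import math

        n, t_bar, alpha = 64, 6.23, 0.05
        q = 1.96                      # (1 - alpha/2)-quantile of N(0, 1)
        lam_hat = 1 / t_bar           # ~ 0.1605
        r = q / math.sqrt(n)          # q / sqrt(64) = 0.245

        solve = (lam_hat / (1 + r), lam_hat / (1 - r))
        plug_in = (lam_hat * (1 - r), lam_hat * (1 + r))

        print(solve)     # ~ (0.129, 0.213)
        print(plug_in)   # ~ (0.121, 0.200)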

  • 38/61

    Meaning of a confidence interval

    Take Iplug-in = [0.12 , 0.20] for example. What is the meaning of “Iplug-in is a confidence interval of asymptotic level 95%”?

    Does it mean that

      lim n→∞ IP(λ ∈ [0.12 , 0.20]) ≥ .95 ?

    No! There is nothing random in the event “λ ∈ [0.12, 0.20]”: λ is an unknown constant, so this probability is either 0 or 1.

    There is a frequentist interpretation5: if we were to repeat this experiment (collect 64 observations) many times, then λ would be in the resulting confidence interval about 95% of the time.

    [Embedded figure from OpenIntro Statistics (image credit: openintro.org): twenty-five samples of size n = 100 were taken from yrbss; from each sample a confidence interval “point estimate ± 2 × SE” was built. Only 1 of the 25 intervals did not capture the true mean, µ = 3.90 days.]

    5The frequentist approach is often contrasted with the Bayesian approach.

  • 39/61

    Hypothesis testing

  • 40/61

    How to board a plane?

  • 41/61

    What is the fastest boarding method?

    What is the fastest method to board a plane: R2F or WilMA?

    ▶ R2F = Rear to Front

    ▶ WilMA = Window, Middle, Aisle. It is basically an OUTSIDE to INSIDE method.

  • 42/61

    The data

    We collected data from two different airlines: JetBlue (R2F) and United (WilMA). We got the following results:

                           R2F     WilMA
      Average (mins)       24.2    15.9
      Std. Dev (mins)      2.1     1.3
      Sample size          72      56

  • 43/61

    Model and Assumptions

    ▶ Let X (resp. Y ) denote the boarding time of a random JetBlue (resp. United) flight.

    ▶ We assume that X ∼ N (µ1, σ1²) and Y ∼ N (µ2, σ2²).

    ▶ Let n and m denote the JetBlue and United sample sizes, respectively.

    ▶ We have X1, . . . , Xn independent copies of X, and Y1, . . . , Ym independent copies of Y .

    ▶ We further assume that the two samples are independent.

    We want to answer the question:

      Is µ1 = µ2 or is µ1 > µ2?

    By making modeling assumptions, we have reduced the number of ways the hypothesis µ1 = µ2 may be rejected. We do not allow that µ1 < µ2!

    We have two samples: this is a two-sample test.

  • 44/61

    A first heuristic

    Simple heuristic:

      “If X̄n > Ȳm, then µ1 > µ2”

    This could go wrong if I randomly pick only full flights in my sample X1, . . . , Xn and empty flights in my sample Y1, . . . , Ym.

    Better heuristic: If

      X̄n − ___ > Ȳm + ___

    then conclude that µ1 > µ2.

    To make this intuition more precise, we need to take the size of the random fluctuations of X̄n and Ȳm into account!

  • 45/61

    Waiting time in the ER

    ▶ The average waiting time in the Emergency Room (ER) in the US is 30 minutes, according to the CDC.

    ▶ Some patients claim that the new Princeton-Plainsboro hospital has a longer waiting time. Is it true?

    ▶ Here, we collect only one sample: X1, . . . , Xn (waiting times in minutes for n random patients) with unknown expected value IE[X1] = µ.

    ▶ We want to know if µ > 30.

    This is a one-sample test.

  • 46/61

    Heuristic

    Heuristic: If

      X̄n > 30 + buffer

    then conclude that µ > 30.

  • 47/61

    Example 1

    According to a survey conducted in 2017 on 4,971 randomly sampled Americans, 32% report getting at least some of their news on YouTube. Can we conclude that at most a third of all Americans get at least some of their news on YouTube?

    ▶ n = 4,971, X1, . . . , Xn iid∼ Ber(p);

    ▶ X̄n = 0.32

    ▶ If it were true that p = .33, then by the CLT,

      √n (X̄n − .33)/√(.33(1 − .33)) ≈ N (0, 1).

    ▶ Here, √n (X̄n − .33)/√(.33(1 − .33)) ≈ −1.50.

    ▶ Conclusion: it does not seem reasonable to reject the assumption that p = .33. (The arithmetic is checked below.)
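    A one-line Python check (ours) of the −1.50:

        import math

        n, x_bar, p0 = 4971, 0.32, 0.33
        z = math.sqrt(n) * (x_bar - p0) / math.sqrt(p0 * (1 - p0))
        print(round(z, 2))   # -1.5: a plausible draw from N(0, 1)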

  • 48/61

    The Standard Gaussian distribution

    [Figure: Gaussian density with the 68-95-99.7 rule: about 68% of the mass lies within µ ± σ, 95% within µ ± 2σ, and 99.7% within µ ± 3σ.]

  • 49/61

    Example 2

    A coin is tossed 30 times, and Heads are obtained 13 times. Can we conclude that the coin is significantly unfair?

    ▶ n = 30, X1, . . . , Xn iid∼ Ber(p);

    ▶ X̄n = 13/30 ≈ .43

    ▶ If it were true that p = .5, then by the CLT,

      √n (X̄n − .5)/√(.5(1 − .5)) ≈ N (0, 1).

    ▶ Our data gives √n (X̄n − .5)/√(.5(1 − .5)) ≈ −.77.

    ▶ The number −.77 is a plausible realization of a random variable Z ∼ N (0, 1).

    ▶ Conclusion: our data does not suggest that the coin is unfair.

  • 50/61

    Statistical formulation

    ▶ Consider a sample X1, . . . , Xn of i.i.d. random variables and a statistical model (E, (IPθ)θ∈Θ).

    ▶ Let Θ0 and Θ1 be disjoint subsets of Θ.

    ▶ Consider the two hypotheses:

      H0 : θ ∈ Θ0    vs.    H1 : θ ∈ Θ1

    ▶ H0 is the null hypothesis, H1 is the alternative hypothesis.

    ▶ If we believe that the true θ is either in Θ0 or in Θ1, we may want to test H0 against H1.

    ▶ We want to decide whether to reject H0 (i.e., look for evidence against H0 in the data).

  • 51/61

    Asymmetry in the hypotheses

    ▶ H0 and H1 do not play a symmetric role: the data is only used to try to disprove H0.

    ▶ In particular, lack of evidence does not mean that H0 is true (“innocent until proven guilty”).

    ▶ A test is a statistic ψ ∈ {0, 1} such that:

      ▶ if ψ = 0, H0 is not rejected;

      ▶ if ψ = 1, H0 is rejected.

    ▶ Kiss example: H0: p = 1/2 vs. H1: p ≠ 1/2.

    ▶ ψ = 1I{ |√n (X̄n − .5)/√(.5(1 − .5))| > C }, for some C > 0.

    ▶ How to choose the threshold C?

  • 52/61

    Errors

    ▶ Rejection region of a test ψ:

      Rψ = {x ∈ En : ψ(x) = 1}.

    ▶ Type 1 error of a test ψ (rejecting H0 when it is actually true):

      αψ : Θ0 → IR,  θ ↦ IPθ[ψ = 1].

    ▶ Type 2 error of a test ψ (not rejecting H0 although H1 is actually true):

      βψ : Θ1 → IR,  θ ↦ IPθ[ψ = 0].

    ▶ Power of a test ψ:

      πψ = inf θ∈Θ1 (1 − βψ(θ)).

  • 53/61

    Level, test statistic and rejection region

    ▶ A test ψ has level α if

      αψ(θ) ≤ α, ∀θ ∈ Θ0.

    ▶ A test ψ has asymptotic level α if

      lim n→∞ αψ(θ) ≤ α, ∀θ ∈ Θ0.

    ▶ In general, a test has the form

      ψ = 1I{Tn > c},

      for some statistic Tn and threshold c ∈ IR.

    ▶ Tn is called the test statistic. The rejection region is Rψ = {Tn > c}.

  • 54/61

    One-sided vs two-sided tests

    We can refine the terminology when θ ∈ Θ ⊂ IR and H0 is of the form

      H0 : θ = θ0 ⇔ Θ0 = {θ0}.

    ▶ If H1 : θ ≠ θ0: two-sided test.

    ▶ If H1 : θ > θ0 or H1 : θ < θ0: one-sided test.

    Examples:

    ▶ Boarding method: one-sided (µ1 > µ2)

    ▶ Waiting time in the ER: one-sided (µ > 30)

    ▶ The kiss example: two-sided (p ≠ 1/2)

    ▶ Fair coin: two-sided (p ≠ 1/2)

    One- or two-sided tests will have different rejection regions.

  • 55/61

    Bernoulli experiment

    ▶ Let X1, . . . , Xn iid∼ Ber(p), for some unknown p ∈ (0, 1).

    ▶ We want to test:

      H0: p = 1/2 vs. H1: p ≠ 1/2

      with asymptotic level α ∈ (0, 1).

    ▶ Let Tn = |√n (p̂n − 0.5)/√(.5(1 − .5))|, where p̂n is the MLE.

    ▶ If H0 is true, then by the CLT,

      IP[Tn > qα/2] −→ α as n → ∞.

    ▶ Let ψα = 1I{Tn > qα/2}. (A sketch of this test in code follows below.)
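    A hedged Python sketch of this test (the function names are ours, not the course’s; math.erf supplies the standard normal cdf):

        import math

        def phi(x):
            # Standard normal cdf, via the error function.
            return 0.5 * (1 + math.erf(x / math.sqrt(2)))

        def psi(x_bar, n, q_half=1.96):
            # Two-sided test of H0: p = 1/2. q_half must be the (1 - alpha/2)-quantile
            # of N(0, 1); 1.96 corresponds to alpha = 5%.
            t_n = abs(math.sqrt(n) * (x_bar - 0.5) / 0.5)
            p_value = 2 * (1 - phi(t_n))      # the p-value is defined two slides below
            return int(t_n > q_half), p_value

        # Fair-coin data from Example 2: 13 heads in 30 tosses.
        print(psi(13 / 30, 30))   # (0, ~0.47): H0 is not rejected at level 5%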

  • 56/61

    Examples

    For α = 5%, qα/2 = 1.96 and qα = 1.645.

    Fair coin

    H0 is not rejected at the asymptotic level 5% by the test ψ5%.

    News on YouTube

    H0 : p ≥ 0.33 vs. H1 : p < 0.33. This is a one-sided test. We reject if:

      √n (p̂n − p)/√(p(1 − p)) < c.

    But what value for p ∈ Θ0 = [0.33, 1) should we choose? The type 1 error is the function p ↦ IPp[ψ = 1]. To control the level, we need to find the p that maximizes it over Θ0 → no need for computations, it is clearly p = 0.33.

    H0 is not rejected at the asymptotic level 5% by the test ψ5%.

  • 57/61

    p-value

    Definition

    The (asymptotic) p-value of a test ψα is the smallest (asymptotic) level α at which ψα rejects H0. It is random: it depends on the sample.

    Golden rule

    p-value ≤ α ⇔ H0 is rejected by ψα at the (asymptotic) level α.

    The smaller the p-value, the more confidently one can reject H0.

  • 58/61

    Exercise: Cookies6

    Students are asked to count the number of chocolate chips in 32 cookies for a class activity. They found that the cookies on average had 14.77 chocolate chips, with a standard deviation of 4.37 chocolate chips. The packaging for these cookies claims that there are at least 20 chocolate chips per cookie. One student thinks this number is unreasonably high, since the average they found is much lower. Another student claims the difference might be due to chance. What do you think (compute a p-value)? One possible solution is sketched below.

    6From the textbook OpenIntro Statistics.
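    One hedged way to compute the requested p-value, treating n = 32 as large enough for the CLT and plugging in the sample standard deviation for σ (a sketch, ours):

        import math

        # One-sample test of H0: mu >= 20 vs H1: mu < 20.
        n, x_bar, s, mu0 = 32, 14.77, 4.37, 20
        z = math.sqrt(n) * (x_bar - mu0) / s                # ~ -6.77
        p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))    # one-sided p-value, Phi(z)
        print(z, p_value)   # p-value ~ 6e-12: the "at least 20 chips" claim looks implausible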

  • 59/61

    Exercise: kiss

    Recall that in the kiss example we observed 80 out of 124 couples turning their head to the right. Formulate the statistical hypothesis problem, compute the p-value and conclude. One possible formulation is sketched below.
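    A sketch (ours) of one natural formulation, H0: p = 1/2 vs. H1: p > 1/2 (one-sided, matching the claim that more couples turn right):

        import math

        n, p_hat = 124, 80 / 124                              # p_hat ~ 0.645
        t_n = math.sqrt(n) * (p_hat - 0.5) / 0.5              # ~ 3.23
        p_value = 1 - 0.5 * (1 + math.erf(t_n / math.sqrt(2)))
        print(t_n, p_value)   # p-value ~ 6e-4: H0 is rejected at any standard level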

  • 60/61

    Exercise : Machine learning predicts breast cancer

    A major problem in breast cancer screening is false positives, that is, surgery performed on benign tumors. A new machine learning procedure claims to improve significantly on the state of the art (95% false positives) while preserving the same true positive rate (detecting malignant tumors as malignant). To verify this claim, we collected data on 297 benign tumors. The algorithm recommended surgery on 206 of them.

    Let p denote the proportion of benign tumors on which the algorithm prescribes surgery. Formulate the statistical hypothesis problem, compute the p-value and conclude.

  • 61/61

    Recap

    ▶ A statistical model is a pair of the form (E, (IPθ)θ∈Θ), where E is the sample space and (IPθ)θ∈Θ is a family of candidate probability distributions.

    ▶ A model can be well specified and identifiable.

    ▶ The trinity of statistical inference: estimation, confidence intervals and testing.

    ▶ Estimator: one value, whose performance can be measured by consistency, asymptotic normality, bias, variance and quadratic risk.

    ▶ Confidence intervals provide “error bars” around estimators. Their size depends on the confidence level.

    ▶ Hypothesis testing: we want a yes/no answer about an unknown parameter. Tests are characterized by hypotheses, level, power, test statistic and rejection region. Under the null hypothesis, the value of the unknown parameter becomes known (no need for plug-in).