objectives*suhasini/teaching301/stat301... · sample ≠ population, and sample mean ≠ population...




Objectives

6.1, 7.1 Estimating with confidence (CIS: Chapter 10)

• Statistical confidence (CIS gives a good explanation of a 95% CI)
• Confidence intervals
• Choosing the sample size
• t distributions
• One-sample t confidence interval for a population mean
• How confidence intervals behave

Adapted from authors’ slides © 2012 W.H. Freeman and Company

Overview of Inference

• Sample ≠ population, and sample mean x̄ ≠ population mean µ. But we do not know the value of µ, and if we want to draw any conclusions about µ then we have to use x̄ to do so.

• Methods for drawing conclusions about a population from sample data are called statistical inference.

• There are two main types of inference:
  § Confidence Intervals - estimating the value of a population parameter, and
  § Tests of Significance - assessing evidence for a claim (hypothesis) about a population.

• Inference is appropriate when data are produced by either
  § a random sample or
  § a randomized experiment.

Introducing confidence intervals

• It is very unlikely that the sample mean based on a sample will ever equal the true mean. Our aim is to construct an interval around the sample mean which is `likely’ to contain the mean. This is called a confidence interval.
  § In the first lecture we considered a Gallup poll for the proportion of the electorate that would vote for Obama.
  § Gallup predicted that the Obama vote would be in the interval [45%,51%] with 95% confidence.
  § The Obama vote turned out to be 50.5%, so the interval did capture the true proportion.
  § You may be asking yourself how to understand the 95%: since 50.5% lies in this interval, there does not appear to be any uncertainty in it.

• In the next few slides, our objective is to understand how a confidence interval is constructed and how to interpret it.

Review: properties of the sample mean

The sample mean x̄ is a unique number for any particular sample. If you had obtained a different sample (by chance) you almost certainly would have had a different value for your sample mean.

In fact, you could get many different values for the sample mean, and virtually none of them would actually equal the true population mean, µ.

Because the sampling distribution of x̄ is narrower than the population distribution, by a factor of √n, the estimates x̄ tend to be closer to the population parameter µ than individual observations are.

[Figure: the population distribution of individual subjects x (standard deviation σ) and the narrower sampling distribution of sample means x̄ based on n subjects (standard deviation σ/√n), both centered at µ.]

If the population is normally distributed N(µ,σ), the sampling distribution of x̄ is N(µ,σ/√n).
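The σ/√n claim can be checked by simulation. The following is only a sketch: the population values (µ = 67, σ = 3.8), the seed, the number of repetitions and the helper name `sample_mean_sd` are illustrative assumptions, not part of the original slides.

```python
import random
import statistics

def sample_mean_sd(mu, sigma, n, reps, rng):
    """Draw `reps` samples of size n from N(mu, sigma) and return the
    standard deviation of the resulting sample means."""
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)

rng = random.Random(301)      # fixed seed so the run is repeatable
mu, sigma = 67.0, 3.8         # assumed population values, for illustration only

for n in (1, 4, 25):
    observed = sample_mean_sd(mu, sigma, n, reps=5000, rng=rng)
    theory = sigma / n ** 0.5  # sigma / sqrt(n)
    print(f"n={n:2d}  simulated sd of sample means={observed:.2f}  sigma/sqrt(n)={theory:.2f}")
```

The simulated spread of the sample means should sit close to σ/√n for each n, which is the narrowing described above.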

• Using the empirical rule, since the distribution of the sample mean is close to normal, 95% of the time the sample mean will be within 2 standard errors of the mean µ. That is, if I had a hundred sample means, then about 95 of them would lie in the interval [µ – 2×σ/√n, µ + 2×σ/√n].

• Now we make a small correction to the empirical rule. It is not 2 standard errors from the mean, but 1.96 standard errors from the mean. To see why, look up −1.96 in the z-tables: the area to its left is 2.5%, so 95% of the area lies between −1.96 and 1.96.

• But the mean is unknown, so our objective is to locate the true mean based on the sample mean.

• To do this we turn the story around: if the sample mean lies in the interval [µ – 1.96×σ/√n, µ + 1.96×σ/√n], this is the same as saying the mean µ lies in the interval [sample mean – 1.96×σ/√n, sample mean + 1.96×σ/√n].

• Thus 95% of the time, the true mean (that we want to estimate) will be in the interval [sample mean – 1.96×σ/√n, sample mean + 1.96×σ/√n]. This is an interval which is centered about the sample mean. In the next slide we illustrate what we mean by 95%.
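The step of "turning the story around" is just a rearrangement of inequalities. Written out (with x̄ denoting the sample mean, a symbol assumed here for compactness):

```latex
\begin{align*}
\mu - 1.96\,\frac{\sigma}{\sqrt{n}} \;\le\; \bar{x} \;\le\; \mu + 1.96\,\frac{\sigma}{\sqrt{n}}
&\iff -1.96\,\frac{\sigma}{\sqrt{n}} \;\le\; \bar{x} - \mu \;\le\; 1.96\,\frac{\sigma}{\sqrt{n}} \\
&\iff \bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}
\end{align*}
```

Both statements describe the same event, so they happen in the same 95% of samples.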

If multiple samples were possible

[Figure: sampling distribution of x̄ with standard deviation σ/√n. Red dot: mean value of an individual sample.]

95% of all sample means will be within 1.96 (roughly 2) standard deviations (1.96×σ/√n) of the population parameter µ.

This implies that the population parameter µ will be within 1.96 standard deviations of the sample average x̄ in 95% of all samples.

This reasoning is the essence of statistical inference.

Mean height – sample size one

• Human heights are approximately normally distributed. The standard deviation of human height is 3.8 inches.
• Our objective is to construct a confidence interval for the mean height.
• We start with a very crude estimator and use just one height to estimate the mean; this is the same as using a sample of size one. In this case the standard error is 3.8/√1 = 3.8.
• Each of you construct a 95% confidence interval for the mean height using your height as the sample:
  [your height – 1.96×3.8, your height + 1.96×3.8] = [your height – 7.44, your height + 7.44].
  For example, in my case the interval is [63 – 7.44, 63 + 7.44] = [55.56,70.44].
• Each of you do this too. In fact it is known that the mean height of a person is 67 inches. Does your interval contain the mean? The proportion of intervals that contain the mean should be approximately 95%.

Mean height – sample size two

• In the previous experiment we used just one individual to estimate the mean height. The `cost’ of using one individual was that the confidence interval was very wide.
• We repeat the experiment, but this time each of you buddy up with your neighbour and calculate the average height of the two of you (i.e. (your height + neighbour’s height)/2). You and your buddy form a sample of size two.
• We know that the sample mean based on a sample of size n=2 has the standard error 3.8/√2 = 2.68.
• Each group construct the interval [sample mean – 1.96×2.68, sample mean + 1.96×2.68] = [sample mean ± 5.26].
• The mean height is 67 inches. Does your interval contain the mean?
• What proportion of the intervals in the class contain the mean?

Observations

• We see that the length of the confidence interval when using just one person in the sample is 2×7.44 = 14.88. This is quite long, and does not really allow us to pinpoint the mean.
• Whereas the length of the interval using two people to calculate the sample mean is 2×5.26 = 10.52. This is quite a big reduction in length!
• If ten people were used to calculate the sample mean, the corresponding interval length would be 14.88/√10 = 4.7.
• We see that for any given interval, either the mean is in this interval or it is not. The 95% comes into play when we look at the proportion of intervals that contain the mean.
• In reality:
  § We do not know the true mean µ, so we will never know whether the interval contained the mean or not.
  § We only observe one sample of size n, and thus have one CI.
• Even one confidence interval contains information about the mean. This is why we say that with 95% confidence the mean lies in it.
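The classroom experiment can be imitated on a computer. This is a rough sketch, assuming heights are exactly N(67, 3.8) as the slides state; the function name `coverage`, the seed and the number of repetitions are illustrative choices.

```python
import random
from statistics import NormalDist, fmean

MU, SIGMA = 67.0, 3.8                 # mean and standard deviation quoted on the slides
Z95 = NormalDist().inv_cdf(0.975)     # about 1.96

def coverage(n, reps, rng):
    """Return (fraction of 95% intervals containing MU, interval length)
    when each interval is built from a sample of n simulated heights."""
    margin = Z95 * SIGMA / n ** 0.5
    hits = 0
    for _ in range(reps):
        xbar = fmean(rng.gauss(MU, SIGMA) for _ in range(n))
        if xbar - margin <= MU <= xbar + margin:
            hits += 1
    return hits / reps, 2 * margin

rng = random.Random(2012)             # fixed seed so the run is repeatable
for n in (1, 2, 10):
    frac, length = coverage(n, 10000, rng)
    print(f"n={n:2d}  interval length={length:5.2f}  fraction containing 67: {frac:.3f}")
```

The interval length shrinks with n, while the fraction of intervals that contain the mean should stay near 95% for every n. That fraction is what the confidence level refers to.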

Implications

We do not need to (and cannot, anyway) take a lot of random samples to “rebuild” the sampling distribution and find µ at its center.

[Figure: a population with mean µ and a single sample of size n drawn from it.]

All we need is one SRS of size n and we can rely on the properties of the sampling distribution to infer reasonable values for the population mean µ.

Multiple samples revisited

With 95% confidence, we can say that µ should be within 1.96 standard deviations (1.96×σ/√n) of our sample mean x̄.

• In 95% of all possible samples of this size n, µ will indeed fall in our confidence interval.
• In only 5% of samples will x̄ be farther from µ.
• “Confidence” = the proportion of possible samples that give us a correct conclusion.

Calculation practice

• You want to rent an unfurnished one-bedroom apartment in Dallas. The mean monthly rent for 10 randomly sampled apartments is 980 dollars. Assume that monthly rents follow a normal distribution with standard deviation 280 dollars. Construct a 95% confidence interval for the mean monthly rent of a one-bedroom apartment.
  § The standard error for the sample mean is 280/√10 = 88.54.
  § Thus the 95% CI is [980 ± 1.96×88.54] = [806,1153]. With 95% confidence we believe the mean rent of one-bedroom apartments in Dallas lies in this interval.

• Does the above confidence interval mean that 95% of all rents should lie in this interval?
  § No, it is the interval for the mean. If we want the interval where 95% of all rents should lie, it is [980 ± 1.96×√(88.54² + 280²)] = [980 ± 1.96×293.67] = [404,1556]. You do not have to understand the calculation, but you will notice this interval is much wider. The reason is that it must capture 95% of all rents, which are extremely varied. The previous CI was just capturing the mean rent, based on the sample mean, which is much less varied.
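As a rough check of the arithmetic, the sketch below computes both intervals for the apartment example. The interval for a single new rent is computed here by adding the variance of one rent (σ²) to the variance of the sample mean (σ²/n); that combination rule is stated as an assumption of this sketch, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

xbar, sigma, n = 980.0, 280.0, 10     # values from the apartment example
z = NormalDist().inv_cdf(0.975)       # about 1.96

# Interval for the mean rent: uses the standard error sigma / sqrt(n).
se = sigma / sqrt(n)
ci_mean = (xbar - z * se, xbar + z * se)

# Interval for one new rent (assumption: variances add, sigma^2 + sigma^2 / n).
sd_new = sqrt(sigma ** 2 + se ** 2)
interval_rent = (xbar - z * sd_new, xbar + z * sd_new)

print(f"standard error:          {se:.2f}")
print(f"95% CI for the mean:     [{ci_mean[0]:.0f}, {ci_mean[1]:.0f}]")
print(f"95% interval for a rent: [{interval_rent[0]:.0f}, {interval_rent[1]:.0f}]")
```

The second interval is much wider because it has to cover individual rents, not just the mean.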

Calculation practice

• Hypokalemia is diagnosed when the blood potassium level is below 3.5mEq/dl. The potassium in a blood sample varies from sample to sample and follows a normal distribution with standard deviation 0.2.

• A patient’s potassium is measured over 4 days. The sample mean level over these 4 days is 3.7. Construct a 95% confidence interval for the mean potassium and discuss whether the patient is likely to be diagnosed with hypokalemia.
  § The standard error for the sample mean is 0.2/√4 = 0.1. Thus the 95% confidence interval for the mean potassium level is [3.7 ± 1.96×0.1] = [3.504,3.896]. This means with 95% confidence we believe the mean lies in this interval.
  § Since no value of 3.5 or less lies in this interval, with 95% confidence I can say that the patient does not have this condition.

Confidence interval misunderstandings

• Suppose 400 alumni were asked to rate the University of Okoboji counseling services on a scale of 1 to 10. The sample mean was found to be 8.6 and it is known that the standard deviation is σ=2. Ima Bitlost has done the analysis, but has made some mistakes.

• Ima computes the 95% CI for the mean satisfaction score as [8.6±1.96×2]. What is her mistake?
  § Ima has not taken into account that the sample mean has a much smaller standard deviation (standard error) than the population. The standard error is 2/√400 = 0.1. Thus the true CI is [8.6±1.96×0.1] = [8.4,8.796].

• After correcting her mistake, she states “I am 95% confident that the sample mean lies in the interval [8.4,8.796].” What is wrong with her statement?
  § This is a meaningless statement: the sample mean certainly lies in this interval, because the interval is centered on it! It is the population mean that we are 95% confident lies there.

• She quickly realizes her mistake and instead states “the probability that the mean lies in the interval [8.4,8.796] is 95%”. What misinterpretation is she making now?
  § By 95%, we mean that if we repeated the experiment many times over, about 95% of the intervals would contain the mean. For any given interval the mean is either in there or not. There is no probability attached to it. To overcome this issue, we say that we have 95% confidence that the mean lies in this interval.

• Finally, in her defense for using the normal distribution to determine the confidence coefficient (1.96) she says “Because the sample size is quite large, the population of alumni ratings will be close to normal”. Explain to Ima her misunderstanding.
  § The distribution of the population always stays the same, regardless of the sample size (in this case, it is clear that a variable that takes integer values between 1 and 10 cannot be normal). However, the distribution of the sample mean does get closer to normal as the sample size grows. With a sample size of 400, the distribution of the sample mean will be very close to normal.
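The last point can be illustrated with a small simulation. The sketch below assumes a hypothetical population in which each integer rating from 1 to 10 is equally likely (this is not the alumni data, whose distribution is not given); the seed and the number of repetitions are also arbitrary choices.

```python
import random
from statistics import fmean, pstdev

ratings = list(range(1, 11))      # hypothetical population: ratings 1..10, equally likely
mu = fmean(ratings)               # population mean, 5.5
sigma = pstdev(ratings)           # population standard deviation, about 2.87

n, reps = 400, 2000
se = sigma / n ** 0.5             # standard error of the sample mean

rng = random.Random(8)            # fixed seed so the run is repeatable
means = [fmean(rng.choices(ratings, k=n)) for _ in range(reps)]

# If the sample mean is close to normal, about 95% of sample means
# should fall within 1.96 standard errors of mu.
within = sum(abs(m - mu) <= 1.96 * se for m in means) / reps
print(f"fraction of sample means within 1.96 standard errors: {within:.3f}")
```

The population here is flat and discrete, so it is clearly not normal, yet the sample means behave as the normal distribution predicts.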

Different levels of confidence

• There is no need to restrict ourselves to 95% confidence intervals.
• The level of confidence we use really depends on how much confidence we want. For example, you would expect that a 99% confidence interval is more likely to contain the mean than a 95% confidence interval.
• To construct a 99% confidence interval we use exactly the same prescription as used to construct a 95% confidence interval; the only thing that changes is that 1.96 goes to 2.57 (if you look up -2.57 in the z-tables you will see this corresponds to 0.5%, so 99% of the time the sample mean will lie within 2.57 standard errors of the mean).
  § A 99% CI for the mean one-bedroom apartment price is [980±2.57×88.54]. Length of interval is 2×2.57×88.54 ≈ 455.
  § A 90% CI for the mean one-bedroom apartment price is [980±1.64×88.54]. Length of interval is 2×1.64×88.54 ≈ 290.

What does a 100% confidence interval look like? In a 100% CI we are sure to find the mean, but this interval is so wide it is not informative.
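The multipliers can also be obtained without tables. A minimal sketch, using Python's standard-library `statistics.NormalDist`; the helper name `z_star` is an illustrative choice, and the standard error is the one from the apartment example.

```python
from statistics import NormalDist

def z_star(conf):
    """z* such that the area under the standard normal curve
    between -z* and z* equals conf."""
    return NormalDist().inv_cdf(0.5 + conf / 2)

se = 280 / 10 ** 0.5      # standard error from the apartment example, about 88.54

for conf in (0.80, 0.90, 0.95, 0.98, 0.99):
    z = z_star(conf)
    print(f"{conf:.0%}  z* = {z:.3f}  interval length = {2 * z * se:.0f}")
```

Higher confidence needs a larger multiplier, and so a longer interval.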

Sample size and length of the CI

• Let us return to the apartment example. We recall that the 95% confidence interval for the mean price is [980 ± 1.96×88.54] = [806,1153]. The length of this interval is 2×1.96×88.54 = 347.

• What happens to the length of the interval if I increase the sample size?
  § Suppose I take an SRS of 100 apartments in Dallas, and the sample mean based on this sample is 1000. What will the CI be?
  § The standard error is 280/√100 = 28 (much smaller than when the sample size is 10), and the CI is [1000 ± 1.96×28] = [945,1055]. The length of this interval is 2×1.96×28 ≈ 110.

• What we observe is:
  § The length of the interval does not depend on the sample mean, which is just the center of the interval. It depends only on the confidence level (through 1.96), the standard deviation and the sample size.
  § The length of the interval gets smaller as the sample size grows.

• This suggests that if we want the interval to have a certain level of precision, we can choose the sample size accordingly.

Margin of Error

• Margin of error is the lingo used for the plus and minus part in the confidence interval.
  § That is, if the confidence interval is [sample mean ± 1.96×σ/√n], the margin of error is 1.96×σ/√n.

• For example, in the previous example the margin of error for the CI based on 10 apartments is 1.96×88.54 ≈ 173.5.

• The margin of error for the CI based on 100 apartments is 1.96×28 ≈ 54.9.

• The margin of error is, in some sense, a measure of accuracy. The smaller the margin of error, the more precisely we can pinpoint the true mean.

• Suppose we want the margin of error to be equal to some value; then we can find the sample size such that we obtain that margin of error. Solve for n the equation MoE = 1.96×σ/√n (the Margin of Error and the standard deviation σ are given). See the next slide for an example.

Calculation practice: What sample size for a given margin of error?

Annual coffee sales:

A marketing firm plans to study the annual sales in coffee shops. They want to estimate the mean annual sales to within $0.2 million, this time with 98% confidence. How many coffee shops should they sample to obtain a margin of error of at most $0.2 million with a confidence level of 98%? From a previous study they guess σ ≈ $1.03 million.

To solve the formula we need to find the correct z-score that will give a 98% CI. Looking up the tables we see that z* = 2.326. Thus we solve the equation m = z*×σ/√n for n:

n = (z*×σ/m)² = (2.326×1.03/0.2)² ≈ 12.0² ≈ 144.

From the calculation, we see they need 144 observations so that the margin of error is at most $0.2 million.
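The sample-size calculation can be written as a short function. This is a sketch only: the function name `sample_size` is hypothetical, and the result is rounded up because the margin of error would exceed the target with any smaller whole number of observations.

```python
import math

def sample_size(z_star, sigma, margin):
    """Smallest n for which z_star * sigma / sqrt(n) is at most `margin`."""
    return math.ceil((z_star * sigma / margin) ** 2)

# Coffee shop example: 98% confidence, sigma about 1.03, margin 0.2 (millions of dollars).
n_coffee = sample_size(2.326, 1.03, 0.2)
print("coffee shops needed:", n_coffee)

# Check: with that n the margin of error is at or below the target.
print("resulting margin:", round(2.326 * 1.03 / math.sqrt(n_coffee), 4))
```

Halving the margin of error multiplies the required sample size by four, since n depends on (1/m)².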

Calculation practice

• In a study of bone turnover in young women with a medical condition, serum TRAP was measured in 31 subjects. The sample mean was 13.2 units per liter. Assume the standard deviation is known to be 6.5U/l. Find the 80% CI for the mean serum level.
  § Look up 10% in the z-tables; this gives 1.28. The standard error for the sample mean is 6.5/√31 = 1.17. Altogether this gives the CI [13.2±1.28×1.17] = [11.7,14.7]. This means we believe with 80% confidence that the mean serum level for women with this medical condition lies in this interval. By choosing such a low level of confidence our interval is quite narrow, but our confidence in this interval is relatively low.

• How large a sample size should we choose such that the 80% CI for the mean has the margin of error 1U/l?
  § This means solving 1.28×6.5/√n = 1, which gives n = (1.28×6.5/1)² = 69.2, so we round up to n = 70.

A confidence interval for µ can be expressed two ways.

• x̄ ± m. m is called the margin of error.
  Egg carton example: 64.17g ± 2.83g. We say “We conclude that µ is within 2.83g of 64.17g, with 95% confidence.”

• Two endpoints of an interval: (x̄ − m) to (x̄ + m).
  Egg carton example: 61.34g to 67.00g. We say “We conclude that µ is between 61.34g and 67.00g, with 95% confidence.”

Again, the confidence level C is the proportion of possible samples for which the conclusion is correct. That is, it is the proportion of possible samples for which the interval contains µ. (C usually is given in %.)

But there is an important issue to deal with.

§ We do not know the value of σ any more than we know the value of µ.

When σ is unknown

In this case we can estimate the standard deviation from the data. The sample standard deviation s provides an estimate of the population standard deviation σ.

• When the sample size is large, the sample is likely to contain elements representative of the whole population. Then s is a good estimate of σ.

• But when the sample size is small, the sample contains only a few individuals. Then s is a mediocre estimate of σ.

• With a small sample, the data are unlikely to contain values in the tails, and s is likely to underestimate σ.

[Figure: the population distribution, with a small sample and a large sample drawn from it.]

The z-transform with estimated standard deviation

• Simply replacing the true standard deviation with the estimated standard deviation can have severe consequences for the confidence interval if we do not correct for it.

• To see why, consider the z-transforms of the sample mean with known and estimated standard deviations:
  § (sample mean - µ)/(σ/√n)
  § (sample mean - µ)/(s/√n)

• In the first case, the z-transform will be a standard normal. In the second case the estimated standard deviation adds extra variability into the `system’. In particular, because s can be smaller than σ, the z-transform can be larger and take more extreme values than we would expect for a standard normal.

• In the next few slides we show that when we estimate the standard deviation the z-transform is no longer a standard normal, but follows the so-called t-distribution.

How brewers saved statistics

• Just over 100 years ago, W.S. Gosset was a biometrician who worked for Guinness Brewery in Dublin, Ireland.

• Gosset realized that his inferences with small sample data seemed to be incorrect too often – his true confidence level was less than it was stated to be!

• He worked out the proper method that took into account substituting s for σ.

• But he had to publish under a pseudonym: Student.

• Gosset’s theory is based on the distribution of the quantity t = (x̄ − µ)/(s/√n).

• This looks like the z-score for x̄, except that s replaces σ in the denominator.

Student’s t distributions

Suppose that an SRS of size n is drawn from a Normal(µ,σ) population.

• When σ is known, the sampling distribution for z = (x̄ − µ)/(σ/√n) is Normal(0,1).

• When σ is estimated by the sample standard deviation s, the sampling distribution for t = (x̄ − µ)/(s/√n) will be very close to normal if the sample size n is large. This is because for large n, s will be a very reliable estimator of σ.

• However, in the case that n is not so large, the variability in s will have an impact on the distribution.

• It is clear that the impact it has depends on the sample size.

Student’s t distributions

• When σ is estimated by the sample standard deviation s, the sampling distribution for t = (x̄ − µ)/(s/√n) will depend on the sample size.

The sampling distribution of t = (x̄ − µ)/(s/√n) is a t distribution with n − 1 degrees of freedom.

• The degrees of freedom (df) is a measure of how well s estimates σ. The larger the degrees of freedom, the better σ is estimated.

• This means we need a new set of tables!


When n is very large, s is a very good estimate of σ, and the corresponding t distributions are very close to the normal distribution.

The t distributions become wider (thicker tailed) for smaller sample sizes, reflecting that s can be smaller than σ, so the corresponding t-transform is more likely to take extreme values than the z-transform.
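The thicker tails can be seen in a simulation that compares the two transforms on the same samples. This is a sketch under assumed settings (a standard normal population, the cutoff 1.96, a fixed seed, 5000 repetitions); the function name `tail_rates` is hypothetical.

```python
import random
from statistics import fmean, stdev

def tail_rates(n, reps, rng, mu=0.0, sigma=1.0, cutoff=1.96):
    """Fraction of samples whose z-transform (known sigma) and
    t-transform (estimated s) fall outside +/- cutoff."""
    z_hits = t_hits = 0
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(xs)
        z = (xbar - mu) / (sigma / n ** 0.5)       # uses the true sigma
        t = (xbar - mu) / (stdev(xs) / n ** 0.5)   # uses the sample s
        z_hits += abs(z) > cutoff
        t_hits += abs(t) > cutoff
    return z_hits / reps, t_hits / reps

rng = random.Random(1908)     # fixed seed so the run is repeatable
for n in (3, 9, 50):
    z_rate, t_rate = tail_rates(n, 5000, rng)
    print(f"n={n:2d}  P(|z| > 1.96) ~ {z_rate:.3f}   P(|t| > 1.96) ~ {t_rate:.3f}")
```

For the z-transform the rate should stay near 5% at every n. For the t-transform it should be well above 5% when n is small and approach 5% as n grows, which is why 1.96 is too small a multiplier for small samples.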

Impact on confidence intervals

Suppose we want to construct the C% confidence interval for the mean. The standard deviation is unknown, so as well as estimating the mean we also estimate the standard deviation from the sample.

Practical use of t: t*

• t* is related to the chosen confidence level C.

• C is the area under Student’s t curve between −t* and t*.

The confidence interval is thus: x̄ ± t*×s/√n

Example: For an 80% confidence level C, 80% of Student’s t curve’s area is contained in the interval from −t* to t*.

[Figure: Student’s t curve with central area C between −t* and t*.]

Confidence level and the margin of error

The confidence level C determines the value of t* (in Table D).

The margin of error also depends on t*:

m = t*×s/√n

[Figure: Student’s t curve with central area C between −t* and t*.]

§ Higher confidence C implies a larger margin of error m (thus less precision in our estimates).

§ A lower confidence level C produces a smaller margin of error m (thus better precision in our estimates).

§ We find t* in the line of Table D for df = n−1 and confidence level C.

Table D

When the sample is very large, we use the normal distribution and the standardized z-value.

When σ is unknown, we use a t distribution with “n−1” degrees of freedom (df) for t = (x̄ − µ)/(s/√n).

Table D shows the z-values and t-values corresponding to landmark P-values/confidence levels.

• Focus first on 2.5%. In each row, the 2.5% corresponds to the area in each of the left and right tails of the t-distribution with that row’s degrees of freedom (df = n − 1). Remember a distribution gives the chance/likelihood of certain outcomes.

• Recall that for a normal distribution, the point where we get 2.5% in each of the left and right tails of the distribution is 1.96.

• If we go down the table, we see that as the sample size n (and hence the degrees of freedom) increases, the value corresponding to 2.5% goes from 12.71 (for df = 1) to a number that is very close to 1.96 for extremely large n.

• This means that for small n the variability in the standard deviation s makes the chance of the t-transform being extreme relatively large.

• However, as n grows, the estimator of the standard deviation improves, and the t-transform gets closer to a normal distribution.

• You will observe the same is true for other percentages. Take a look at 5% and 0.5% and look down the table.
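The pattern in the table can be reproduced numerically. The sketch below does not use a statistics library: it integrates the Student t density with Simpson's rule and finds t* by bisection. The function names `central_area` and `t_star`, the step sizes and the iteration count are all choices made for this sketch, and it is intended for small to moderate degrees of freedom only.

```python
import math

def central_area(t, df):
    """Area under Student's t curve (df degrees of freedom) between -t and t,
    approximated with Simpson's rule."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: const * (1 + x * x / df) ** (-(df + 1) / 2)
    n = 2 * max(50, math.ceil(t * 100))   # even number of subintervals
    h = t / n
    total = pdf(0.0) + pdf(t)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 2 * total * h / 3              # symmetric curve: double the area on [0, t]

def t_star(df, conf):
    """Value t* with central area `conf` between -t* and t*, found by bisection."""
    lo, hi = 0.0, 1.0
    while central_area(hi, df) < conf:    # widen the bracket until it contains t*
        hi *= 2
    for _ in range(40):
        mid = (lo + hi) / 2
        if central_area(mid, df) < conf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (1, 2, 8, 14, 30, 100):
    print(f"df={df:3d}  t* for 95% confidence = {t_star(df, 0.95):.3f}")
```

Going down the list, the 95% value falls from about 12.7 towards the normal value 1.96, matching what is read off the 2.5% column of the table.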

Calculation practice (red wine 1)

It has been suggested that drinking red wine in moderation may protect against heart attacks. This is because red wine contains polyphenols which act on blood cholesterol.

To see if moderate red wine consumption increases the average blood level of polyphenols, a group of nine randomly selected healthy men was assigned to drink half a bottle of red wine daily for two weeks. The percent changes in their blood polyphenol levels are presented here:

0.7 3.5 4.0 4.9 5.5 7.0 7.4 8.1 8.4

Sample average x̄ = 5.50
Sample standard deviation s = 2.517
Degrees of freedom df = n − 1 = 8

We will encounter two problems when doing the analysis. The first is that the sample size is not huge, so we have to hope that the distribution of the sample mean is close to normal. The second is that the standard deviation is unknown and has to be estimated from the data.

• What is the 95% confidence interval for the average percent change?

• First, we determine what t* is. The degrees of freedom are df = n − 1 = 8 and C = 95%. From Table D we get t* = 2.306.

• The margin of error m is: m = t* × s/√n = 2.306 × 2.517/√9 ≈ 1.93. So the 95% confidence interval is 5.50 ± 1.93, or 3.57 to 7.43.

• We can say “With 95% confidence, the mean percent increase is between 3.57% and 7.43%.”

• What if we want a 99% confidence interval instead?
  § For C = 99% and df = 8, we find t* = 3.355. Thus m = 3.355 × 2.517/√9 ≈ 2.81.

• Now, with 99% confidence, we can only conclude that the mean is between 2.69 and 8.31. (A big price to pay for the extra confidence.)
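The red wine intervals can be recomputed directly from the nine data values. In this sketch the t* values (2.306 and 3.355) are simply the Table D values quoted on the slides, and the function name `t_interval` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev

# Percent changes in blood polyphenol level for the nine men.
changes = [0.7, 3.5, 4.0, 4.9, 5.5, 7.0, 7.4, 8.1, 8.4]

def t_interval(data, t_star):
    """Return (lower, upper, margin) for xbar +/- t_star * s / sqrt(n)."""
    n = len(data)
    xbar = mean(data)
    margin = t_star * stdev(data) / sqrt(n)
    return xbar - margin, xbar + margin, margin

print(f"mean = {mean(changes):.2f}, s = {stdev(changes):.3f}, df = {len(changes) - 1}")

# t* values taken from Table D for df = 8, as quoted on the slides.
for level, t_star in (("95%", 2.306), ("99%", 3.355)):
    lo, hi, m = t_interval(changes, t_star)
    print(f"{level} CI: {lo:.2f} to {hi:.2f} (margin of error {m:.2f})")
```

Only t* changes between the two intervals, so the extra width of the 99% interval comes entirely from the larger multiplier.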

Calculation practice (red wine 2)

Let us return to the same study, but this time we increase the sample size to 15 men. The data are now:

0.7, 3.5, 4, 4.9, 5.5, 7, 7.4, 8.1, 8.4, 3.2, 0.8, 4.3, -0.2, -0.6, 7.5

The sample mean in this case is 4.3 and the sample standard deviation is 3.06.

Since the sample size has increased, it is likely that the sample standard deviation is a more reliable estimator of the true standard deviation.

The number of degrees of freedom is 14.

Just as in the previous example we can construct a 95% confidence interval, but now we use 14 df instead of 8 df.

More calculation practice

• Let us return to the example of prices of apartments in Dallas. 10 apartments are randomly sampled. The sample mean and the sample standard deviation based on this sample are 980 dollars and 250 dollars (both are estimates based on a sample of size ten). Construct a 95% confidence interval for the mean:
  § The standard error is 250/√10 = 79.
  § Looking up the t-tables at 2.5% and 9 degrees of freedom gives 2.262.
  § The 95% confidence interval for the mean is [980 ± 2.262×79] = [801,1159].

• Suppose we want to know whether the price of apartments has increased since last year, when the mean price was 850 dollars.
  § Based on this interval we see that 850 dollars, as well as larger values, is contained in this interval. This means the mean could be 850 dollars or higher. Therefore, given the sample, it is unclear whether the mean price of apartments has increased since last year or not.


Example: comparing z and t-values

p  We want to calculate a 99% CI for the mean weight of a newborn calf. To do this, upload the calf data into StatCrunch.

p  Go to Stat; from here you have two options. If we treat the standard deviation of calf weights as known (not random), then we can use the z-statistic; otherwise we need to use the t-statistic.

p  Suppose we choose the t-statistic option -> one-sample -> with data -> then choose the variable of weights at birth (wt 0). To get the 99% CI we need to select 99% on the second page of the options. We get the 99% interval (using a t-distribution with 43 degrees of freedom) [90.05, 96.37] pounds. This means that, with 99% confidence, we believe the mean weight of newborn calves lies in this interval.

p  To see how well the normal distribution works, we do the same, but choose the z-statistic option. This gives the 99% confidence interval [90.19, 96.23]. Notice that it is slightly narrower, because it does not take into account the uncertainty in the estimated standard deviation.
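
To see where the difference in width comes from, the critical values can be backed out of the two reported intervals. This is a rough sketch only: it assumes both StatCrunch intervals use the same standard error, and it works from the rounded limits printed above, so the result is approximate.

```python
from statistics import NormalDist

t_limits = (90.05, 96.37)   # 99% CI from the t option (43 df), as reported above
z_limits = (90.19, 96.23)   # 99% CI from the z option, as reported above

half_t = (t_limits[1] - t_limits[0]) / 2   # margin of error based on t*
half_z = (z_limits[1] - z_limits[0]) / 2   # margin of error based on z*

z_star = NormalDist().inv_cdf(0.995)   # 99% confidence leaves 0.5% in each tail
std_err = half_z / z_star              # standard error implied by the z interval
implied_t_star = half_t / std_err      # critical value implied by the t interval

print(f"z* = {z_star:.3f}, implied t* = {implied_t_star:.3f}")
print(f"t interval is {100 * (half_t / half_z - 1):.1f}% wider")
```

The implied t* is a little larger than z*, which is exactly the extra allowance the t distribution makes for estimating σ with s. With 43 degrees of freedom the two are already close.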


More calculation practice

p  Let us return to the M&M data. Suppose we want to calculate a 99% confidence interval for the mean number of M&Ms in plain, peanut butter and peanut M&Ms. These can be calculated using the summary statistics output:

Summary statistics for Total: Group by: Type

Type  n   Mean       Variance  Std. Dev.  Std. Err.   Median  Range  Min  Max  Q1  Q3
M     84  17.297619  8.259753  2.8739786  0.3135768   18      14     7    21   17  19
P     40  8.675      9.814744  3.1328492  0.49534693  8       15     6    21   7   8
PB    46  10.913043  3.325604  1.8236238  0.26887867  11      10     8    18   10  11

Using this output we can calculate the confidence intervals for the mean number of M&Ms in each type. Do this.


StatCrunch will also give the CIs

p  Go to Stats -> t-statistics -> one-sample -> with data -> select the column you want to analyse (choose the Group by option if you want it grouped); on the next page select confidence interval and the level you want.

Sample mean  Std. err  DF  L limit  U limit
17.2         0.31      83  16.4     18.12
8.6          0.49      39  7.33     10.01
10.9         0.268     45  10.18    11.63

Looking at the intervals, do you think that the mean number of M&Ms in a plain bag and in a peanut bag could be the same? What about the mean number in peanut and peanut butter bags? Later on we shall make a formal test of these questions.


Calculation practice: coffee shop sales

A marketing firm randomly samples 45 coffee shops and determines their annual sales. The sample has an average of $2.67 million and a standard deviation of $1.03 million. What can we say with 90% confidence about the mean annual sales for the population of all coffee shops?

x̄ ± t* × s/√n

p  The degrees of freedom is 45 − 1 = 44.

p  For 90% confidence, we find t* = 1.680.

p  The margin of error is 1.680 × 1.03/√45 ≈ 0.258.

p  So the interval for the true mean is 2.67 ± 0.26.

p  “We conclude that the mean annual sales of all coffee shops is between $2.41 million and $2.93 million, with 90% confidence.”


Summary of confidence interval for µ

p  The confidence interval for a population mean µ is x̄ ± t* × s/√n.

p  t* is obtained from Student’s t distribution using n − 1 degrees of freedom. (Table D in the textbook.)

p  t* is the value such that the confidence level C is the area between −t* and t*.

p  Confidence is the proportion of samples that lead to a correct conclusion (for a specific method of inference).

p  The investigator chooses the confidence level C.

p  Tradeoff: more confidence means bigger margin of error, wider intervals.

p  The degrees of freedom is associated with s, the estimate for σ.

p  The margin of error t* × s/√n also depends on the sample size: larger samples are better.
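
The definition of t* as the value for which the area between −t* and t* equals C can be turned directly into a small numerical search. The sketch below is illustrative only and is not how Table D or StatCrunch produce their values: the helper names (`t_pdf`, `area_between`, `t_star`) are made up, the area is approximated with Simpson's rule, and t* is found by bisection.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def area_between(t, df, steps=1000):
    """Approximate area under the t density between -t and t (Simpson's rule)."""
    h = t / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3            # density is symmetric: double the area on [0, t]

def t_star(conf, df):
    """Find t* such that the area between -t* and t* equals conf."""
    lo, hi = 0.0, 1.0
    while area_between(hi, df) < conf:  # widen until the target area is bracketed
        hi *= 2
    for _ in range(50):                 # bisection
        mid = (lo + hi) / 2
        if area_between(mid, df) < conf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_star(0.95, 8), 3))    # compare with the Table D value 2.306
print(round(t_star(0.99, 8), 3))    # compare with the Table D value 3.355
print(round(t_star(0.90, 44), 3))   # compare with 1.680 from the coffee shop example
```

Raising the confidence level or lowering the degrees of freedom both push t* up, which is the tradeoff listed in the summary.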


Sample size and experimental design

An investigator may need a certain margin of error m (e.g., in a marketing survey, in a drug trial, etc.).

So plan ahead what sample size to use to achieve that margin of error.

You will have to guess the value of σ, perhaps from historical data, and you will not know the degrees of freedom at first. But you can do a rough calculation.

This is done in the planning stages of the study. It is not an inference or conclusion and there are no data yet.

Remember, too, that there typically are costs and constraints associated with large samples. Economy and feasibility are factors that will tend to keep sample sizes smaller.

m ≈ z* × σ/√n  ⇔  n ≈ (z* × σ/m)²
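
As a rough planning aid, the sample size formula can be wrapped in a short function. This is a sketch under stated assumptions: the function name `sample_size` and the planning values (a guessed σ of 250 dollars and target margins of 50 and 25 dollars) are hypothetical and do not come from the slides.

```python
import math
from statistics import NormalDist

def sample_size(sigma_guess, margin, conf=0.95):
    """Smallest n with z* × σ/√n <= margin, i.e. n ≈ (z* × σ/m)² rounded up."""
    z_star = NormalDist().inv_cdf((1 + conf) / 2)
    return math.ceil((z_star * sigma_guess / margin) ** 2)

# Hypothetical planning values: guess σ = 250 dollars and ask for a
# margin of error of 50 dollars at 95% confidence.
print(sample_size(250, 50))   # → 97
# Halving the margin of error roughly quadruples the required sample size.
print(sample_size(250, 25))   # → 385
```

Because the final interval will use t* rather than z*, and σ is only a guess, the result should be read as a rough lower bound on n.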


Interpretation of confidence, again

p  The confidence level C is the proportion of all possible random samples (of size n) that will give results leading to a correct conclusion, for a specific method.

p  In other words, if many random samples were obtained and confidence intervals were constructed from their data with C = 95%, then 95% of the intervals would contain the true parameter value.

p  In the same way, if an investigator always uses C = 95%, then 95% of the confidence intervals he constructs will contain the parameter value being estimated.

p  But he never knows which ones do!

p  Changing the method (such as changing the value of t*) will change the confidence level.

p  Once computed, any individual confidence interval either will or will not contain the true population parameter value. It is not random.

p  It is not correct to say C is the probability that the true value falls in the particular interval you have computed.
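
The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: the normal population with mean 980 and standard deviation 250 is a hypothetical choice, the seed is arbitrary, and t* = 2.262 is the t-table value for df = 9 and C = 95%.

```python
import math
import random
import statistics

random.seed(301)                 # fixed seed so the run is repeatable

MU, SIGMA = 980.0, 250.0         # hypothetical "true" population values
N = 10                           # sample size, so df = 9
T_STAR = 2.262                   # t-table value for df = 9, C = 95%
TRIALS = 5000                    # number of simulated samples

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = statistics.mean(sample)
    margin = T_STAR * statistics.stdev(sample) / math.sqrt(N)
    if xbar - margin <= MU <= xbar + margin:
        hits += 1                # this interval captured the true mean

coverage = hits / TRIALS
print(f"{coverage:.3f} of the {TRIALS} intervals contain the true mean")
```

The printed proportion should land close to 0.95. Each individual interval either contains 980 or it does not; the 95% describes the method across many samples, not any single interval.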


Cautions about using x̄ ± t* × s/√n

p  This formula is only for inference about µ, the population mean. Different formulas are used for inference about other parameters.

p  The data must be a simple random sample from the population.

p  The formula is not quite correct for other sampling designs. (But see a statistician to get the right inference method.)

p  Confidence intervals based on t* are not resistant to outliers.

p  If n is small and the population is not normal, the true confidence level could be smaller than C. (Usually n ≥ 30 suffices unless the data are highly skewed.)

p  This inference cannot rescue sampling bias, badly produced data or computational errors.
