estimation i: 1 an introduction to estimation. estimation i: 2 biostatistics serves two purposes...
Post on 20-Dec-2015
216 views
TRANSCRIPT
Estimation I: 2
Biostatistics serves two purposes (among others):
To use information from a sample of data to
1. Describe our best guess of the characteristics of
the population.
• Best guess estimation
2. Gauge the plausibility of alternative explanations
for what is observed in the sample.
• Hypothesis testing
Estimation I: 3
Estimates vs Parameters:
Sample Statistics Population Parameters
Mean:Variance: S2
Mean: Variance: 2
What does it mean to say we know X from a sample, but we don‘t know ?
• We can observe a sample mean and then use it as an estimate of the true, unknown population mean.
Standard Deviation: S Standard Deviation:
X
Estimation I: 4
Some Notation / Definitions
1. Estimation: The computation of a statistic from sample data for the purposes of obtaining a guess of the unknown population parameter value.
2. Estimator: The label given to the statistic that is to be calculated in estimation.e.g., sample mean,
sample standard deviation, S
1. Estimate: The value of the estimator takes when calculated using an actual sample of data.e.g., x = 10 minutes
s = 3 minutes
X
Estimation I: 5
What criteria should we use to define a good estimate?
1) In the long run, “correct”if we imagine sampling over and over, the average of repeated sampling should result in the correct answer: UNBIASEDNESS
2) In the short run, “in error by as little as possible”(most of the time, it should be “close” to the true value)
This is the concept of precision.
It is also called the statistical concept of
minimum variance.
Estimation I: 6
Example: Is the sample mean or the sample median a better choice as an estimate of µ, the true population mean, for the normal distribution?
1. Unbiasedness:Both the sample mean and the median are unbiased estimates of µ. (Note this is true for the Normal Distribution, but does not hold for all distributions).
2. Precision:For the sampling distributions of sample means and sample medians, it can be shown that
Variance(sample means) < Variance(sample medians)
For X~ Normal, the sample mean is said to be a minimum variance unbiased estimator (mvue) of µ.
Estimation I: 7
If the data are normally distributed:
X ~ N(, 2) ~ N(, 2/n)
That is, we know that the sampling distribution of sample means from this population will follow
• a Normal distribution with
• the same mean as the underlying population
• A decreased variance relative to the underlying population: 2/n
X
Estimation I: 8
We can then make statements about the probability of observing a value of within some interval around :
n n
68%
1.96n 1.96
n
95%
• is within 1 standard error of 68% of the time.
• is within 1.96 standard errors of 95% of the time.
• is within 2.576 standard errors of 99% of the time.
X
X
XX
Estimation I: 9
2 types of Estimators
Point Estimators Interval Estimators
Single best guess
Form of estimate:
a value
e.g., x = 10 ml
Range of values
Form of estimate:
(lower limit, upper limit)
e.g., (5, 15) ml
We’ve been working with point estimators;
Our next step is to define interval estimators.
Estimation I: 10
There are 3 ingredients to a confidence interval:
1. a point estimator (e.g., x)
2. the SE of the point estimator (e.g., /n)
3. a confidence coefficient with an associated probability (e.g., a percentile of a Normal distribution)
Estimation I: 11
The form of the confidence interval is then:
Lower limit = (Point Estimator) – [(conf. coeff)
(SE)]
Upper limit = (Point Estimator) + [(conf. coeff)
(SE)]
For example, for a mean:
LL = – c (/n)
UL = + c (/n)
Next Step: where does ‘c’ come from?
X
X
Estimation I: 12
Interpretation of a 95% Confidence Interval
Population
( )
( )( )
( )( )
( )( )
( )( )
( )( )
( )( )
( )( )
( )( )
( )( )
( )( )
( )
x
In repeated sampling, each sample gives rise to a point and interval estimate of the mean.
Estimation I: 13
Each sample gives rise to its own
• point estimate and
• confidence interval estimate built around the
point estimate.
We will construct our intervals so that:
• If all possible samples of a given sample size
were drawn from the underlying distribution and
• each sample gave rise to its own interval estimate
• then 95% of all such intervals would include the
unknown µ while 5% would not
Estimation I: 14
Interpreting Confidence Intervals:Example: Take a sample, estimate xbar, and compute an interval estimate for the mean: (1.3, 9.5)
Correct:“95% of intervals constructed in this manner will include the population mean .”
Incorrect:“The probability that the interval (1.3, 9.5) contains µ is 0.95.”
The 2nd statement is incorrect because once an interval is constructed it either contains the mean or it doesn’t. Since we don’t know mu, we just don’t know which it is.
Estimation I: 15
Computing a Confidence Interval for
Case 1: 2 known
Example: The weight in micrograms of a drug inside of capsules is Normally distributed, with 2 = 0.25
We are given a sample of n=30 capsules with a mean weight of 0.51 micrograms, and asked to construct a 95% confidence interval estimate of the population mean weight.
1. The point interval estimate is x = 0.51 gm
2. The standard error of x is:
0.250.091
30x gm
n
Estimation I: 16
3. We set the confidence coefficient to equal
z0.975=1.96 for a 95% confidence interval.
We can then compute the lower and upper limits as:
0.51 ( 1.96)(0.091) 0.332lLL x zn
0.51 (1.96)(0.091) 0.688uUL x zn
The 95% Confidence interval for the mean weight of drug per capsule is (0.33, 0.69) micrograms.
Estimation I: 17
Deriving the expression for a Confidence Interval
We want: Pr (zl Z zu) = 1 = 0.95
zl = 2.5th percentile of Normal
zu = 97.5th percentile of Normal
0
This shaded area is /2 = .025
This area is (1- ) = .95
zl zu
We split the excluded area (=.05) symmetrically around the mean.
This shaded area is /2 = .025
Estimation I: 18
Recall that we want the Inverse Cumulative Distribution to find a percentile of the Normal.
In Minitab: Calc Probability Distributions Normal
percentile:
.025, .975
Estimation I: 19
We can find the percentiles for:
/2 = .05/2 = .025 z.025 = -1.96
and
1 – /2 = 1 – (.05/2) = .975 z.975 = 1.96
We now have: Pr[-1.96 Z 1.96) = .95
0
This shaded area is /2 = .025
This area is (1- ) = .95
z.025
This shaded area is /2 = .025
z.975
Estimation I: 20
Derivation of a confidence interval on X:
We know (simple standardization): /
XZ
n
Pr( ) Pr
/l u l u
Xz Z z z z
n
Pr l uz X zn n
Pr l uX z X zn n
Substitute in for Z
Multiply by SE
Subtract X
This is of form: (pt. estimate) + (conf. coeff) (std error)
Pr l uX z X zn n
Estimation I: 21
When zl=-1.96 and zu=1.96 ,
We can then compute the lower and upper limits as:
The 95% Confidence interval for the mean weight of drug per capsule is (0.33, 0.69) micrograms.
Pr 0.95l uX z X zn n
0.51 1.96 0.091 0.332
0.51 1.96 0.091 0.688
l
u
LL x zn
UL x zn
Estimation I: 22
Bottom Line:
ConfidenceInterval
Estimate
PointEstimate
ConfidenceCoefficient
StdError
X Percentile From N(0,1) n
Commonly used Confidence Coefficients from N(0,1):
• For a 90% confidence interval z.95 = 1.645
• For a 95% confidence interval z.975 = 1.96
• For a 99% confidence interval z.995 = 2.576
Estimation I: 23
Example
a. What is the 95% confidence interval estimate
of the year 2004 population mean height?
b. the 99% confidence interval estimate?
In 1990 U.S. census, the mean height of men in the US was = 69 inches, and = 3 inches.
By the year 2004 heights may have changed but we will assume the standard deviation is the same, and known: = 3 inches.
Since we can’t afford to measure the whole population we’ll take a sample of n=100 men.
We observe a mean height of x = 70 inches.
Estimation I: 24
Solution
This area is .025
z.975=1.96
We want (1) = 0.95 for a 95% confidence interval:
.95
370 ( 1.96)( ) 69.4
100lLL x zn
370 (1.96)( ) 70.6
100lUL x zn
Estimation I: 25
Do we have evidence that heights of men have changed since 1990?
That is, is the 1990 mean height within this interval?
Since 69 inches is outside of the 95% confidence interval, and most samples would result in a confidence interval that includes the population mean, it seems reasonable to conclude that heights of men in the U.S. may have changed.
We suspect that the mean height is greater in 2004 than in 1990.
The 95% C.I. estimate of the population mean height is (69.4, 70.6) inches.
Estimation I: 26
b. For desired confidence = 0.99
This area is 0.005
z.995 2.576
.99
.995
399% 70 (2.576)( )
100CI x z
n
The 99% confidence interval estimate is (69.2, 70.8) inches.
Since 69 inches is outside of the 99% confidence interval, it seems reasonable to conclude that heights of men in the U.S. might have changed. The average height in 2004 appears greater than in 1990.
Estimation I: 27
Note that the 99% confidence interval is wider than the 95% CI.
• To have greater confidence that we know the mean, based upon the same sample, we have a wider interval of values for .
x
69.2 69.4 70 70.6 70.8
95% CI
99% CI
Estimation I: 28
Hint on the confidence coefficient:
For a (1) C.I.:
use the 1 percentile of the N(0,1).
( 1 - )
Z1-(/2)
For example, for a 95% confidence interval, (1– ) = .95
Thus, = .05 so that /2 = .025
1 – (/2) = 1 – .025 = .975
Thus, we want the .975 or 97.5th percentile for a 95% confidence interval.
Z/2
Estimation I: 29
Another Example:
A random sample of 25 women has a mean systolic
blood pressure = 120 mmhg. Assuming the
underlying distribution of SBP across women is
Normal with = 10 mmhg, find the 99% confidence
interval estimate of the unknown true mean, µ.
Solution:
1. Point Estimate: x = 120
. = 10 is known
x= / n = 10 / 25 = 10 / 5 = 2
Estimation I: 30
0.99 This area =.005
This area =.005
3. To get confidence coefficient:
(1) = 0.99 = .01
/2 = .005 1– (/2) = .995
z.995 = 2.576
z.995 = 2.576
With 99% confidence, the mean systolic blood pressure of the population of women that this sample represents, is between 114.9 and 125.2 mmhg.
4. Confidence Interval: Pt. Est. (Conf.Coeff)(SE)
.99599% 120 (2.576)(2) (114.9,125.2)CI x zn
Estimation I: 31
Recap:A confidence interval for a mean has the form:
This holds when:
1. The data are normally distributed, with known variance, 2
2. The data are not normally distributed, but the sample size n is large, so that the sampling distribution of the mean is approx. normal
ConfidenceInterval
Estimate
PointEstimate
ConfidenceCoefficient
StdError
X Percentile From N(0,1) n
Estimation I: 32
Up to this point we have been looking at
• estimation of an unknown population mean ()
• using data from a sample ( x and C.I.)
• assuming that we “know” the population variance, 2.
In reality, we typically have to
• estimate the population variance, using s2
• along with estimating the mean
• from the same sample.
How does this effect confidence interval estimation for a mean?
Estimation I: 33
What do we do if is UNKNOWN?
2. It seems like a reasonable idea to replace with “s”Recall: s is the sample standard deviation
1. We know how to calculate a confidence interval
for the mean when 2 is known:
2 2 2
1
1, ( )
1
n
ii
where x xn
S S S
Note that estimation of depends upon our estimate of µ .
1 /2X Zn
Estimation I: 34
3. The snag is that we can no longer use the multiplier
z from the Normal distribution. In particular
~ (0,1)/
XZ N
n
~ ? ( )
/
Xt not NormalS n
When we replace the true (but unknown) value of the standard error with an estimate of the standard error:
• Instead of a Z-score, we now have
• A t-statistic:
/
XtS n
Estimation I: 35
This random variable, t, is said to follow
• a Student’s t-distribution
• with degrees of freedom = n-1
• IF the underlying data come from a Normal
Distribution!!
That is:
If
Then1~
/n
Xt tS n
2~ ( , )X N
Estimation I: 36
Features of the Student’s t-Distribution
1~/
n
Xt tS n
The Student’s t distribution is Bell-shaped Symmetric about zero Flatter than the Normal (0,1). This means
• Variability is greater• More area under the tails, less at center• Resulting confidence intervals will be wider.
N(0,1)
0
tn-1
Estimation I: 37
This greater variability or spread of the t-distribution should make intuitive sense –
• we are using an estimate of the standard error rather than the true value
we have added uncertainty in our confidence interval
Each degree of freedom (df) defines a separate t-distribution
• The greater the df, the closer to the normal distribution df = n-1
As n gets large, tn-1 N(0,1)
Estimation I: 39
d.f. t .90 … t .9951 3.078 63.657
2 1.886 9.925
How to Use the Table of Percentiles of the Student’s t-distribution
Table 5 (p. 757) in Rosner
Each row gives information for a separate
t-distribution defined by the df=n-1
The column heading tells you which percentile
will be given to you in the body of the table.
The body of the table is comprised of values
of the percentile
. . .
Estimation I: 40
This number is the percentile in the body of the table.
t-distribution with 1 df
3.078
.90
This area = 0.90
• From the first row (df=1), under the column t.90
• Read tdf=1,.90 = 3.078
That is,
• Pr(tdf=1 3.078) = .90
Estimation I: 41
Using Minitab to Get Percentiles of the t-distribution
Calc Probability Distributions t…
Select Inverse Cumulative Prob to get percentile
Enter df = n-1
Enter desired Percentile
Estimation I: 42
Inverse Cumulative Distribution Function
Student's t distribution with 1 DF
0.9000 3.0777
Using Minitab to Get Percentiles of the t-distribution
tdf=1,.90 = 3.078
.90
t-distribution with 1 df
1,0.90t1,0.90( )P t t
Estimation I: 43
Computation of a confidence interval for the mean when the population variance is unknown:
• Replace the confidence coefficient from the N(0,1)
• with one from tn-1
• use an estimate of the standard error in place of the true standard error
When 2 known
When 2 UNknown
1 / 2( )( / )X z n
1;1 / 2( )( / )nX t nS
Estimation I: 44
Example
A random sample of n = 20 recent cardiac bypass surgeries has a mean duration of x = 267 minutes, and sample variance s2 = 36,700 min2.
Assuming the underlying distribution of surgery duration is normal with unknown variance, find the 90% CI estimate of the true mean duration of surgery, .
1. x = 267
2. s = 36700 = 191.6
3. se(x) = s / n = 191.6 / 20 = 42.8
Estimation I: 45
4. Find the confidence coefficient from the t-distribution
a. n = 20 df = 19
b. For (1) = .90 = .10 /2 = .05
1/2 = .95
c. tdf;1/2 = t19;.95 = 1.729
This area = .05
Look up value ofthe 95th percentile
90%t19
-1.73 t19;.95 =1.729
This area = .0595%
Estimation I: 46
A 90% confidence interval for the true mean duration of surgery is (193.2, 340.8) minutes.
5. 90% CI = Pt. Est. (Conf Coeff)(Std Error Est.)
= x (t19;.95) ( s / n )
= 267 (1.729) (42.8)
= (193.2 , 340.8)
Estimation I: 47
Estimation Highlights / Main Points
1. Estimation provides “guesses” of unknown population parameter values using information from a sample
2. While there may be many criteria for the selection of a “good” estimate, we’ll use two criteria:
1) UNbiasedness (in the long run, correct)
2) Minimum variance (in the short run, the smallest error possible)
Estimation I: 48
1. We can calculate both point and interval
estimates
2. Confidence interval estimates have the
advantage of providing a sense of the
precision in our data.
• Wide intervals poor precision
• Narrow intervals high precision
Estimation I: 49
3. The width of the interval is a function of the
confidence coefficient (percentile of a
distribution) and the standard error
greater confidence wider intervals
larger samples more narrow intervals
(as n gets large /n or s/n gets smaller )
Estimation I: 50
x x
2 known
Point Estimate x x
What to use for variance
SE of x n s n
Standardization
z =x - n
t = x - s n
Distribution ofStandard Score Normal(0,1) Student’s t, with df=n-1
Confidence Interval ± z n[ ] ± t s n[ ]
Computing a Confidence Interval for
AssumptionsRandom Sample from Normal, or n large
Random Sample fromNormal distribution
2 NOT known
s