estimation i: 1 an introduction to estimation. estimation i: 2 biostatistics serves two purposes...

50
Estimation I: 1 An Introduction to Estimation

Post on 20-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Estimation I: 1

An Introduction to Estimation

Estimation I: 2

Biostatistics serves two purposes (among others):

To use information from a sample of data to

1. Describe our best guess of the characteristics of

the population.

• Best guess estimation

2. Gauge the plausibility of alternative explanations

for what is observed in the sample.

• Hypothesis testing

Estimation I: 3

Estimates vs Parameters:

Sample Statistics Population Parameters

Mean:Variance: S2

Mean: Variance: 2

What does it mean to say we know X from a sample, but we don‘t know ?

• We can observe a sample mean and then use it as an estimate of the true, unknown population mean.

Standard Deviation: S Standard Deviation:

X

Estimation I: 4

Some Notation / Definitions

1. Estimation: The computation of a statistic from sample data for the purposes of obtaining a guess of the unknown population parameter value.

2. Estimator: The label given to the statistic that is to be calculated in estimation.e.g., sample mean,

sample standard deviation, S

1. Estimate: The value of the estimator takes when calculated using an actual sample of data.e.g., x = 10 minutes

s = 3 minutes

X

Estimation I: 5

What criteria should we use to define a good estimate?

1) In the long run, “correct”if we imagine sampling over and over, the average of repeated sampling should result in the correct answer: UNBIASEDNESS

2) In the short run, “in error by as little as possible”(most of the time, it should be “close” to the true value)

This is the concept of precision.

It is also called the statistical concept of

minimum variance.

Estimation I: 6

Example: Is the sample mean or the sample median a better choice as an estimate of µ, the true population mean, for the normal distribution?

1. Unbiasedness:Both the sample mean and the median are unbiased estimates of µ. (Note this is true for the Normal Distribution, but does not hold for all distributions).

2. Precision:For the sampling distributions of sample means and sample medians, it can be shown that

Variance(sample means) < Variance(sample medians)

For X~ Normal, the sample mean is said to be a minimum variance unbiased estimator (mvue) of µ.

Estimation I: 7

If the data are normally distributed:

X ~ N(, 2) ~ N(, 2/n)

That is, we know that the sampling distribution of sample means from this population will follow

• a Normal distribution with

• the same mean as the underlying population

• A decreased variance relative to the underlying population: 2/n

X

Estimation I: 8

We can then make statements about the probability of observing a value of within some interval around :

n n

68%

1.96n 1.96

n

95%

• is within 1 standard error of 68% of the time.

• is within 1.96 standard errors of 95% of the time.

• is within 2.576 standard errors of 99% of the time.

X

X

XX

Estimation I: 9

2 types of Estimators

Point Estimators Interval Estimators

Single best guess

Form of estimate:

a value

e.g., x = 10 ml

Range of values

Form of estimate:

(lower limit, upper limit)

e.g., (5, 15) ml

We’ve been working with point estimators;

Our next step is to define interval estimators.

Estimation I: 10

There are 3 ingredients to a confidence interval:

1. a point estimator (e.g., x)

2. the SE of the point estimator (e.g., /n)

3. a confidence coefficient with an associated probability (e.g., a percentile of a Normal distribution)

Estimation I: 11

The form of the confidence interval is then:

Lower limit = (Point Estimator) – [(conf. coeff)

(SE)]

Upper limit = (Point Estimator) + [(conf. coeff)

(SE)]

For example, for a mean:

LL = – c (/n)

UL = + c (/n)

Next Step: where does ‘c’ come from?

X

X

Estimation I: 12

Interpretation of a 95% Confidence Interval

Population

( )

( )( )

( )( )

( )( )

( )( )

( )( )

( )( )

( )( )

( )( )

( )( )

( )( )

( )

x

In repeated sampling, each sample gives rise to a point and interval estimate of the mean.

Estimation I: 13

Each sample gives rise to its own

• point estimate and

• confidence interval estimate built around the

point estimate.

We will construct our intervals so that:

• If all possible samples of a given sample size

were drawn from the underlying distribution and

• each sample gave rise to its own interval estimate

• then 95% of all such intervals would include the

unknown µ while 5% would not

Estimation I: 14

Interpreting Confidence Intervals:Example: Take a sample, estimate xbar, and compute an interval estimate for the mean: (1.3, 9.5)

Correct:“95% of intervals constructed in this manner will include the population mean .”

Incorrect:“The probability that the interval (1.3, 9.5) contains µ is 0.95.”

The 2nd statement is incorrect because once an interval is constructed it either contains the mean or it doesn’t. Since we don’t know mu, we just don’t know which it is.

Estimation I: 15

Computing a Confidence Interval for

Case 1: 2 known

Example: The weight in micrograms of a drug inside of capsules is Normally distributed, with 2 = 0.25

We are given a sample of n=30 capsules with a mean weight of 0.51 micrograms, and asked to construct a 95% confidence interval estimate of the population mean weight.

1. The point interval estimate is x = 0.51 gm

2. The standard error of x is:

0.250.091

30x gm

n

Estimation I: 16

3. We set the confidence coefficient to equal

z0.975=1.96 for a 95% confidence interval.

We can then compute the lower and upper limits as:

0.51 ( 1.96)(0.091) 0.332lLL x zn

0.51 (1.96)(0.091) 0.688uUL x zn

The 95% Confidence interval for the mean weight of drug per capsule is (0.33, 0.69) micrograms.

Estimation I: 17

Deriving the expression for a Confidence Interval

We want: Pr (zl Z zu) = 1 = 0.95

zl = 2.5th percentile of Normal

zu = 97.5th percentile of Normal

0

This shaded area is /2 = .025

This area is (1- ) = .95

zl zu

We split the excluded area (=.05) symmetrically around the mean.

This shaded area is /2 = .025

Estimation I: 18

Recall that we want the Inverse Cumulative Distribution to find a percentile of the Normal.

In Minitab: Calc Probability Distributions Normal

percentile:

.025, .975

Estimation I: 19

We can find the percentiles for:

/2 = .05/2 = .025 z.025 = -1.96

and

1 – /2 = 1 – (.05/2) = .975 z.975 = 1.96

We now have: Pr[-1.96 Z 1.96) = .95

0

This shaded area is /2 = .025

This area is (1- ) = .95

z.025

This shaded area is /2 = .025

z.975

Estimation I: 20

Derivation of a confidence interval on X:

We know (simple standardization): /

XZ

n

Pr( ) Pr

/l u l u

Xz Z z z z

n

Pr l uz X zn n

Pr l uX z X zn n

Substitute in for Z

Multiply by SE

Subtract X

This is of form: (pt. estimate) + (conf. coeff) (std error)

Pr l uX z X zn n

Estimation I: 21

When zl=-1.96 and zu=1.96 ,

We can then compute the lower and upper limits as:

The 95% Confidence interval for the mean weight of drug per capsule is (0.33, 0.69) micrograms.

Pr 0.95l uX z X zn n

0.51 1.96 0.091 0.332

0.51 1.96 0.091 0.688

l

u

LL x zn

UL x zn

Estimation I: 22

Bottom Line:

ConfidenceInterval

Estimate

PointEstimate

ConfidenceCoefficient

StdError

X Percentile From N(0,1) n

Commonly used Confidence Coefficients from N(0,1):

• For a 90% confidence interval z.95 = 1.645

• For a 95% confidence interval z.975 = 1.96

• For a 99% confidence interval z.995 = 2.576

Estimation I: 23

Example

a. What is the 95% confidence interval estimate

of the year 2004 population mean height?

b. the 99% confidence interval estimate?

In 1990 U.S. census, the mean height of men in the US was = 69 inches, and = 3 inches.

By the year 2004 heights may have changed but we will assume the standard deviation is the same, and known: = 3 inches.

Since we can’t afford to measure the whole population we’ll take a sample of n=100 men.

We observe a mean height of x = 70 inches.

Estimation I: 24

Solution

This area is .025

z.975=1.96

We want (1) = 0.95 for a 95% confidence interval:

.95

370 ( 1.96)( ) 69.4

100lLL x zn

370 (1.96)( ) 70.6

100lUL x zn

Estimation I: 25

Do we have evidence that heights of men have changed since 1990?

That is, is the 1990 mean height within this interval?

Since 69 inches is outside of the 95% confidence interval, and most samples would result in a confidence interval that includes the population mean, it seems reasonable to conclude that heights of men in the U.S. may have changed.

We suspect that the mean height is greater in 2004 than in 1990.

The 95% C.I. estimate of the population mean height is (69.4, 70.6) inches.

Estimation I: 26

b. For desired confidence = 0.99

This area is 0.005

z.995 2.576

.99

.995

399% 70 (2.576)( )

100CI x z

n

The 99% confidence interval estimate is (69.2, 70.8) inches.

Since 69 inches is outside of the 99% confidence interval, it seems reasonable to conclude that heights of men in the U.S. might have changed. The average height in 2004 appears greater than in 1990.

Estimation I: 27

Note that the 99% confidence interval is wider than the 95% CI.

• To have greater confidence that we know the mean, based upon the same sample, we have a wider interval of values for .

x

69.2 69.4 70 70.6 70.8

95% CI

99% CI

Estimation I: 28

Hint on the confidence coefficient:

For a (1) C.I.:

use the 1 percentile of the N(0,1).

( 1 - )

Z1-(/2)

For example, for a 95% confidence interval, (1– ) = .95

Thus, = .05 so that /2 = .025

1 – (/2) = 1 – .025 = .975

Thus, we want the .975 or 97.5th percentile for a 95% confidence interval.

Z/2

Estimation I: 29

Another Example:

A random sample of 25 women has a mean systolic

blood pressure = 120 mmhg. Assuming the

underlying distribution of SBP across women is

Normal with = 10 mmhg, find the 99% confidence

interval estimate of the unknown true mean, µ.

Solution:

1. Point Estimate: x = 120

. = 10 is known

x= / n = 10 / 25 = 10 / 5 = 2

Estimation I: 30

0.99 This area =.005

This area =.005

3. To get confidence coefficient:

(1) = 0.99 = .01

/2 = .005 1– (/2) = .995

z.995 = 2.576

z.995 = 2.576

With 99% confidence, the mean systolic blood pressure of the population of women that this sample represents, is between 114.9 and 125.2 mmhg.

4. Confidence Interval: Pt. Est. (Conf.Coeff)(SE)

.99599% 120 (2.576)(2) (114.9,125.2)CI x zn

Estimation I: 31

Recap:A confidence interval for a mean has the form:

This holds when:

1. The data are normally distributed, with known variance, 2

2. The data are not normally distributed, but the sample size n is large, so that the sampling distribution of the mean is approx. normal

ConfidenceInterval

Estimate

PointEstimate

ConfidenceCoefficient

StdError

X Percentile From N(0,1) n

Estimation I: 32

Up to this point we have been looking at

• estimation of an unknown population mean ()

• using data from a sample ( x and C.I.)

• assuming that we “know” the population variance, 2.

In reality, we typically have to

• estimate the population variance, using s2

• along with estimating the mean

• from the same sample.

How does this effect confidence interval estimation for a mean?

Estimation I: 33

What do we do if is UNKNOWN?

2. It seems like a reasonable idea to replace with “s”Recall: s is the sample standard deviation

1. We know how to calculate a confidence interval

for the mean when 2 is known:

2 2 2

1

1, ( )

1

n

ii

where x xn

S S S

Note that estimation of depends upon our estimate of µ .

1 /2X Zn

Estimation I: 34

3. The snag is that we can no longer use the multiplier

z from the Normal distribution. In particular

~ (0,1)/

XZ N

n

~ ? ( )

/

Xt not NormalS n

When we replace the true (but unknown) value of the standard error with an estimate of the standard error:

• Instead of a Z-score, we now have

• A t-statistic:

/

XtS n

Estimation I: 35

This random variable, t, is said to follow

• a Student’s t-distribution

• with degrees of freedom = n-1

• IF the underlying data come from a Normal

Distribution!!

That is:

If

Then1~

/n

Xt tS n

2~ ( , )X N

Estimation I: 36

Features of the Student’s t-Distribution

1~/

n

Xt tS n

The Student’s t distribution is Bell-shaped Symmetric about zero Flatter than the Normal (0,1). This means

• Variability is greater• More area under the tails, less at center• Resulting confidence intervals will be wider.

N(0,1)

0

tn-1

Estimation I: 37

This greater variability or spread of the t-distribution should make intuitive sense –

• we are using an estimate of the standard error rather than the true value

we have added uncertainty in our confidence interval

Each degree of freedom (df) defines a separate t-distribution

• The greater the df, the closer to the normal distribution df = n-1

As n gets large, tn-1 N(0,1)

Estimation I: 38

t df = 5

t df = 25

Normal (0,1)

As n gets large, tn-1 N(0,1)

Estimation I: 39

d.f. t .90 … t .9951 3.078 63.657

2 1.886 9.925

How to Use the Table of Percentiles of the Student’s t-distribution

Table 5 (p. 757) in Rosner

Each row gives information for a separate

t-distribution defined by the df=n-1

The column heading tells you which percentile

will be given to you in the body of the table.

The body of the table is comprised of values

of the percentile

. . .

Estimation I: 40

This number is the percentile in the body of the table.

t-distribution with 1 df

3.078

.90

This area = 0.90

• From the first row (df=1), under the column t.90

• Read tdf=1,.90 = 3.078

That is,

• Pr(tdf=1 3.078) = .90

Estimation I: 41

Using Minitab to Get Percentiles of the t-distribution

Calc Probability Distributions t…

Select Inverse Cumulative Prob to get percentile

Enter df = n-1

Enter desired Percentile

Estimation I: 42

Inverse Cumulative Distribution Function

Student's t distribution with 1 DF

0.9000 3.0777

Using Minitab to Get Percentiles of the t-distribution

tdf=1,.90 = 3.078

.90

t-distribution with 1 df

1,0.90t1,0.90( )P t t

Estimation I: 43

Computation of a confidence interval for the mean when the population variance is unknown:

• Replace the confidence coefficient from the N(0,1)

• with one from tn-1

• use an estimate of the standard error in place of the true standard error

When 2 known

When 2 UNknown

1 / 2( )( / )X z n

1;1 / 2( )( / )nX t nS

Estimation I: 44

Example

A random sample of n = 20 recent cardiac bypass surgeries has a mean duration of x = 267 minutes, and sample variance s2 = 36,700 min2.

Assuming the underlying distribution of surgery duration is normal with unknown variance, find the 90% CI estimate of the true mean duration of surgery, .

1. x = 267

2. s = 36700 = 191.6

3. se(x) = s / n = 191.6 / 20 = 42.8

Estimation I: 45

4. Find the confidence coefficient from the t-distribution

a. n = 20 df = 19

b. For (1) = .90 = .10 /2 = .05

1/2 = .95

c. tdf;1/2 = t19;.95 = 1.729

This area = .05

Look up value ofthe 95th percentile

90%t19

-1.73 t19;.95 =1.729

This area = .0595%

Estimation I: 46

A 90% confidence interval for the true mean duration of surgery is (193.2, 340.8) minutes.

5. 90% CI = Pt. Est. (Conf Coeff)(Std Error Est.)

= x (t19;.95) ( s / n )

= 267 (1.729) (42.8)

= (193.2 , 340.8)

Estimation I: 47

Estimation Highlights / Main Points

1. Estimation provides “guesses” of unknown population parameter values using information from a sample

2. While there may be many criteria for the selection of a “good” estimate, we’ll use two criteria:

1) UNbiasedness (in the long run, correct)

2) Minimum variance (in the short run, the smallest error possible)

Estimation I: 48

1. We can calculate both point and interval

estimates

2. Confidence interval estimates have the

advantage of providing a sense of the

precision in our data.

• Wide intervals poor precision

• Narrow intervals high precision

Estimation I: 49

3. The width of the interval is a function of the

confidence coefficient (percentile of a

distribution) and the standard error

greater confidence wider intervals

larger samples more narrow intervals

(as n gets large /n or s/n gets smaller )

Estimation I: 50

x x

2 known

Point Estimate x x

What to use for variance

SE of x n s n

Standardization

z =x - n

t = x - s n

Distribution ofStandard Score Normal(0,1) Student’s t, with df=n-1

Confidence Interval ± z n[ ] ± t s n[ ]

Computing a Confidence Interval for

AssumptionsRandom Sample from Normal, or n large

Random Sample fromNormal distribution

2 NOT known

s