
Upload: ngoliem

Post on 25-Apr-2018


ESTIMATION - CHAPTER 6 1

PREVIOUSLY

use known probability distributions (binomial, Poisson, normal) and known population parameters (mean, variance) to answer questions such as...

...given 20 births and a probability of low birth weight of p=0.10, what is the probability that 3 or more low birth weight infants are born...

...given that birth weight is distributed normally with a mean of 3750 grams and a standard deviation of 500 grams, what is the probability that an infant will have a birth weight of at least 2724 grams (6 pounds)...
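Both of these "known parameter" questions can be checked numerically. The following is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the notes.

```python
from math import comb
from statistics import NormalDist

# Binomial question: 20 births, P(low birth weight) = 0.10, P(3 or more)
n, p = 20, 0.10
p_fewer_than_3 = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(3))
p_three_or_more = 1 - p_fewer_than_3

# Normal question: mean 3750 g, SD 500 g, P(birth weight >= 2724 g)
birth_weight = NormalDist(mu=3750, sigma=500)
p_at_least_2724 = 1 - birth_weight.cdf(2724)

print(round(p_three_or_more, 3), round(p_at_least_2724, 3))
```

The binomial probability comes out at roughly 0.32 and the normal probability at roughly 0.98.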

ESTIMATION - CHAPTER 6 2

NEW PROBLEMS

• how can a sample be used to estimate the unknown parameters of a population

• how can we estimate central tendency (mean, median, mode)

• how can we estimate variability (variance, quartiles, range)

• how can we set intervals around the estimates (how "sure" are we of our estimates)

• of all the measures of central tendency and variability, which are the "best"

ESTIMATION - CHAPTER 6 3

EXAMPLES

Previously...Hypertension...

assume that the distribution of diastolic blood pressure (DBP) in 35-44 year old men is NORMAL with mean µ = 80 and standard deviation σ = 12...one can use this information to determine that 95% of all diastolic blood pressures among 35-44 year old men should fall between 56 and 104

ESTIMATION - CHAPTER 6 4

New Task...Hypertension...

assuming that the distribution of DBP in 35-44 year old men is NORMAL, what is the mean DBP in the population (what is µ and how precise is the estimate of µ)

New Task...Hypertension...

assume that the distribution of DBP in 35-44 year old men is NORMAL with mean µ = 80 and standard deviation σ = 12...you measure DBP on a sample of thirty (not 1) 35-44 year old men...is there evidence to say that the mean of your sample differs from the population mean

ESTIMATION - CHAPTER 6 5

New Task....Infectious Disease...

assuming that the number of cases in the population follows a binomial distribution, what is the prevalence of HIV-positive people in a low income census tract in an urban area (what is P, how precise is the estimate of P)

New Task...Infectious Disease...

assume that in treating gonorrhea, a daily dose of penicillin of 4.8 megaunits has a failure rate of 10% (P = 0.10)...you treat 46 patients with a daily dose of 4.0 megaunits of penicillin and find that after a week, 6 patients still have gonorrhea...is there evidence to show a difference in the failure rate of the two different doses

ESTIMATION - CHAPTER 6 6

WHEN A POPULATION IS LARGE...SAMPLING

given that one cannot measure DBP in ALL 35-44 year old men or obtain results of an HIV test for ALL people in the census tract, one must rely on a SAMPLE to estimate POPULATION parameters

important considerations...

• reliability

the sample reflects the population

sample(s) are collected in a manner that allows you to estimate how much sample results might differ from the results that would be obtained if the entire population were used

• economics

ESTIMATION - CHAPTER 6 7

types of samples...from Chapter 1 in Triola...

• random...each member of the population has an equal probability of being chosen

• simple random...a sample of size N from a population such that each possible sample of size N has an equal probability of being chosen

ESTIMATION - CHAPTER 6 8

• systematic...select every Kth member of the population (every 5th, 10th, 100th, ...)

• stratified...divide the population into subgroups (strata) based on some attribute of population members (gender, age, ...), then select a simple random sample within each subgroup

• cluster...divide the population into subgroups (clusters), randomly select clusters, then select all members (or randomly sample) within the selected clusters

• convenience...whatever is available
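The sampling schemes listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the population of labels 1..100, the seed, the parity-based strata, the blocks of 20 used as clusters, and all function names are hypothetical and not from Triola or the notes.

```python
import random

def simple_random_sample(population, n, rng):
    """Every possible sample of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, k, start=0):
    """Every k-th member, beginning at index `start`."""
    return population[start::k]

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Simple random sample of n_per_stratum members within each stratum."""
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    return {label: rng.sample(members, n_per_stratum)
            for label, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep all of their members."""
    chosen = rng.sample(clusters, n_clusters)
    return [member for cluster in chosen for member in cluster]

rng = random.Random(552)          # arbitrary seed, for repeatability only
population = list(range(1, 101))  # hypothetical population labelled 1..100

srs = simple_random_sample(population, 10, rng)
every_10th = systematic_sample(population, 10)
by_parity = stratified_sample(
    population, lambda x: "even" if x % 2 == 0 else "odd", 5, rng)
clusters = [population[i:i + 20] for i in range(0, 100, 20)]
two_clusters = cluster_sample(clusters, 2, rng)
```

A convenience sample has no sampling rule at all, which is why it is not sketched here.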

links...

Introduction to Sampling

Bias in Sampling

ESTIMATION - CHAPTER 6 9

ESTIMATES OF CENTRAL TENDENCY AND VARIATION

central tendency...    variation...
mean                   variance
median                 quartiles
mode                   range

what's best...one quality is "unbiased"...the average value of the estimate over a large number of repeated samples is the population value

ESTIMATION - CHAPTER 6 10

Triola...given a population with three members...1, 2, 5...

SAMPLE      MEAN  MEDIAN  RANGE  VAR  SD
1,1         1.0   1.0     0      0.0  0.000
1,2         1.5   1.5     1      0.5  0.707
1,5         3.0   3.0     4      8.0  2.828
2,1         1.5   1.5     1      0.5  0.707
2,2         2.0   2.0     0      0.0  0.000
2,5         3.5   3.5     3      4.5  2.121
5,1         3.0   3.0     4      8.0  2.828
5,2         3.5   3.5     3      4.5  2.121
5,5         5.0   5.0     0      0.0  0.000

MEAN        2.7   2.7     1.8    2.9  1.3
POPULATION  2.7   2       4      2.9  1.7
UNBIASED    Y     N       N      Y    N
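The table can be reproduced by listing every ordered sample of size 2 drawn with replacement from the population 1, 2, 5 and averaging each statistic over the nine samples. The sketch below is an illustration, not Triola's own code; it assumes the sample variance uses the n - 1 divisor and the population variance uses the N divisor, which is what the table's values imply.

```python
from itertools import product
from statistics import mean, median, pvariance, variance

population = [1, 2, 5]

# All 9 ordered samples of size 2, drawn with replacement
samples = list(product(population, repeat=2))

avg_of_sample_means = mean([mean(s) for s in samples])
avg_of_sample_medians = mean([median(s) for s in samples])
avg_of_sample_ranges = mean([max(s) - min(s) for s in samples])
avg_of_sample_variances = mean([variance(s) for s in samples])  # n - 1 divisor

# An estimator is unbiased when its average over all samples
# equals the corresponding population value.
print(avg_of_sample_means, mean(population))
print(avg_of_sample_medians, median(population))
print(avg_of_sample_ranges, max(population) - min(population))
print(avg_of_sample_variances, pvariance(population))
```

The averages of the sample means and sample variances match the population values, while the averages of the sample medians and ranges do not.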

ESTIMATION - CHAPTER 6 11

• another quality...estimator of central tendency with the minimum variance...Rosner uses an example with 200 samples of N=200 from a population of 1000 births...the distributions of the sample means, medians, and means of the high/low values are shown

• the distribution of the sample means is said to have the minimum spread (conclusion...minimum variance)

• the same is true for sample estimates of proportions

ESTIMATION - CHAPTER 6 12

• another illustration...use SAS to generate a population of 10,000 numbers that are normally distributed with a mean of 3750 and a standard deviation of 500

• in notation...X ~ N(3750, 250000), where the second value is the variance (500 squared)

• the mean and standard deviation were chosen to approximate a birth weight distribution

ESTIMATION - CHAPTER 6 13

the 10,000 values of x are distributed as shown on the right (looks like a NORMAL distribution) and they have the following characteristics...

Mean    Std Dev  Variance  Minimum  Maximum
3750.4  501.7    251729.8  1831.6   5728.8

ESTIMATION - CHAPTER 6 14

• the next step is to take 100 simple random samples of N=10 from the 10,000 values

• then compute the variance of the mean, median, and mean of the high/low values

• results...

Variable       Mean    Variance
mean           3755.4  23715.4
median         3759.6  33945.0
mean high/low  3757.6  38955.6

• the averages of all three point estimates are close to the population mean, but the mean has the smallest variance

• mean...minimum variance unbiased estimator
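The notes use SAS for this experiment; the sketch below repeats the same idea in Python as an illustration only. The function name, the seed, and the use of `random.gauss` to build the population are assumptions, so the numbers will not match the SAS output exactly.

```python
import random
from statistics import mean, median, variance

def estimator_variances(n_samples=100, n=10, pop_size=10_000, seed=1):
    """Variance of three estimators of centre over repeated simple random samples."""
    rng = random.Random(seed)
    # Population approximating birth weights: mean 3750, SD 500
    population = [rng.gauss(3750, 500) for _ in range(pop_size)]
    means, medians, midranges = [], [], []
    for _ in range(n_samples):
        s = rng.sample(population, n)            # sampling without replacement
        means.append(mean(s))
        medians.append(median(s))
        midranges.append((min(s) + max(s)) / 2)  # mean of the high/low values
    return {
        "mean": variance(means),
        "median": variance(medians),
        "mean high/low": variance(midranges),
    }

results = estimator_variances()
for name, var in results.items():
    print(f"{name:14s} {var:10.1f}")
```

With only 100 samples the three variances are themselves noisy; raising `n_samples` makes the ordering more stable.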

ESTIMATION - CHAPTER 6 15

STANDARD ERROR

• the variability of a population is measured by the variance (or standard deviation)

• the variability of a set of sample means obtained by repeated random samples of size N from a population is measured by the standard error of the mean, defined as...σ / √N

• the standard error IS NOT the standard deviation of individual values...it IS the standard deviation of sample means...repeated a number of times in Rosner (also discussed in BMJ handout)

• the standard error is a function of population variability and sample size
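As a quick check of the definition, the theoretical standard error for samples of N=10 from a population with σ = 500 can be compared with the spread of the 100 simulated sample means reported earlier (variance 23715.4). A minimal sketch; the function name is illustrative.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard deviation of sample means for samples of size n."""
    return sigma / sqrt(n)

se_theory = standard_error(500, 10)     # sigma / sqrt(N)
sd_of_simulated_means = sqrt(23715.4)   # from the variance of the 100 sample means

print(round(se_theory, 1), round(sd_of_simulated_means, 1))
```

The two values (about 158 and about 154) are close, as the definition suggests they should be.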

ESTIMATION - CHAPTER 6 16

CENTRAL-LIMIT THEOREM

• if a sample is large enough, the distribution of sample means can be approximated by a normal distribution...allows us to make quantitative statements about the VARIABILITY of the sample mean

• the above holds regardless of the distribution of the underlying population

ESTIMATION - CHAPTER 6 17

(from Triola)

• given...

random variable x that may or may not be distributed normally with mean µ and standard deviation σ...and...simple random samples all of the same size N

• then...

the distribution of the sample means approaches a normal distribution as the sample size increases...and...the mean of all the sample means is µ, the population mean...and...the standard deviation of all the sample means is σ / √N

ESTIMATION - CHAPTER 6 18

• practical considerations...

a common guideline is that if the original population IS NOT normally distributed, sample sizes greater than 30 result in a distribution of sample means that can be approximated well by a normal distribution...and...if the original population IS distributed normally, the sample means will be distributed normally for any size sample

ESTIMATION - CHAPTER 6 19

illustration of the central-limit theorem...sampling from a population that IS NOT normally distributed...10,000 values of X with a UNIFORM distribution...

Mean   Std Dev  Variance  Minimum  Maximum
499.9  288.2    83071.1   1        1000

ESTIMATION - CHAPTER 6 20

• use SAS to take a series of 100 simple random samples...three different sizes...N=3, N=10, N=50

• what is the distribution of sample means

• what is the variability of the sample means (use standard error of the mean and the t-distribution to construct 95% confidence intervals around the sample means)...this jumps ahead a bit in Rosner to the t-distribution and a discussion of confidence intervals
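The experiment described above can be approximated in Python. This is a sketch under assumptions, not the SAS program behind the slides: the population is 10,000 integers drawn uniformly from 1 to 1000, and the function name and seed are hypothetical. It compares the observed standard deviation of the sample means with the value σ / √N predicted by the central-limit theorem.

```python
import random
from math import sqrt
from statistics import mean, pstdev, stdev

def sample_mean_spread(n, n_samples=100, seed=20):
    """Return (mean of sample means, SD of sample means, sigma / sqrt(n))."""
    rng = random.Random(seed)
    population = [rng.randint(1, 1000) for _ in range(10_000)]  # uniform, not normal
    sigma = pstdev(population)
    sample_means = [mean(rng.sample(population, n)) for _ in range(n_samples)]
    return mean(sample_means), stdev(sample_means), sigma / sqrt(n)

for n in (3, 10, 50):
    centre, observed, predicted = sample_mean_spread(n)
    print(f"N={n:2d}  mean={centre:6.1f}  SD of means={observed:6.1f}  sigma/sqrt(N)={predicted:6.1f}")
```

As N grows, the sample means cluster more tightly around the population mean of about 500, and a histogram of them looks increasingly normal even though the population is uniform.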

ESTIMATION - CHAPTER 6 27

• repeat the process, but this time sample from a population of 10,000 that is normally distributed with a mean of 3750 and a standard deviation of 500 (see slides 12-13)

• use SAS to take a series of 100 simple random samples...three different sizes...N=3, N=10, N=50

• what is the distribution of sample means

• what is the variability of the sample means (use standard error of the mean and the t-distribution to construct 95% confidence intervals around the sample means)...this jumps ahead a bit in Rosner to the t-distribution and a discussion of confidence intervals

ESTIMATION - CHAPTER 6 34

ESTIMATING A POPULATION PROPORTION

• sample proportion used to estimate the population proportion (a point estimate, central tendency)

• normal approximation to the binomial involved in estimating precision of the point estimate

• construct a confidence interval based on a chosen confidence level (0.90, 0.95, 0.99)

• all requirements for using the normal approximation to the binomial are met (binomial: fixed number of trials, independent trials, two outcomes, P constant over trials --- and --- both NP and NQ ≥ 5)

ESTIMATION - CHAPTER 6 35

• confidence intervals use the standard error of a proportion... √(pq/n)

• margin of error...E = z * standard error

• Triola genetics example...

given 580 offspring peas and 26.2% with yellow pods

based on the above, what should one conclude about Mendel's theory that 25% of peas will have yellow pods

ESTIMATION - CHAPTER 6 36

• fixed number of trials = 580
independent trials
two outcomes (yellow/not yellow)
P constant over trials
NP = 580(0.262) = 152
NQ = 580(0.738) = 428

• standard error = √(pq/n) = √((0.262)(0.738)/580) = 0.01826
95% confidence interval, Z = 1.96
margin of error, E = 1.96 * 0.01826 = 0.03579
P-E = 0.262 - 0.03579 = 0.226
P+E = 0.262 + 0.03579 = 0.298

• 95% confidence interval... 0.226 < P < 0.298

WHAT DOES THIS MEAN? WHAT ABOUT MENDEL?
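The pea example can be recomputed directly from the standard error and margin-of-error formulas. A minimal sketch; the function name is illustrative and z = 1.96 is taken from the notes.

```python
from math import sqrt

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a population proportion."""
    se = sqrt(p_hat * (1 - p_hat) / n)   # standard error of the proportion
    margin = z * se                      # margin of error E
    return p_hat - margin, p_hat + margin

lower, upper = proportion_ci(0.262, 580)
print(round(lower, 3), round(upper, 3))

# Mendel's hypothesised 25% lies inside the interval
print(lower < 0.25 < upper)
```

Because 0.25 falls inside the interval, these data do not contradict Mendel's 25% figure.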

ESTIMATION - CHAPTER 6 38

• critical values for z in determining confidence intervals...

two-sided confidence intervals

equal area in tails of normal distribution

confidence level  area in each tail  z
90%               5%                 1.645
95%               2.5%               1.96
99%               0.5%               2.575
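These critical values can be recovered from the inverse of the standard normal cumulative distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name is illustrative. Note that the exact 99% value is closer to 2.576, while the table above uses the rounded textbook value 2.575.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical z: equal area (1 - confidence) / 2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

critical = {level: z_critical(level) for level in (0.90, 0.95, 0.99)}
for level, z in critical.items():
    print(f"{level:.0%}  z = {z:.3f}")
```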

ESTIMATION - CHAPTER 6 39

SAMPLE SIZE TO ESTIMATE A POPULATION PROPORTION

• how large should a sample be to have a pre-selected margin of error

• margin of error... E = z * standard error

• standard error... √(pq/n)

• margin of error... E = z * √(pq/n)

• solve for n... n = z² * pq / E²

• sample size is a function of z (confidence level), p and q (estimate of proportion in population), E (margin of error)

ESTIMATION - CHAPTER 6 40

EXAMPLE FROM TRIOLA

• what proportion of US households use e-mail (95% confident that the estimate is in error by no more than 4%)

• n = z² * pq / E²

• two scenarios...

previous study estimated p = 0.169
n = 1.96² * (0.169 * 0.831) / 0.04² = 337.194, rounded up to 338

no previous estimate of p (use p = 0.5)
n = 1.96² * (0.5 * 0.5) / 0.04² = 600.25, rounded up to 601
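Both scenarios follow from the same formula, with the result always rounded up to the next whole household. A minimal sketch; the function name and the default p = 0.5 for the "no previous estimate" case are illustrative choices that follow the notes.

```python
from math import ceil

def sample_size_for_proportion(margin, z=1.96, p=0.5):
    """Smallest n with margin of error <= margin; p = 0.5 is the worst case."""
    return ceil(z**2 * p * (1 - p) / margin**2)

n_with_prior = sample_size_for_proportion(0.04, p=0.169)
n_no_prior = sample_size_for_proportion(0.04)
print(n_with_prior, n_no_prior)
```

Using p = 0.5 maximises pq, so it gives the largest (most conservative) sample size for a given margin of error.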

ESTIMATION - CHAPTER 6 41

• how does sample size vary with margin of error...assuming no prior knowledge of the population proportion...

        SAMPLE SIZE
E       95%     99%

0.005   38,416  66,307
0.010   9,604   16,577
0.015   4,269   7,368
0.020   2,401   4,145
0.025   1,537   2,653
0.030   1,068   1,842
0.035   784     1,354
0.040   601     1,037
0.045   475     819
0.050   385     664

ESTIMATION - CHAPTER 6 42

• how does sample size vary with the assumed population proportion and a given margin of error...assuming margin of error is 3%...

      SAMPLE SIZE
P     95%    99%

0.1   385    664
0.2   683    1,179
0.3   897    1,548
0.4   1,025  1,769
0.5   1,068  1,842
0.6   1,025  1,769
0.7   897    1,548
0.8   683    1,179
0.9   385    664

• BOTH TABLES ASSUME A RANDOM SAMPLE THAT IS REPRESENTATIVE OF THE POPULATION --- A LARGE SAMPLE CANNOT FIX BAD SAMPLING

ESTIMATION - CHAPTER 6 43

ESTIMATING A POPULATION MEAN

• sample mean is the best estimate of the population mean (shown earlier as unbiased, minimum variance as compared to other measures of central tendency)

• two scenarios

• population variation (σ) known

unusual...you know σ, but do not know µ

• population variation (σ) unknown

more likely...no knowledge of σ or µ

ESTIMATION - CHAPTER 6 44

POPULATION VARIATION (σ) KNOWN

• all the requirements about sampling are met

• normality...remember the CENTRAL LIMIT THEOREM

population is normally distributed

population is not normally distributed, but sample size is >30

ESTIMATION - CHAPTER 6 45

• confidence intervals use the standard error of the mean...σ / √n

• margin of error...E = z * standard error

• Triola body temperature example...

given 106 measurements of temperature with a mean of 98.2°F and a known population σ of 0.62°F, what is a 95% confidence interval for µ

ESTIMATION - CHAPTER 6 46

• standard error = σ / √n = 0.62 / √106 = 0.06022
95% confidence interval, Z = 1.96
margin of error, E = 1.96 * 0.06022 = 0.11803
mean-E = 98.2 - 0.11803 = 98.08
mean+E = 98.2 + 0.11803 = 98.32

• 95% confidence interval... 98.08 < µ < 98.32

WHAT DOES THIS MEAN? WHAT DOES IT SAY ABOUT WHAT IS CONSIDERED A NORMAL TEMPERATURE, 98.6°F?
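The body temperature interval can be recomputed as follows. A minimal sketch with an illustrative function name, using the known-σ (z-based) formula from the notes.

```python
from math import sqrt

def mean_ci_known_sigma(xbar, sigma, n, z=1.96):
    """z-based confidence interval for a mean when sigma is known."""
    margin = z * sigma / sqrt(n)   # E = z * standard error
    return xbar - margin, xbar + margin

lower, upper = mean_ci_known_sigma(98.2, 0.62, 106)
print(round(lower, 2), round(upper, 2))

# 98.6 lies above the upper limit of the interval
print(upper < 98.6)
```

The conventional 98.6°F falls outside the interval, which is the point of the question posed on the slide.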

ESTIMATION - CHAPTER 6 47

SAMPLE SIZE TO ESTIMATE A POPULATION MEAN

• how large should a sample be to have a pre-selected margin oferror

• margin of error... E = z * standard error

• standard error... σ / √n

• margin of error... E = z * σ / √n

• solve for n... n = (z * σ / E)²

• sample size is a function of z (confidence level), σ (variability in the population), E (margin of error)

ESTIMATION - CHAPTER 6 48

EXAMPLE FROM TRIOLA

• sample size to estimate the mean IQ score of statistics professors with 95% confidence that the sample mean is within 2 IQ points of the population mean

z = 1.96 (95% confidence)

σ = 15 (IQ test designed to have: µ = 100, σ = 15)

E = 2 IQ points

n = (1.96 * 15 / 2)²

n = 216.09, rounded up to 217

(easy to see the effect of changing E)
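The effect of changing E is easy to explore with a small helper. A minimal sketch; the function name is illustrative, and the E = 1 case is an added comparison rather than part of the Triola example.

```python
from math import ceil

def sample_size_for_mean(sigma, margin, z=1.96):
    """Smallest n such that z * sigma / sqrt(n) <= margin."""
    return ceil((z * sigma / margin) ** 2)

n_iq = sample_size_for_mean(15, 2)            # the example above
n_iq_tighter = sample_size_for_mean(15, 1)    # halving E
print(n_iq, n_iq_tighter)
```

Halving the margin of error roughly quadruples the required sample size, because n depends on 1 / E².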

ESTIMATION - CHAPTER 6 49

POPULATION VARIATION (σ) UNKNOWN

• all the requirements about sampling are met

• normality...remember the CENTRAL LIMIT THEOREM

population is normally distributed

population is not normally distributed, but sample size is >30

• confidence intervals use the standard error of the mean, estimated from the sample...s / √n (the sample standard deviation s replaces the unknown σ)

• margin of error...E = t * standard error

ESTIMATION - CHAPTER 6 50

T-DISTRIBUTION

• central-limit theorem...sample means are distributed normally with mean µ and standard deviation (standard error) σ / √N

• 95% of all sample means over a large number of samples of size N will fall between µ - 1.96 σ / √N and µ + 1.96 σ / √N

• convert sample means to z-scores (subtract µ and divide by the standard error)

• assumes that the population standard deviation σ is known

ESTIMATION - CHAPTER 6 51

$z = (\bar{X} - \mu) / (s / \sqrt{N})$

• σ rarely known...use sample data to estimate σ

• z-scores computed using an estimate of the population standard deviation...

are NOT NORMALLY distributed

• z-scores computed using an estimate of the population standard deviation follow a t-distribution (Student's t) and there are multiple t-distributions that are a function of N, the sample size

ESTIMATION - CHAPTER 6 52

the shape of the t-distribution is symmetric and similar to that of the standard normal distribution

the shape depends on degrees of freedom (DF)

ESTIMATION - CHAPTER 6 53

the t-distribution is more spread out than the standard normal distribution

one has to move further to the left (or right) to encompass the same area (probability) as the standard normal curve

ESTIMATION - CHAPTER 6 54

as the degrees of freedom increase, the shape of the t-distribution more closely resembles that of the standard normal curve

ESTIMATION - CHAPTER 6 55

reason...as DF increase, s (the sample standard deviation) becomes a better estimate of σ (the population standard deviation)

ESTIMATION - CHAPTER 6 56

using the standard normal curve, z with p=.975 is 1.96

using the t-distribution, t with p=.975 varies with DF...

DF   t
4    2.776
9    2.262
29   2.045
60   2.000
∞    1.960

ESTIMATION - CHAPTER 6 57

• Triola body temperature example...

given 106 measurements of temperature with a mean of 98.2°F and a sample standard deviation of 0.62°F and an unknown population σ, what is a 95% confidence interval for µ

ESTIMATION - CHAPTER 6 58

• standard error = s / √n = 0.62 / √106 = 0.06022
95% confidence interval with 105 DF, t = 1.984
margin of error, E = 1.984 * 0.06022 = 0.11948
mean-E = 98.2 - 0.11948 = 98.08
mean+E = 98.2 + 0.11948 = 98.32

• 95% confidence interval... 98.08 < µ < 98.32

looks the same as the known σ example, but that is ONLY due to rounding

WHAT DOES THIS MEAN? WHAT DOES IT SAY ABOUT WHAT IS CONSIDERED A NORMAL TEMPERATURE, 98.6°F?

ESTIMATION - CHAPTER 6 59

• Triola corn example...

given 11 estimates of corn yield (pounds per acre), construct a 95% confidence interval estimate of the mean yield...

1903 1935 1910 2496 2108 1961
2060 1444 1612 1316 1511

mean = 1841.5, standard deviation = 342.7
standard error = 342.7 / √11 = 103.328

95% confidence interval with 10 DF, t = 2.228
margin of error, E = 2.228 * 103.328 = 230.215
mean-E = 1841.5 - 230.215 = 1611.285
mean+E = 1841.5 + 230.215 = 2071.715

95% confidence interval... 1611.3 < µ < 2071.7

WHAT DOES THIS MEAN?
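The corn example can be recomputed from the raw yields. This is a sketch rather than Triola's own calculation: the Python standard library has no t quantile function, so the critical value 2.228 for 10 DF is hard-coded from the notes, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

yields = [1903, 1935, 1910, 2496, 2108, 1961, 2060, 1444, 1612, 1316, 1511]
T_CRIT_10_DF = 2.228  # t for 95% confidence with n - 1 = 10 DF (from the notes)

xbar = mean(yields)
s = stdev(yields)                      # sample SD, n - 1 divisor
std_error = s / sqrt(len(yields))
margin = T_CRIT_10_DF * std_error
lower, upper = xbar - margin, xbar + margin

print(round(xbar, 1), round(s, 1))
print(round(lower, 1), round(upper, 1))
```

Small differences in the last decimal place relative to the hand calculation come from rounding the mean and standard deviation before use.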

ESTIMATION - CHAPTER 6 60

ESTIMATING POPULATION VARIABILITY

• point estimate of variance (standard deviation)

• interval estimate of variance (standard deviation)... confidence intervals

• sample size required to estimate variance (standard deviation)

ESTIMATION - CHAPTER 6 61

REQUIREMENTS

• same sampling rules apply as when estimating central tendency (mean, proportion)... random sample, sample represents the population

• normality... no Central Limit Theorem... departures from normality result in "gross" errors

ESTIMATION - CHAPTER 6 62

DISTRIBUTIONS

• central tendency... normal and t-distribution

• variability... chi-square (χ²)

repeated samples from a normal distribution

calculate the variance of each sample

the scaled sample variances, (n-1)s² / σ², follow a χ² distribution with n-1 degrees of freedom

ESTIMATION - CHAPTER 6 63

• χ² is not symmetric, shape varies by degrees of freedom

• values are 0 or positive

ESTIMATION - CHAPTER 6 64

• lack of symmetry... cannot use s² +/- E (margin of error)

• confidence interval...

(n-1)s² / χ²(right) < σ² < (n-1)s² / χ²(left)

• example in Triola... body temperatures...

n = 106, s = 0.62°F, s² = 0.3844

(106 - 1) 0.3844 / 129.561 < σ² < (106 - 1) 0.3844 / 74.222

0.31 < σ² < 0.54

0.56 < σ < 0.74
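The interval for the variance can be recomputed from the formula above. The sketch below hard-codes the two chi-square critical values used in the notes (74.222 and 129.561), because the Python standard library has no chi-square quantile function; the function name is illustrative.

```python
from math import sqrt

def variance_ci(n, s, chi2_left, chi2_right):
    """Confidence interval for sigma^2 from chi-square critical values."""
    scaled = (n - 1) * s**2
    # Dividing by the larger (right) critical value gives the lower limit
    return scaled / chi2_right, scaled / chi2_left

var_low, var_high = variance_ci(106, 0.62, chi2_left=74.222, chi2_right=129.561)
sd_low, sd_high = sqrt(var_low), sqrt(var_high)

print(round(var_low, 2), round(var_high, 2))
print(round(sd_low, 2), round(sd_high, 2))
```

The interval is not centred on s² = 0.3844, which reflects the asymmetry of the chi-square distribution.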