6 inference intervals sample size

7/28/2019 6 Inference Intervals Sample Size

Inference, confidence intervals and sample size determination

Dr James Abdey

Overview
Choosing a sample size
Estimation
Sampling distribution of $\bar{X}$
Sampling distribution properties
Sample size and sampling fraction
Central Limit Theorem
Principle of confidence intervals
Construction: CI for $\bar{X}$
Variance Known
Variance Unknown
Choosing sample size
Adjusting the statistically determined sample size
Adjusting for non-response

Applied Marketing

(Market Research Methods)

Topic 6:

Inference, confidence intervals and sample size determination

Dr James Abdey

Overview

Here we consider sample size determination in simple random sampling.

Properties of the sampling distribution are discussed.

We describe the adjustments required to statistically determined sample sizes to account for incidence and completion rates.

Non-response issues in sampling are also covered, with ways of improving response rates.

Choosing a sample size

The question "How big a sample do I need to take?" is a common one when sampling data.

The answer depends on the quality of inference that the researcher requires from the data.

In the estimation context this can be expressed in terms of the accuracy of estimation.

If the researcher requires that there should be a 95% chance that the estimation error is no bigger than d units (we refer to d as the tolerance), then this is equivalent to having a 95% confidence interval of width 2d.

Note that here d represents the half-width of the confidence interval, since the point estimate is, by construction, at the centre of the confidence interval.

Simple random sampling (SRS)

Recall that a simple random sample is a sample selected by a process such that every possible sample (of the same size, n) has the same probability of selection.

The selection process is left to chance, thus eliminating the effect of selection bias.

Due to the random selection mechanism, we do not know (in advance) which sample will occur.

Every population element has a known, non-zero probability of selection in the sample, but no element is certain to appear.

Simple random sampling (SRS)

Example

Consider a population of size N = 6 elements: A, B, C, D, E and F.

We consider all possible samples of size n = 2 (without replacement).

There are 15 different, but equally likely, such samples:

AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, EF

Since this is SRS, each sample has a probability of selection of 1/15.
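The enumeration above can be reproduced with a short sketch. The Python snippet below is illustrative only (the variable names are our own, not from the slides); it lists every size-2 sample drawn without replacement from the six-element population and confirms that each is equally likely.

```python
from fractions import Fraction
from itertools import combinations

population = ["A", "B", "C", "D", "E", "F"]  # N = 6 elements
n = 2  # sample size

# All unordered samples of size n, drawn without replacement
samples = ["".join(s) for s in combinations(population, n)]

# Under SRS every possible sample has the same selection probability
prob_each = Fraction(1, len(samples))

print(samples)
print(len(samples), prob_each)
```

Using exact fractions avoids any rounding when the probability is reported.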

Estimation

A population has particular characteristics of interest such as the mean, variance etc.

Collectively we refer to these characteristics as parameters.

If we do not have population data, the parameter values will be unknown.

Statistical inference is the process of estimating the (unknown) parameter values using the (known) sample data.

We use a statistic (estimator) calculated from sample observations to provide a point estimate.

Estimation Example

Returning to our example, recall there are 15 different samples of size 2 from a population of size 6.

Suppose the variable of interest is income.

Individual          A   B   C   D   E   F
Income (in 000s)    3   6   4   9   7   7

If we seek the population mean, $\mu$, we will use the sample mean, $\bar{X}$, as our estimator:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

For example, if the observed sample was AB, the sample mean is (3000 + 6000)/2 = 4,500.

Estimation Example

Clearly, different observed samples will lead to different sample means.

Consider $\bar{X}$ for all possible samples (in 000s):

Sample      AB    AC    AD    AE    AF    BC    BD    BE
Values      3,6   3,4   3,9   3,7   3,7   6,4   6,9   6,7
$\bar{X}$   4.5   3.5   6.0   5.0   5.0   5.0   7.5   6.5

Sample      BF    CD    CE    CF    DE    DF    EF
Values      6,7   4,9   4,7   4,7   9,7   9,7   7,7
$\bar{X}$   6.5   6.5   5.5   5.5   8.0   8.0   7.0

So $\bar{X}$ values vary from 3.5 to 8, depending on the sample values.

The previous slide showed all possible values of the estimator $\bar{X}$.

Since we have the population data here, we can actually compute the population mean (in 000s):

$$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i = \frac{3 + 6 + 4 + 9 + 7 + 7}{6} = 6$$

So even with SRS, we obtain some $\bar{X}$ values far from $\mu$.

Here only one sample (AD) results in $\bar{X} = \mu$.

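Let's now consider the maximum estimation error, $|\bar{X} - \mu|$, and the number of samples whose error is no bigger than each value:

max $|\bar{X} - \mu|$    Number of samples    Probability
0                        1                    0.067
0.5                      6                    0.400
1                        10                   0.667
1.5                      12                   0.800
2                        14                   0.933
2.5                      15                   1.000

So, for example, there is an 80% chance of being within 1.5 units of $\mu$.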
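The cumulative probabilities in this table can be checked by brute force. The sketch below is an illustration under our own naming (the helper `prob_within` is not from the slides); it computes the absolute error of every size-2 sample mean and counts how many fall within a given tolerance.

```python
from fractions import Fraction
from itertools import combinations

incomes = {"A": 3, "B": 6, "C": 4, "D": 9, "E": 7, "F": 7}  # in 000s
mu = Fraction(sum(incomes.values()), len(incomes))  # population mean

# Absolute estimation error of the sample mean for each of the 15 samples
errors = [abs(Fraction(incomes[a] + incomes[b], 2) - mu)
          for a, b in combinations(incomes, 2)]

def prob_within(d):
    """Probability that the sample mean lies within d units of mu."""
    return Fraction(sum(e <= d for e in errors), len(errors))

for d in (0, 0.5, 1, 1.5, 2, 2.5):
    print(d, round(float(prob_within(Fraction(d))), 3))
```

Reading the tolerance d as the half-width of an interval centred on the estimate links this table to the sample size question posed earlier.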

We now represent this as a frequency distribution.

That is, we record the frequency of each possible value of $\bar{X}$.

$\bar{X}$   Frequency   Relative frequency
3.5         1           1/15 = 0.067
4.5         1           1/15 = 0.067
5.0         3           3/15 = 0.200
5.5         2           2/15 = 0.133
6.0         1           1/15 = 0.067
6.5         3           3/15 = 0.200
7.0         1           1/15 = 0.067
7.5         1           1/15 = 0.067
8.0         2           2/15 = 0.133

This is known as the sampling distribution of $\bar{X}$.

The sampling distribution is a central and vital concept in statistics.

It can be used to evaluate how good an estimator is.

Specifically, we care about how close the estimator is to the population parameter of interest.

As we have seen, different samples yield different $\bar{X}$ values, as a consequence of the random sampling procedure.

Hence estimators (of which $\bar{X}$ is an example) are random variables.

So, $\bar{X}$ is our estimator of $\mu$.

The observed value of $\bar{X}$ is a point estimate.

Sampling distribution properties

Like any distribution, we care about a sampling distribution's mean and variance.

Together, these let us assess how good an estimator is.

First, consider the mean: we seek an estimator which does not mislead us systematically.

So the average (mean) value of an estimator, over all possible samples, should be equal to the population parameter.

Returning to our example:

$\bar{X}$   Frequency   Product
3.5         1           3.5
4.5         1           4.5
5.0         3           15.0
5.5         2           11.0
6.0         1           6.0
6.5         3           19.5
7.0         1           7.0
7.5         1           7.5
8.0         2           16.0
Total       15          90.0

Hence the mean of this sampling distribution is 90/15 = 6.


Sampling distribution properties

An important difference between a sampling distribution and other distributions is that the values in a sampling distribution are summary measures of whole samples (i.e. statistics/estimators) rather than individual observations.

Formally, the mean of a sampling distribution is called the expected value of the estimator, denoted by $E[\cdot]$.

Hence the expected value of the sample mean is $E[\bar{X}]$.

An unbiased estimator has its expected value equal to the parameter being estimated.

For our example, $E[\bar{X}] = 6 = \mu$.


Fortunately the sample mean $\bar{X}$ is always an unbiased estimator of $\mu$ in SRS, regardless of the sample size, n, and the distribution of the (parent) population.

This is a good illustration of a population parameter, $\mu$, being estimated by its sample counterpart, $\bar{X}$.


The unbiasedness of an estimator is clearly desirable; however, we also need to take into account the dispersion of the estimator's sampling distribution.

Ideally, the possible values of the estimator should not vary much around the true parameter value.

So, we seek an estimator with a small variance.

Recall the variance is defined to be the mean of the squared deviations about the mean of the distribution.

In the case of sampling distributions, it is referred to as the sampling variance.


Returning to our example:

$\bar{X}$   $\bar{X} - \mu$   $(\bar{X} - \mu)^2$   Frequency   Product
3.5         -2.5              6.25                  1           6.25
4.5         -1.5              2.25                  1           2.25
5.0         -1.0              1.00                  3           3.00
5.5         -0.5              0.25                  2           0.50
6.0         0.0               0.00                  1           0.00
6.5         0.5               0.25                  3           0.75
7.0         1.0               1.00                  1           1.00
7.5         1.5               2.25                  1           2.25
8.0         2.0               4.00                  2           8.00
Total                                               15          24.00

Hence the sampling variance is 24/15 = 1.6.
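As a cross-check on the two tables (the products summing to 90 and 24), the expected value and sampling variance can be computed directly from all 15 samples. This is a minimal sketch with our own variable names, using exact fractions.

```python
from fractions import Fraction
from itertools import combinations

incomes = [3, 6, 4, 9, 7, 7]  # population values, in 000s

# Sample mean of every possible sample of size 2 (15 samples)
means = [Fraction(a + b, 2) for a, b in combinations(incomes, 2)]

# Expected value: average of the estimator over all equally likely samples
expected = sum(means) / len(means)

# Sampling variance: mean squared deviation about the expected value
sampling_var = sum((m - expected) ** 2 for m in means) / len(means)

print(expected, float(sampling_var))
```

The expected value matches the population mean of 6, which is the unbiasedness property in action for this small population.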


The population itself has a variance: the population variance, $\sigma^2$.

$X$     $X - \mu$   $(X - \mu)^2$   Frequency   Product
3       -3          9               1           9
6       0           0               1           0
4       -2          4               1           4
9       3           9               1           9
7       1           1               2           2
Total                               6           24

Hence the population variance is $\sigma^2 = 24/6 = 4$.


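Sampling distribution properties

We now consider the relationship between $\sigma^2$ and the sampling variance.

Intuitively, a larger $\sigma^2$ should lead to a larger sampling variance. Why?

For population size N and sample size n,

$$\mathrm{Var}(\bar{X}) = \frac{N - n}{N - 1} \cdot \frac{\sigma^2}{n}$$

So for our example,

$$\mathrm{Var}(\bar{X}) = \frac{6 - 2}{6 - 1} \cdot \frac{4}{2} = 1.6$$

We use the term standard error to refer to the standard deviation of the sampling distribution,

$$\mathrm{S.E.}(\bar{X}) = \sqrt{\mathrm{Var}(\bar{X})} = \sqrt{\frac{N - n}{N - 1} \cdot \frac{\sigma^2}{n}} = \sigma_{\bar{X}}$$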
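The formula is easy to wrap in a small helper. The function names below are hypothetical (chosen for this sketch, not taken from the slides); the example call plugs in the values of the running example, N = 6, n = 2 and a population variance of 4.

```python
import math

def sampling_variance(sigma2, n, N):
    """Variance of the sample mean under SRS without replacement,
    including the finite population correction (N - n) / (N - 1)."""
    return (N - n) / (N - 1) * sigma2 / n

def standard_error(sigma2, n, N):
    """Standard deviation of the sampling distribution of the sample mean."""
    return math.sqrt(sampling_variance(sigma2, n, N))

var_xbar = sampling_variance(sigma2=4, n=2, N=6)
se_xbar = standard_error(sigma2=4, n=2, N=6)
print(round(var_xbar, 4), round(se_xbar, 4))
```

The variance agrees with the value of 1.6 obtained by enumerating all 15 samples.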


Implications:

As the sample size, n, increases, the sampling variance decreases, i.e. the precision increases. (Although greater precision is desirable, data collection costs will rise with n; remember why we sample in the first place!)

Provided the sampling fraction, n/N, is small, the term

$$\frac{N - n}{N - 1} \approx 1$$

so it can be ignored, and the precision depends effectively on n only.


Sample size and sampling fraction

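The larger the sample, the less variability there will be between samples.

Frequency of each possible value of $\bar{X}$ (a dash means the value cannot occur):

$\bar{X}$   n = 2   n = 4
3.50        1       -
4.50        1       -
5.00        3       2
5.25        -       1
5.50        2       1
5.75        -       3
6.00        1       1
6.25        -       2
6.50        3       3
6.75        -       1
7.00        1       -
7.25        -       1
7.50        1       -
8.00        2       -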


There is a striking improvement in the precision of the estimator.

The variability has decreased considerably.

The range of possible $\bar{X}$ values goes from 3.5 to 8.0 (n = 2) down to 5.0 to 7.25 (n = 4).

The sampling variance is reduced from 1.6 to 0.4.

Note that precision in statistics refers to the inverse of the sampling variance.
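The comparison between n = 2 and n = 4 can be verified by enumerating both sampling distributions. The sketch below uses helper names of our own choosing and the income values from the running example.

```python
from fractions import Fraction
from itertools import combinations

incomes = [3, 6, 4, 9, 7, 7]  # population values, in 000s

def sampling_distribution(n):
    """Sample means of all size-n samples drawn without replacement."""
    return [Fraction(sum(s), n) for s in combinations(incomes, n)]

def variance(values):
    """Mean squared deviation about the mean of the values."""
    centre = sum(values) / len(values)
    return sum((v - centre) ** 2 for v in values) / len(values)

for n in (2, 4):
    means = sampling_distribution(n)
    print(n, float(min(means)), float(max(means)), float(variance(means)))
```

Both sample sizes happen to give 15 possible samples here, because choosing 4 of 6 elements is the same as choosing which 2 to leave out.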


The factor $\frac{N - n}{N - 1}$ decreases steadily as $n \to N$.

When n = 1 the factor equals 1, and when n = N it equals 0.

When sampling without replacement, increasing n must increase precision since less of the population is left out.

In much practical sampling N is very large (e.g. several million), while n is comparatively small (e.g. at most 1,000, say).

Therefore in such cases the factor $\frac{N - n}{N - 1}$ is so close to 1 that its effect is negligible, hence

$$\mathrm{Var}(\bar{X}) = \frac{N - n}{N - 1} \cdot \frac{\sigma^2}{n} \approx \frac{\sigma^2}{n} \quad \text{for small } n/N$$


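n/N is called the sampling fraction.

When N is large, it is the sample size n which is important in determining precision, not the sampling fraction.

Consider two populations: $N_1$ = 3 million and $N_2$ = 200 million, both with the same variance $\sigma^2$.

We sample $n_1 = n_2 = 1{,}000$ from each population, then

$$\sigma^2_{\bar{X}_1} = \frac{N_1 - n_1}{N_1 - 1} \cdot \frac{\sigma^2}{n_1} = (0.999667)\,\frac{\sigma^2}{1000}$$

$$\sigma^2_{\bar{X}_2} = \frac{N_2 - n_2}{N_2 - 1} \cdot \frac{\sigma^2}{n_2} = (0.999995)\,\frac{\sigma^2}{1000}$$

So $\sigma^2_{\bar{X}_1} \approx \sigma^2_{\bar{X}_2}$, despite $N_1$ being much smaller than $N_2$.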
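The two correction factors quoted above follow directly from the formula. A brief sketch, where the function name `fpc` is our own shorthand:

```python
def fpc(N, n):
    """Finite population correction factor (N - n) / (N - 1)."""
    return (N - n) / (N - 1)

n = 1_000
for N in (3_000_000, 200_000_000):
    print(N, round(fpc(N, n), 6))
```

Both factors are within a fraction of a percent of 1, so the two sampling variances are practically identical even though the population sizes differ by a factor of about 67.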



Central Limit Theorem

When sampling from (almost) any non-normal distribution, for sufficiently large n, $\bar{X}$:

1. is approximately normally distributed
2. has mean $\mu$
3. has variance $\sigma^2/n$ and standard error $\sigma/\sqrt{n}$

The approximation is reasonable for n at least 30, as a rule-of-thumb.

However, because this is an asymptotic approximation (i.e. as $n \to \infty$), the bigger n is, the better the normal approximation.

Special case: if the population distribution is itself Normal, $\bar{X}$ will have an exact Normal distribution for any sample size n.
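A small simulation can make the theorem concrete. The sketch below is an illustration under assumptions of our own: it draws independent observations from an exponential population (mean 1, variance 1), which stands in for a very large skewed population rather than a finite one sampled without replacement, and the seed, sample size and number of repetitions are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n = 30        # sample size (the rule-of-thumb threshold)
reps = 5_000  # number of repeated samples

# Exponential population with rate 1: mean 1, variance 1, strongly skewed
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

print(statistics.fmean(sample_means))      # should be close to mu = 1
print(statistics.pvariance(sample_means))  # should be close to 1 / n
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the population itself is far from Normal.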


Below is the sampling distribution of $\bar{X}$ for small (red) and large (black) n.

As n increases, the sampling variability of $\bar{X}$ decreases.

[Figure: Sampling Distribution of Sample Mean. Horizontal axis: Sample mean; vertical axis: Density.]


Although the shape of the population distribution does not affect the generality of the CLT result, it does affect the speed of convergence of the sampling distribution of $\bar{X}$ to the Normal distribution.

A symmetric population distribution will typically converge faster as n increases.

In practice, n = 30 is usually adequate to make the Normal approximation reasonable.


Remember the CLT is based on SRS.

Without probability sampling methods, there is absolutely no basis for the use of the CLT.

This is principally why we insist on probability (random) sampling.

Otherwise the whole structure of statistical inference collapses!


The CLT also makes the use of the variance more reasonable.

The Normal distribution is completely characterised by its mean and variance.

Hence it is sensible to focus attention on these two characteristics of the sampling distribution.


Principles of confidence intervals

A point estimate is our best guess of an unknown population parameter based on sample data.

But as it's based on a sample, there is some uncertainty/imprecision.

Confidence intervals (CIs) communicate the level of imprecision.


Formally, an x% confidence interval covers the unknown parameter with x% probability over repeated samples.

The shorter the confidence interval, the more reliable the estimate.

As we shall see, a shorter interval is achievable by reducing the level of confidence or by increasing the sample size.

We now look at how to construct CIs.


The general format (for our purposes) for a confidence interval is

$$\text{statistic} \pm (\text{multiplier coefficient}) \times \text{standard error}$$

Alternatively,

$$\text{estimate} \pm \text{margin of error}$$


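CI for $\mu$ (variance known)

The point estimate for $\mu$ is calculated using

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Assuming the (population) variance $\sigma^2$ is known, the standard error of $\bar{X}$ is

$$\mathrm{S.E.}(\bar{X}) = \sigma_{\bar{X}} = \sqrt{\frac{N - n}{N - 1} \cdot \frac{\sigma^2}{n}} \approx \frac{\sigma}{\sqrt{n}}$$

Hence a 95% confidence interval for $\mu$ is

$$\bar{X} \pm 1.96\,\frac{\sigma}{\sqrt{n}} = \left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}},\; \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right)$$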


This is a simple, but important result, forming a useful template.

Note the above interval was for 95% confidence.

Other levels of confidence pose no problem, but require a different multiplier coefficient.

When the variance ($\sigma^2$) is known we obtain a multiplier from the standard normal distribution.

For 90% confidence, use the multiplier 1.645.

Hence a 99% confidence interval for $\mu$ is

$$\bar{X} \pm 2.576\,\frac{\sigma}{\sqrt{n}} = \left(\bar{X} - 2.576\,\frac{\sigma}{\sqrt{n}},\; \bar{X} + 2.576\,\frac{\sigma}{\sqrt{n}}\right)$$
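The multipliers and the interval template can be sketched in a few lines. The function names and the example inputs below are hypothetical, and the sketch assumes a small sampling fraction so that the finite population correction is ignored; the multiplier comes from the standard normal quantile function in Python's standard library.

```python
import math
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided standard normal multiplier, e.g. 0.95 gives about 1.96."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def ci_known_variance(xbar, sigma, n, confidence=0.95):
    """Confidence interval for the mean with known sigma (no finite
    population correction)."""
    margin = z_multiplier(confidence) * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

for level in (0.90, 0.95, 0.99):
    print(level, round(z_multiplier(level), 3))

# Hypothetical inputs, for illustration only
print(ci_known_variance(xbar=50, sigma=10, n=100, confidence=0.99))
```

Raising the confidence level widens the interval for the same data, which is the trade-off between confidence and interval width.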



So we see that a higher level of confidence (a good thing) leads to a larger multiplier coefficient, and hence a wider confidence interval (a bad thing).

Hence, other things equal, we face a trade-off between level of confidence and width of confidence interval.

Since the width of a CI is part-determined by the standard error, by increasing n (costly) we will reduce the standard error, hence shorten the CI (a good thing).



CI for $\mu$ (variance unknown)

Unfortunately, the approach just discussed requires knowledge of the population variance, $\sigma^2$.

This is because it is used in the standard error:

$$\bar{X} \pm z\,\frac{\sigma}{\sqrt{n}}$$

In practice, we are unlikely to know $\sigma^2$.

After all, it's a population characteristic, and so if we do not know $\mu$, why would we know $\sigma^2$?



Recall the sampling variance of $\bar{X}$ is

$$\mathrm{Var}(\bar{X}) = \sigma^2_{\bar{X}} = \frac{N - n}{N - 1} \cdot \frac{\sigma^2}{n} \approx \frac{\sigma^2}{n}$$

But if $\sigma^2$ is unknown we have a problem.

It is not that we are fundamentally interested in $\sigma^2$, only that we need to estimate it because the precision of $\bar{X}$ depends on it.

And there is little point having a point estimate if we know nothing about its precision.



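Our estimate of $\mathrm{Var}(\bar{X})$ is

$$s^2_{\bar{X}} = \frac{N - n}{N} \cdot \frac{s^2}{n} \approx \frac{s^2}{n}$$

where

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n - 1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)$$

Our estimate of the standard error is thus

$$s_{\bar{x}} = \sqrt{\frac{N - n}{N} \cdot \frac{s^2}{n}} \approx \frac{s}{\sqrt{n}}$$

since typically $\frac{N - n}{N} \approx 1$ in the social sciences.

Once we have estimated this, we proceed as before to construct a CI using the estimate of the standard error in place of the actual standard error.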



So, for a 90% confidence interval we use

$$\bar{x} \pm 1.645\,\frac{s}{\sqrt{n}}$$

Similarly, for a 95% confidence interval we use

$$\bar{x} \pm 1.96\,\frac{s}{\sqrt{n}}$$

Finally, for a 99% confidence interval we use

$$\bar{x} \pm 2.576\,\frac{s}{\sqrt{n}}$$
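These intervals can be computed from raw data as sketched below. The function name and the observations are made up for illustration; the sketch follows the slides in using the Normal multiplier with the sample standard deviation, and it assumes the sampling fraction is small enough for the finite population correction to be dropped.

```python
import math
import statistics
from statistics import NormalDist

def ci_unknown_variance(data, confidence=0.95):
    """CI for the mean using s in place of sigma (Normal multiplier,
    no finite population correction)."""
    n = len(data)
    xbar = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation, divisor n - 1
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Made-up observations, purely to exercise the function
data = [4.1, 5.3, 6.0, 5.8, 4.9, 6.4, 5.1, 5.6]
print(ci_unknown_variance(data, confidence=0.90))
```

Note that `statistics.stdev` uses the n - 1 divisor, which is the same definition as the sample variance used for the estimated standard error.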



Choosing sample size

Note the trade-off between accuracy and data cost.

Solution: fix the desired precision and find the smallest n which achieves this.

If we want the sample mean to be within a tolerance d of $\mu$ with a specified probability, then

$$d = z\,\frac{\sigma}{\sqrt{n}} \quad \Rightarrow \quad n = \frac{z^2 \sigma^2}{d^2}$$

n is the minimum sample size required to achieve the desired precision.

n must be an integer, so always round up!



Choosing sample size: Example

A random sample is to be taken from a population with unknown mean $\mu$ and $\sigma = 3$.

How big a sample size would be needed if there is to be a 95% chance of $\bar{X}$ being within 1 unit of $\mu$?

The sample size n required for a tolerance of 1 satisfies

$$1 = 1.96 \times \frac{3}{\sqrt{n}} \quad \Rightarrow \quad n = (1.96 \times 3)^2 = 34.57 \quad \Rightarrow \quad n = 35$$

Note that the required sample size in this type of calculation needs to be rounded up from a decimal fraction, since rounding down would result in a value not quite large enough!
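The calculation generalises to any tolerance and confidence level. In the sketch below the function name and its optional `z` argument are our own conventions; passing `z=1.96` reproduces the rounded multiplier used in the worked example, while omitting it uses the exact standard normal quantile.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, d, confidence=0.95, z=None):
    """Smallest n such that z * sigma / sqrt(n) is at most d."""
    if z is None:
        z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Round up: rounding down would give a margin slightly larger than d
    return math.ceil((z * sigma / d) ** 2)

print(required_sample_size(sigma=3, d=1, z=1.96))
```

Halving the tolerance roughly quadruples the required sample size, because n is proportional to 1/d².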



Adjusting the statistically determined sample size

Incidence rate refers to the rate of occurrence, or the percentage, of persons eligible to participate in the study.

In general, if there are k qualifying factors with an incidence of $Q_1, Q_2, Q_3, \ldots, Q_k$, each expressed as a proportion:

$$\text{Incidence rate} = Q_1 \times Q_2 \times Q_3 \times \cdots \times Q_k$$

The completion rate is the percentage of qualified respondents who complete the interview, enabling researchers to account for anticipated refusals by people who qualify.

$$\text{Initial sample size} = \frac{\text{Final sample size}}{\text{Incidence rate} \times \text{Completion rate}}$$
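A short sketch of this adjustment follows. The function name, the example rates and the decision to round the result up to a whole number of contacts are all assumptions made for illustration; they are not figures from the slides.

```python
import math

def initial_sample_size(final_n, qualifying_rates, completion_rate):
    """Number of people to contact in order to end up with final_n
    completed interviews."""
    # Incidence rate is the product of the qualifying-factor proportions
    incidence_rate = math.prod(qualifying_rates)
    return math.ceil(final_n / (incidence_rate * completion_rate))

# Hypothetical figures: two qualifying factors and a 75% completion rate
print(initial_sample_size(final_n=100,
                          qualifying_rates=[0.5, 0.25],
                          completion_rate=0.75))
```

With a low incidence rate the initial sample can be many times larger than the statistically determined final sample, which matters for fieldwork budgeting.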



Adjusting for non-response

Sub-sampling of non-respondents: the researcher contacts a sub-sample of the non-respondents, usually by means of telephone or personal interviews.

In replacement, the non-respondents in the current survey are replaced with non-respondents from an earlier, similar survey.

The researcher attempts to contact these non-respondents from the earlier survey and administer the current survey questionnaire to them, possibly by offering a suitable incentive.



In substitution, the researcher substitutes for non-respondents other elements from the sampling frame that are expected to respond.

The sampling frame is divided into sub-groups that are internally homogeneous in terms of respondent characteristics but heterogeneous in terms of response rates.

These sub-groups are then used to identify substitutes who are similar to particular non-respondents but dissimilar to respondents already in the sample.



Subjective estimates: when it is no longer feasible to increase the response rate by sub-sampling, replacement, or substitution, it may be possible to arrive at subjective estimates of the nature and effect of non-response bias.

This involves evaluating the likely effects of non-response based on experience and available information.

Imputation



Imputation involves imputing, or assigning, the characteristic of interest to the non-respondents based on the similarity of the variables available for both non-respondents and respondents.

For example, a respondent who does not report brand usage may be imputed the usage of a respondent with similar demographic characteristics.