Sample Size Estimation in Prevalence Studies


SPECIAL ARTICLE


Ravindra Arya & Belavendra Antonisamy & Sushil Kumar

Received: 8 August 2011 / Accepted: 5 April 2012 / Published online: 6 May 2012
© Dr. K C Chaudhuri Foundation 2012

Abstract Estimation of appropriate sample size for prevalence surveys presents many challenges, particularly when the condition is very rare or has a tendency for geographical clustering. Sample size estimate for prevalence studies is a function of expected prevalence and precision for a given level of confidence expressed by the z statistic. Choice of the appropriate values for these variables is sometimes not straightforward. Certain other situations do not fulfil the assumptions made in the conventional equation and present a special challenge. These situations include, but are not limited to, smaller population size in relation to sample size, sampling technique or missing data. This paper discusses practical issues in sample size estimation for prevalence studies with an objective to help clinicians and healthcare researchers make more informed decisions whether reviewing or conducting such a study.

Keywords Sample size estimation · Prevalence · Precision

Introduction

Sample size estimation is a key issue in the design of most studies. In a study conducted to estimate the prevalence of a given condition in a geographic area, the objective is to sample sufficient population to get an adequate number of subjects correctly classified as having the condition of interest or not, with a given confidence about the amount to which this estimate might be affected by sampling error. Several factors may potentially contribute to difficulty in this regard, including misclassification of diseased and healthy subjects due to limitations of the diagnostic test, the given condition being very common or uncommon, or limitations of the sampling technique. This paper discusses the issues involved in sample size estimation for prevalence studies with an objective to provide practically useful information to clinicians and healthcare researchers.

R. Arya
Comprehensive Epilepsy Centre, Division of Neurology, Cincinnati Children’s Hospital Medical Centre, Cincinnati, OH, USA

B. Antonisamy
Department of Biostatistics, Christian Medical College, Vellore, Tamil Nadu, India

S. Kumar
Division of Basic and Translational Research, Department of Surgery, University of Minnesota, Minneapolis, MN, USA

R. Arya (*)
MLC 2015, Cincinnati Children’s Hospital Medical Centre, 3333 Burnet Avenue, Cincinnati, OH 45229, USA
e-mail: [email protected]

Indian J Pediatr (November 2012) 79(11):1482–1488
DOI 10.1007/s12098-012-0763-3

Let us say our objective is to determine the prevalence of a disease condition in the population living in an isolated geographic area. Here, one simply cannot perform the diagnostic test on the whole population. Such an approach would be impractical for logistic or financial reasons, and ethically unacceptable. Moreover, there would still be misclassification because the predictive values of the test would not be 100 %. So our target is to find a sufficient number of people, sampled at random from this population, who, if subjected to diagnostic testing, would yield an estimate of the prevalence of the condition in the given population with some confidence. It can be argued that if the condition is very rare in the given population, a larger sample would be required to yield a sufficient number of ‘cases’ and ‘non-cases’; if it is relatively common we might need a rather small number.¹ Hence, sample size n is a function of the expected prevalence P of the disease in the population. Secondly, we have to decide how precisely we want to estimate this prevalence. This is usually measured as the amount of acceptable (or allowable) margin of error in the prevalence estimate. This is a random sampling error which decreases on repeated trials. Hence, n is inversely related to the allowable error d, which is a surrogate measure of precision. These relationships can be expressed by a simple equation.

$$n = \frac{z^2 \, P(1-P)}{d^2}$$

where n = sample size, z = z statistic for the level of confidence, P = expected prevalence and d = allowable error. This formula assumes that P and d are decimal values, but it also holds if they are percentages, except that the term (1 − P) in the numerator becomes (100 − P). Although the equation is straightforward (Box 1), several practical issues arise in the choice of values for z, P and d.
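The equation above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_size`, its argument names and the choice to round up to the next whole subject are assumptions made here, not part of the original article.

```python
import math


def sample_size(p, d, z=1.96):
    """Return n = z^2 * P * (1 - P) / d^2, with P and d given as decimal fractions."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if d <= 0:
        raise ValueError("d must be positive")
    return (z ** 2) * p * (1 - p) / (d ** 2)


# Illustrative inputs (assumed): expected prevalence 30 %, allowable error 5 %.
n_raw = sample_size(p=0.30, d=0.05)
n = math.ceil(n_raw)  # round up, since a fraction of a subject cannot be sampled
print(f"raw n = {n_raw:.2f}, rounded up = {n}")
```

With P = 0.5 and d = 0.05 the same function gives 384.16, which rounds up to the figure of 385 used later in the text.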

Precision

Precision of any measurement system is the degree to which repeated measurements under unchanged conditions generate similar results. It captures the extent of reproducibility or repeatability of measurements under identical measuring conditions. There are different ways of expressing precision, e.g., in arithmetic it is usually expressed by the last significant place. To illustrate this, consider a value expressed as ‘8’. It would mean that the measurement has been made with a precision of 1 (the measuring instrument was able to measure only down to the units place), whereas a value of 8.0, though numerically equal to 8, would mean that the value at the first decimal place was measured and was found to be zero. The second value is more precise. When measurements are repeated and averaged, the standard error of the mean serves as an indicator of precision. In prevalence studies, results might be stated as “the prevalence of this condition was found in this sample to be 30 % [95 % confidence interval (CI) 20 to 40 %]”. This statement can be interpreted to mean that, if we were to repeat our study 100 times and compute a 95 % CI each time, about 95 of these intervals would be expected to contain the true prevalence (frequentist interpretation), or, more loosely, that in this particular study we are 95 % confident that the ‘true’ prevalence lies between 20 and 40 % (strictly speaking, a Bayesian interpretation). Sample size estimation for prevalence surveys has 2 inter-related measures of precision, which are discussed subsequently: CI and the allowable margin of error d.

Z Statistic

Whenever we make inferences about a population from a sample, an element of uncertainty is introduced. One way to quantify this uncertainty is the use of CI. The Z statistic indicates the number of standard deviations an observation is above or below the ‘population’ mean. It captures the level of confidence, assuming a normal distribution. For the conventional 95 % confidence level, the z value is 1.96, since 95 % of a normal distribution would lie within ±1.96 standard deviations on either side of the mean. The value of z is chosen by the investigator according to the desired level of confidence.
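For confidence levels other than 95 %, the z value can be looked up from the standard normal distribution. A minimal sketch using Python's standard library is shown below; the helper name `z_for_confidence` is an assumption introduced here for illustration.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided z value for a given confidence level, e.g. 0.95 -> about 1.96."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must lie strictly between 0 and 1")
    alpha = 1 - confidence
    # The (1 - alpha/2) quantile of the standard normal distribution.
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_for_confidence(level):.3f}")
```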

Allowable Margin of Error (d)

Allowable error is the maximum risk in the sample size estimate acceptable to the clinician and should be decided a priori. In any protocol or manuscript, it should be stated explicitly along with the basis for its choice. Conventionally, an ‘absolute’ allowable error margin d of ±5 % is chosen, but, as is common in clinical practice, if the expected prevalence P is <10 %, the 95 % confidence boundaries may cross 0, which is impractical. Hence, for an expected prevalence between 10 and 90 % (P = 0.1 to 0.9), d of ±5 % might be a reasonable choice. But for rare (P < 0.1) or very common (P > 0.9) conditions, d should be chosen as a ‘relative’ value with respect to the expected prevalence P. A common recommendation is to set d = P/2 for rare and d = (1 − P)/2 for very common conditions [1]. The choice of a ‘relative’ allowable error, as opposed to an absolute value, does not depend on the expected prevalence, and choosing it for mid-range values of P is also a valid approach. Investigators may sometimes wish to choose a smaller d, which would increase the sample size considerably, since n is inversely proportional to $d^2$. Conversely, if the sample size is impractical, the investigator may be tempted to increase the allowable error and get a smaller n. Instead, the researcher could state that due to resource constraints, the ideal sample size cannot be met and hence a smaller and pragmatic sample is investigated. Also, a larger d might invalidate the assumption of normal approximation. It should be appreciated that the allowable margin of error in the sample size estimate is a conscious choice of the investigator, whereas the element of chance is captured by the chosen CI or Z statistic.

¹ If a trait is too common, we would again run into problems, as we would now be short of ‘non-cases’. We will address this in a while.

Expected Prevalence (P)

This might seem like a paradox of sorts. Our objective in conducting the study is to determine the prevalence of a particular condition in a particular population, i.e., it is this P which we are out to find. But to do so, we need to have a prior idea of it! Mostly, this idea can be had from prior studies. Usually, the previous studies will give a range of P and not a single number. It has been suggested that one should err towards P = 0.5 in these situations [2]. That is, if the range provided by previous studies is 30–40 %, then P should be set at 0.4, whereas if the range is 70–80 %, it is better to take P = 0.7, i.e., the value nearer to 0.5 or 50 %. This recommendation is based on the fact that choosing a value nearer to 0.5 leads to the largest n, within certain limitations (Fig. 1).
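The rule of erring towards P = 0.5 can be expressed as a small helper. The function name `conservative_prevalence` and the handling of a range that straddles 0.5 are assumptions made for this sketch.

```python
def conservative_prevalence(low, high):
    """Pick the value in [low, high] closest to 0.5, which maximises P * (1 - P)."""
    if low > high:
        low, high = high, low
    if low <= 0.5 <= high:
        return 0.5  # assumed behaviour when the reported range includes 0.5
    return low if abs(low - 0.5) < abs(high - 0.5) else high


print(conservative_prevalence(0.30, 0.40))  # the end of the range nearer to 0.5
print(conservative_prevalence(0.70, 0.80))
```

For the two ranges quoted in the text, this returns 0.4 and 0.7 respectively.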

Suppose one is doing the first ever prevalence study for a particular condition in a given population; then there is no previous study to help estimate P. In such a situation, some authors [3] recommend that n may be calculated using P = 0.5. As ascertained from Fig. 1, this contention is valid if P lies between 10 and 90 %, as it will give the largest estimate for n. However, for rare (P < 0.1) and very common (P > 0.9) conditions, the sample size estimated with an assumption of P = 0.5 is likely to be unsuitable. Suppose we plug values of P = 0.5, d = 0.05 in our formula and obtain n = 385 (Fig. 1). This will be the empirical sample size estimate for all purposes, when we make a blind assumption of P = 0.5. Assume that we are dealing with a rare condition having a true prevalence of only 1 %, that is to say P = 0.01. In such a case, our sample is likely to capture only 3 or 4 cases. There is a finite probability that it may not capture even a single case! On the contrary, if we are dealing with a very common condition, say one whose prevalence is 99 % (P = 0.99), then we will not be able to capture sufficient non-cases with our sample size and, as above, we may indeed capture none. In either case, the assumption of normal approximation is invalidated.

Box 1 The missing α
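The rare-condition scenario described above (n = 385, true prevalence 1 %) can be checked numerically. The sketch below assumes independent sampling, so that the number of cases follows a binomial distribution; the function names are illustrative only.

```python
def expected_cases(p, n):
    """Expected number of cases in a sample of size n when the true prevalence is p."""
    return n * p


def prob_no_cases(p, n):
    """Probability that a sample of size n contains no case at all (binomial model)."""
    return (1 - p) ** n


n = 385   # empirical sample size from the blind assumption P = 0.5, d = 0.05
p = 0.01  # assumed true prevalence of the rare condition
print(f"expected cases: {expected_cases(p, n):.2f}")
print(f"probability of capturing no case: {prob_no_cases(p, n):.3f}")
```

Under these assumptions the sample is expected to contain about 3.85 cases, and there is roughly a 2 % chance that it contains none, which is the ‘finite probability’ referred to in the text.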

At a minimum, this requires an estimate as to whether P is <0.1, falls between 0.1 and 0.9, or is >0.9. If we can estimate this, our value for allowable error (d) will change, as discussed above, and we will get a more meaningful calculation for sample size (n). Table 1 illustrates this calculation for a few representative values of P ≤ 0.05 and P ≥ 0.95. Notice how n for these extremes of P differs from the empirical figure of 385, increasing symmetrically as we approach P = 0 or P = 1. It is intuitively obvious that at P = 0, n will become infinite. If there are no cases of the given condition in a given population, no matter how large a sample we take, we will never find one! A similar argument holds true for P = 1. So, our problem boils down to finding whether P is <0.1, between 0.1 and 0.9 or >0.9. This can be easily estimated by even a crude pilot study. To summarize, if we have an estimate for P from prior studies, we can use it, erring towards P = 0.5; otherwise, the best strategy is to conduct a pilot study, estimate P, and use it for calculating a sample size.
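The Table 1 calculation can be reproduced with a short script, assuming the ‘relative’ allowable error recommended earlier (d = P/2 for P < 0.1, d = (1 − P)/2 for P > 0.9, and an absolute d = 0.05 otherwise). The function names are hypothetical.

```python
def allowable_error(p, absolute_d=0.05):
    """Relative allowable error for extreme prevalences, absolute d otherwise."""
    if p < 0.1:
        return p / 2
    if p > 0.9:
        return (1 - p) / 2
    return absolute_d


def sample_size(p, d, z=1.96):
    """n = z^2 * P * (1 - P) / d^2 with P and d as decimal fractions."""
    return (z ** 2) * p * (1 - p) / (d ** 2)


print(f"{'z':>5} {'d':>6} {'P':>5} {'n':>9}")
for p in (0.01, 0.02, 0.03, 0.04, 0.05, 0.95, 0.96, 0.97, 0.98, 0.99):
    d = allowable_error(p)
    print(f"{1.96:>5} {d:>6.3f} {p:>5.2f} {sample_size(p, d):>9.2f}")
```

Because P(1 − P) and d are symmetric about P = 0.5 under this rule, the values of n for P and 1 − P coincide.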

Assumption of Normal Approximation

The formula for sample size estimation is based on the assumption that the sample will capture and correctly classify a certain minimum number (≥5) of cases and non-cases. This means that nP and n(1 − P) must be ≥5. We know that the allowable error is directly related to the standard error of proportion (SE_p) by the equation:

$$d = z \times \mathrm{SE}_p$$

Also:

$$\mathrm{SE}_p = \sqrt{\frac{P(1-P)}{n}}$$
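Substituting the second relation into the first and solving for n makes the link to the sample size formula explicit. The steps below are a routine algebraic re-arrangement written out for convenience:

```latex
\begin{aligned}
d &= z \sqrt{\frac{P(1-P)}{n}} \\
d^2 &= \frac{z^2 \, P(1-P)}{n} \\
n &= \frac{z^2 \, P(1-P)}{d^2}
\end{aligned}
```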

Using the above 2 equations and re-arranging, we can get the formula for n discussed above. For z = 1.96 and d = 0.05, the equation would be:

$$0.05 = 1.96 \times \mathrm{SE}_p$$

This means that 1.96 standard errors of our estimate would be equal to 0.05. Stated otherwise, if 1.96 × SE_p is equal to 0.05, then our prevalence estimate has a 95 % chance of being within 5 percentage points of the true prevalence. This inherently assumes that n is sufficiently large, so that the estimates of P obtained from all possible samples have a normal distribution. It also means that if we were to repeat our study, about 95 % of the times our prevalence estimate would fall within the range specified by P ± 1.96 × SE_p. This, indeed, shows that these values have a normal distribution with a mean of P and a standard deviation of approximately d/z (Box 2). This is nothing but approximating a binomial distribution with a normal distribution, justified by the central limit theorem. In this context, P is the proportion of successes in a Bernoulli trial process estimated from a sample of size n, with z(1−α/2) as the (1 − α/2) percentile of the standard normal distribution and α as the error percentile. The central limit theorem is applicable to a binomial distribution when the proportion is not very close to 0 or 1, i.e., the normal approximation fails when the sample proportion ‘approaches’ these values.

Table 1 Calculating n for very rare (P ≤ 0.05) and very common (P ≥ 0.95) conditions. Precision is calculated as discussed in text

z       d       P       n
1.96    0.005   0.01    1521.27
1.96    0.01    0.02     752.95
1.96    0.015   0.03     496.85
1.96    0.02    0.04     368.79
1.96    0.025   0.05     291.96
1.96    0.025   0.95     291.96
1.96    0.02    0.96     368.79
1.96    0.015   0.97     496.85
1.96    0.01    0.98     752.95
1.96    0.005   0.99    1521.27

Fig. 1 Sample size n (y-axis) as a function of expected prevalence P (x-axis) for z = 1.96 and d = 0.05 has a maximum at n = 385 for P = 0.5. For P = 1 the equation returns an indeterminate value
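The rule of thumb stated at the start of this section, that both nP and n(1 − P) should be at least 5, is easy to check before committing to a sample size. The helper below is a hypothetical sketch; the threshold of 5 is taken from the text.

```python
def normal_approx_ok(n, p, minimum=5):
    """True when both the expected cases and non-cases reach the minimum count."""
    return n * p >= minimum and n * (1 - p) >= minimum


# n = 385 with a mid-range prevalence satisfies the rule ...
print(normal_approx_ok(385, 0.5))
# ... but fails it for a rare condition (expected cases = 3.85).
print(normal_approx_ok(385, 0.01))
# The Table 1 sample size for P = 0.01 (rounded up) restores it.
print(normal_approx_ok(1522, 0.01))
```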

Finite Population Correction

The above sample size formula is valid only for samples which are relatively small compared to the total population. If the total population is N, then n/N should be ≤0.05. When the sampling fraction is larger than approximately 5 %, the estimate of standard error used in the sample size formula must be corrected by multiplying it with a finite population correction (FPC), to account for the added precision gained by sampling a larger percentage of the population.

$$\mathrm{FPC} = \sqrt{\frac{N - n}{N - 1}}$$

The FPC reduces to zero when n = N, since a complete enumeration of the population leaves no sampling error. The formula for sample size calculation including the effect of FPC is:

$$n' = \frac{N \, z^2 \, P(1-P)}{d^2 (N - 1) + z^2 \, P(1-P)}$$

where n′ is the sample size with finite population correction, N is the population size, z is the statistic for the level of confidence, P is the expected proportion and d is the allowable error. The need for FPC arises because typically the sampling method in prevalence studies is without replacement (hypergeometric sampling), but the sample size estimation method is based on sampling with replacement (binomial distribution). When the target population is very large (theoretically infinite), the hypergeometric distribution can be well approximated by a binomial one, and thus the formula based on the latter is acceptable. Any investigator using FPC in the sample size estimate should also incorporate an adjustment for hypergeometric sampling in the analysis.

Box 2 Worked example illustrating principles of sample size estimation in prevalence study
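The finite population formula can be sketched as follows. The population size of 2000 is an invented illustration, and the function name `sample_size_fpc` is an assumption; only the formula itself comes from the text.

```python
import math


def sample_size_fpc(population, p, d, z=1.96):
    """n' = N z^2 P(1-P) / (d^2 (N-1) + z^2 P(1-P)) for a finite population N."""
    if population < 2:
        raise ValueError("population must contain at least 2 units")
    x = (z ** 2) * p * (1 - p)
    return population * x / ((d ** 2) * (population - 1) + x)


# Hypothetical small community of 2000 people, P = 0.5, d = 0.05.
n_small = sample_size_fpc(2000, 0.5, 0.05)
print(math.ceil(n_small))  # fewer subjects than the uncorrected 385

# For a very large population the correction becomes negligible.
print(round(sample_size_fpc(10 ** 9, 0.5, 0.05), 2))
```

In this example the corrected sample size is 323, against 385 without the correction, which reflects that the uncorrected sampling fraction (385/2000) is well above 5 %.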

Design Effect

The above calculations for sample size assume independence among sampling units, i.e., they are only valid for simple or systematic random sampling. Lack of independence introduced by other sampling methods should be adjusted for by changing the variance estimate, e.g., for clustered sampling, the design effect (DE) or variance inflation factor is defined as:

$$\mathrm{DE} = 1 + \rho\,(m - 1)$$

where ρ is the intra-class correlation and m is the sample size within each cluster. The intra-class correlation is a relative measure of the homogeneity of sampling units within each cluster compared with a randomly selected unit, defined as the proportion of total variation within sampling units that can be accounted for by variation among clusters. It is generally estimated from prior literature or pilot data. If the number of clusters is fixed by design and the sample size per cluster (m) is unknown, then that has to be estimated first as:

$$m = \frac{\mathrm{ES}\,(1 - \rho)}{k - \rho\,\mathrm{ES}}$$

where ES is the effective sample size assuming independence and k is the number of clusters. Finally, the sample size (n′) for cluster or multi-stage sampling is calculated using:

$$n' = n \times \mathrm{DE}$$

where n is the sample size estimated assuming simple random sampling. DE is actually a ratio measure that describes how much precision is gained or lost if a more complex sampling strategy is used instead of simple random sampling. Usually, complex sampling techniques lead to a decrease in precision, resulting in DE > 1. This also implies that, for the same precision, a cluster or multistage sampling technique would require a larger sample size than simple or systematic random sampling. A classic example is cluster surveys for immunization coverage, where DE has been shown to be approximately 1.9 [2].
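The three cluster-sampling relations above can be combined in a short sketch. The intra-class correlation of 0.05, the 30 clusters and the effective sample size of 385 are invented values for illustration, and the function names are assumptions.

```python
import math


def design_effect(rho, m):
    """DE = 1 + rho * (m - 1) for intra-class correlation rho and cluster size m."""
    return 1 + rho * (m - 1)


def cluster_size(es, k, rho):
    """m = ES (1 - rho) / (k - rho ES) when the number of clusters k is fixed."""
    denominator = k - rho * es
    if denominator <= 0:
        raise ValueError("too few clusters for this effective sample size and rho")
    return es * (1 - rho) / denominator


es, k, rho = 385, 30, 0.05          # hypothetical inputs
m = cluster_size(es, k, rho)        # subjects needed per cluster (before rounding)
de = design_effect(rho, m)
n_cluster = es * de                 # n' = n * DE
print(f"m = {m:.2f}, DE = {de:.2f}, total n' = {math.ceil(n_cluster)}")
```

A useful consistency check is that k × m equals n × DE, since the expression for m is obtained by solving exactly that identity.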

Sample size calculations are based on large sample approximation methods. Together, FPC, DE and continuity correction are ‘adjustment factors’ which help improve specific sample size approximations to exact distributions. The continuity correction factor is employed to compare two population proportions, usually in a diagnostic evaluation, and is not discussed further here [4].

Non-Response Correction

The formula for sample size estimation inherently assumes that for a chosen CI or d, there will be no non-response or missing data, a condition rarely met. Suppose we wish to estimate the prevalence of depression in a given population using a questionnaire. In a real-life situation, it is improbable that every one of our subjects will meticulously fill in the survey form and return it, resulting in some loss to our estimated sample size. This loss is difficult to estimate upfront, and is usually a clinical decision. If n is the desired sample size, then for an expected proportion of loss r we would have to oversample to n′ such that:

$$n' - r\,n' = n$$

Solving for n′:

$$n' = \frac{n}{1 - r}$$
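As a quick illustration of the oversampling rule, the sketch below inflates a desired sample size for an assumed non-response proportion. The 10 % loss is an invented figure and the function name is hypothetical.

```python
import math


def inflate_for_nonresponse(n, r):
    """Return n' = n / (1 - r), where r is the expected proportion of loss."""
    if not 0 <= r < 1:
        raise ValueError("r must satisfy 0 <= r < 1")
    return n / (1 - r)


n_desired = 385
n_to_recruit = math.ceil(inflate_for_nonresponse(n_desired, r=0.10))
print(n_to_recruit)  # number to approach so that about 385 usable responses remain
```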

‘Zero Patient’ Method

A very interesting method has been proposed to provide comparative estimates of the frequencies of rare conditions even when the true prevalences remain unknown [5]. Let us say we take a sample of n people from a population such that all people in our sample are free of disease. Then the 95 % upper bound CI for the prevalence P of the disease can be computed from the equation:

$$\alpha = (1 - P)^n$$

This corresponds to rejection of the null hypothesis that the true prevalence is P at a 1-sided significance level of α. Taking the natural logarithm, setting α = 0.05 and re-arranging, we get a simple relationship:

$$n = \frac{-\ln(0.05)}{P} \approx \frac{3}{P}$$

This simplification makes the assumption that ln(1 − P) ≈ −P, which holds true only for P < 0.02. Hence, there are two important things to note in using this formula for sample size estimation. First, a previous estimate of the prevalence in a region with relatively high disease frequency, determined by conventional methods, will be helpful in determining the size of the population to be screened in a region thought to have a lesser prevalence. Secondly, this formula will be suitable only for conditions with a prevalence of less than 2 % in the target population undergoing screening.

For example, Gaucher disease type 3 is relatively common in the population of the Northern Swedish region of Norrbotten, having a prevalence of approximately 0.0006. So, if one screens about 3/0.0006 = 5000 people from any population and does not find a case of this disease in the sample, then one could say with 95 % confidence that the prevalence of Gaucher disease type 3 is less than 0.0006 in that population. Extending this argument, if 10000 people from the same population are screened for Gaucher disease type 3 and not a single case is found, then a conclusion can be made with 95 % confidence that the prevalence of this disease is less than 0.0003 in the given population.
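The Gaucher disease example can be checked numerically. The sketch below compares the exact solution of α = (1 − P)^n with the 3/P shortcut and computes the upper bound implied by a disease-free sample; the function names are assumptions made for illustration.

```python
import math


def zero_patient_n_exact(p, alpha=0.05):
    """Solve alpha = (1 - P)^n for n without the ln(1 - P) ~ -P approximation."""
    return math.log(alpha) / math.log(1 - p)


def zero_patient_n_approx(p):
    """The 'rule of three' shortcut n ~ 3 / P used in the text."""
    return 3 / p


def upper_bound(n, alpha=0.05):
    """One-sided upper bound for P when n screened people are all disease free."""
    return 1 - alpha ** (1 / n)


p_reference = 0.0006  # prevalence quoted for the reference region
print(f"exact n  : {zero_patient_n_exact(p_reference):.1f}")
print(f"approx n : {zero_patient_n_approx(p_reference):.1f}")
for n in (5000, 10000):
    print(f"n = {n}: upper bound = {upper_bound(n):.6f}")
```

For prevalences this small the shortcut is slightly conservative, overestimating the exact n by only a few subjects.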

If we introduce the concept of statistical power in this formula, it will represent the probability of all persons within the sample of size n being disease free, given that the true prevalence of disease in the population is P′ (alternative hypothesis) instead of P (null hypothesis), such that P′ ≪ P. In such a case, power (Θ) will be:

$$\Theta = (1 - P')^n$$

Taking natural logarithms and substituting n = 3/P and Θ = 0.8 yields P′ = 0.074 × P. In the example of Gaucher disease type 3, this means that, assuming the null prevalence in the Northern Swedish region of 0.0006, the alternative prevalence in our hypothetical population should be less than 4.4 × 10⁻⁵ (= 0.074 × 0.0006) for this method to have at least 80 % power. This power will increase for yet smaller prevalence.
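The factor 0.074 can be verified by solving Θ = (1 − P′)^n for P′. This is a minimal sketch; the function name is hypothetical and the inputs are those of the Gaucher example.

```python
import math


def alternative_prevalence(n, power=0.8):
    """Largest P' for which a disease-free sample of size n occurs with this power."""
    # Solves power = (1 - P')^n; expm1 keeps precision for very small P'.
    return -math.expm1(math.log(power) / n)


p_null = 0.0006
n = round(3 / p_null)               # 5000 people screened
p_alt = alternative_prevalence(n)
print(f"P' = {p_alt:.2e}, ratio P'/P = {p_alt / p_null:.3f}")
```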

This method is called the “zero patient” design and can only be used for estimating the relative prevalence of a disease in a population, provided that an estimate of its prevalence is available in some other population having a relatively higher prevalence of it, and that the disease has an estimated prevalence of less than 2 % in the population in which we want to have the estimate. This method cannot provide an estimate of true prevalence but can generate a hypothesis for the same.

Conclusions

Sample size estimation for prevalence studies is straightforward for the most part. Certain challenging situations, like extremes of prevalence, small target population, sampling technique, expected misclassification or missing data, require diligent choice of the calculating formula and plug-in values.

Conflict of Interest None.

Role of Funding Source None.

References

1. Antonisamy B, Christopher S, Samuel PP. Biostatistics: principles and practice. New Delhi: Tata McGraw-Hill; 2010.

2. Macfarlane SB. Conducting a descriptive survey: 2. Choosing a sampling strategy. Trop Doct. 1997;27:14–21.

3. Lwanga SK, Lemeshow S. Sample size determination in health studies: a practical manual. Geneva: World Health Organization; 1991.

4. Fosgate GT. Practical sample size calculations for surveillance and diagnostic investigations. J Vet Diagn Invest. 2009;21:3–14.

5. Yazici H, Biyikli M, van der Linden S, Schouten HJ. The ‘zero patient’ design to compare the prevalences of rare diseases. Rheumatology (Oxford). 2001;40:121–2.
