
Chapter 3

Sampling

Contents

3.1 Introduction
    3.1.1 Difference between sampling and experimental design
    3.1.2 Why sample rather than census?
    3.1.3 Principal steps in a survey
    3.1.4 Probability sampling vs. non-probability sampling
    3.1.5 The importance of randomization in survey design
    3.1.6 Model vs. Design based sampling
    3.1.7 Software
3.2 Overview of Sampling Methods
    3.2.1 Simple Random Sampling
    3.2.2 Systematic Surveys
    3.2.3 Cluster sampling
    3.2.4 Multi-stage sampling
    3.2.5 Multi-phase designs
    3.2.6 Panel design - suitable for long-term monitoring
    3.2.7 Sampling non-discrete objects
    3.2.8 Key considerations when designing or analyzing a survey
3.3 Notation
3.4 Simple Random Sampling Without Replacement (SRSWOR)
    3.4.1 Summary of main results
    3.4.2 Estimating the Population Mean
    3.4.3 Estimating the Population Total
    3.4.4 Estimating Population Proportions
    3.4.5 Example - estimating total catch of fish in a recreational fishery
3.5 Sample size determination for a simple random sample
    3.5.1 Example - How many angling-parties to survey
3.6 Systematic sampling
    3.6.1 Advantages of systematic sampling
    3.6.2 Disadvantages of systematic sampling
    3.6.3 How to select a systematic sample
    3.6.4 Analyzing a systematic sample
    3.6.5 Technical notes - Repeated systematic sampling
3.7 Stratified simple random sampling
    3.7.1 A visual comparison of a simple random sample vs. a stratified simple random sample
    3.7.2 Notation
    3.7.3 Summary of main results
    3.7.4 Example - sampling organic matter from a lake
    3.7.5 Example - estimating the total catch of salmon
    3.7.6 Sample Size for Stratified Designs
    3.7.7 Allocating samples among strata
    3.7.8 Example: Estimating the number of tundra swans
    3.7.9 Post-stratification
    3.7.10 Allocation and precision - revisited
3.8 Ratio estimation in SRS - improving precision with auxiliary information
    3.8.1 Summary of Main results
    3.8.2 Example - wolf/moose ratio
    3.8.3 Example - Grouse numbers - using a ratio estimator to estimate a population total
3.9 Additional ways to improve precision
    3.9.1 Using both stratification and auxiliary variables
    3.9.2 Regression Estimators
    3.9.3 Sampling with unequal probability - pps sampling
3.10 Cluster sampling
    3.10.1 Sampling plan
    3.10.2 Advantages and disadvantages of cluster sampling compared to SRS
    3.10.3 Notation
    3.10.4 Summary of main results
    3.10.5 Example - estimating the density of urchins
    3.10.6 Example - estimating the total number of sea cucumbers
3.11 Multi-stage sampling - a generalization of cluster sampling
    3.11.1 Introduction
    3.11.2 Notation
    3.11.3 Summary of main results
    3.11.4 Example - estimating number of clams
    3.11.5 Some closing comments on multi-stage designs
3.12 Analytical surveys - almost experimental design
3.13 References
3.14 Frequently Asked Questions (FAQ)
    3.14.1 Confusion about the definition of a population
    3.14.2 How is N defined
    3.14.3 Multi-stage vs. Multi-phase sampling
    3.14.4 What is the difference between a Population and a frame?
    3.14.5 How to account for missing transects

The suggested citation for this chapter of notes is:

Schwarz, C. J. (2019). Sampling. In Course Notes for Beginning and Intermediate Statistics. Available at http://www.stat.sfu.ca/~cschwarz/CourseNotes. Retrieved 2019-11-03.

c©2019 Carl James Schwarz 100 2019-11-03

3.1 Introduction

Today the word "survey" is used most often to describe a method of gathering information from a sample of individuals or animals or areas. This "sample" is usually just a fraction of the population being studied.

You are exposed to survey results almost every day. For example, election polls, the unemployment rate, and the consumer price index are all results of surveys. On the other hand, some common headlines are NOT the results of surveys, but rather the results of experiments. For example, is a new drug just as effective as an old drug?

Not only do surveys have a wide variety of purposes, they also can be conducted in many ways, including over the telephone, by mail, or in person. Nonetheless, all surveys do have certain characteristics in common. All surveys require a great deal of planning in order that the results are informative.

Unlike a census, where all members of the population are studied, surveys gather information from only a portion of a population of interest, with the size of the sample depending on the purpose of the study. Surprisingly to many people, a survey can give better quality results than a census.

In a bona fide survey, the sample is not selected haphazardly. It is scientifically chosen so that each object in the population will have a measurable chance of selection. This way, the results can be reliably projected from the sample to the larger population.

Information is collected by means of standardized procedures. The survey’s intent is not to describe the particular objects which, by chance, are part of the sample but to obtain a composite profile of the population.

3.1.1 Difference between sampling and experimental design

There are two key differences between survey sampling and experimental design.

• In experiments, one deliberately perturbs some part of the population to see the effect of the action. In sampling, one wishes to see what the population is like without disturbing it.

• In experiments, the objective is to compare the mean response to changes in levels of the factors. In sampling, the objective is to describe the characteristics of the population. However, refer to the section on analytical sampling later in this chapter for when sampling looks very similar to experimental design.

3.1.2 Why sample rather than census?

There are a number of advantages of sampling over a complete census:

• reduced cost

• greater speed - a much smaller scale of operations is performed

• greater scope - if highly trained personnel or equipment is needed

• greater accuracy - easier to train small crew, supervise them, and reduce data entry errors

• reduced respondent burden

• in destructive sampling you can’t measure the entire population - e.g. crash tests of cars

3.1.3 Principal steps in a survey

The principal steps in a survey are:

• formulate the objectives of the survey - need concise statement

• define the population to be sampled - e.g. what is the range of animals or locations to be measured? Note that the population is the set of final sampling units that will be measured - refer to the FAQ at the end of the chapter for more information.

• establish what data is to be collected - collect a few items well rather than many poorly

• what degree of precision is required - examine power needed

• establish the frame - this is a list of sampling units that is exhaustive and exclusive

– in many cases the frame is obvious, but in others it is not

– it is often very difficult to establish a frame - e.g. a list of all streams in the lower mainland.

• choose among the various designs; will you stratify? There are a variety of sampling plans, some of which will be discussed in detail later in this chapter. Some common designs in ecological studies are:

– simple random sampling

– systematic sample

– cluster sampling

– multi-stage design

All designs can be improved by stratification, so this should always be considered during the design phase.

• pre-test - very important to try out field methods and questionnaires

• organization of field work - training, pre-test, etc

• summary and data analysis - easiest part if earlier parts done well

• post-mortem - what went well, poorly, etc.

3.1.4 Probability sampling vs. non-probability sampling

There are two types of sampling plans - probability sampling where units are chosen in a ‘random fashion’ and non-probability sampling where units are chosen in some deliberate fashion.

In probability sampling

• every unit has a known probability of being in the sample

• the sample is drawn with some method consistent with these probabilities

• these selection probabilities are used when making estimates from the sample

The advantages of probability sampling

• we can study biases of the sampling plans

• standard errors and measures of precision (confidence limits) can be obtained

Some types of non-probability sampling plan include:

• quota sampling - select 50 M and 50 F from the population

– less expensive than a probability sample

– may be only option if no frame exists

• judgmental sampling - select ‘average’ or ‘typical’ value. This is a quick and dirty sampling method and can perform well if there are a few extreme points which should not be included.

• convenience sampling - select those readily available. This is useful if it is dangerous or unpleasant to sample directly. For example, selecting blood samples from grizzly bears.

• haphazard sampling (not the same as random sampling). This is often useful if the sampling material is homogeneous and spread throughout the population, e.g. chemicals in drinking water.

The disadvantages of non-probability sampling include

• unable to assess biases in any rational way.

• no estimates of precision can be obtained. In particular, the simple use of formulae from probability sampling is WRONG!

• experts may disagree on what is the “best” sample.

3.1.5 The importance of randomization in survey design

[With thanks to Dr. Rick Routledge for this part of the notes.]

. . . I had to make a ‘cover degree’ study... This involved the use of a Raunkiaer’s Circle, a device designed in hell. In appearance it was all simple innocence, being no more than a big metal hoop; but in use it was a devil’s mechanism for driving sane men mad. To use it, one stood on a stretch of muskeg, shut one’s eyes, spun around several times like a top, and then flung the circle as far away as possible. This complicated procedure was designed to ensure that the throw was truly ‘random’; but, in the event, it inevitably resulted in my losing sight of the hoop entirely, and having to spend an unconscionable time searching for the thing.

Farley Mowat, Never Cry Wolf. McLelland and Stewart, 1963.

Why would a field biologist in the early post-war period be instructed to follow such a bizarre-looking scheme for collecting a representative sample of tundra vegetation? Could she not have obtained a typical cross-section of the vegetation by using her own judgment? Undoubtedly, she could have convinced herself that by replacing an awkward, haphazard sampling scheme with one dependent solely on her own judgment and common sense, she could have been guaranteed a more representative sample. But would others be convinced? A careful, objective scientist is trained to be skeptical. She would be reluctant to accept any evidence whose validity depended critically on the judgment and skills of a stranger. The burden of proof would then rest squarely with Farley Mowat to prove his ability to take representative, judgmental samples. It is typically far easier for a scientist to use randomization in her sampling procedures than it is to prove her judgmental skills.

Hovering and Patrolling Bees

It is often difficult, if not impossible, to take a properly randomized sample. Consider, e.g., the problem faced by Alcock et al. (1977) in studying the behavior of male bees of the species Centris pallida in the deserts of the south-western United States. Females pupate in underground burrows. To maximize the presence of his genes in the next generation, a male of the species needs to mate with as many virgin females as possible. One strategy is to patrol the burrowing area at a low altitude, and nab an emerging female as soon as her presence is detected. This patrolling strategy seems to involve a relatively high risk of confrontation with other patrolling males. The other strategy reported by the authors is to hover farther above the burrowing area, and mate with those females who escape detection by the patrollers. These hoverers appear to be involved in fewer conflicts.

Because the hoverers tend to be less involved in aggressive confrontations, one might guess that they would tend to be somewhat smaller than the more aggressive patrollers. To assess this hypothesis, the authors took measurements of head widths for each of the two subpopulations. Of course, they could not capture every single male bee in the population. They had to be content with a sample.

Sample sizes and results are reported in the Table below. How are we to interpret these results? The sampled hoverers obviously tended to be somewhat smaller than the sampled patrollers, although it appears from the standard deviations that some hoverers were larger than the average-sized patroller and vice-versa. Hence, the difference is not overwhelming, and may be attributable to sampling errors.

Table: Summary of head width measurements on two samples of bees.

Sample        n      Mean (ȳ)    SD
Hoverers      50     4.92 mm     0.15 mm
Patrollers    100    5.14 mm     0.29 mm

If the sampling were truly randomized, then the only sampling errors would be chance errors, whose probable size can be assessed by a standard t-test. Exactly how were the samples taken? Is it possible that the sampling procedure used to select patrolling bees might favor the capture of larger bees, for example? This issue is indeed addressed by the authors. They carefully explain how they attempted to obtain unbiased samples. For example, to sample the patrolling bees, they made a sweep across the sampling area, attempting to catch all the patrolling bees that they observed. To assess the potential for bias, one must in the end make a subjective judgment.
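To make the "chance error" calculation concrete, the following is a minimal sketch of a two-sample t-test computed from the summary statistics in the table. It assumes, hypothetically, that both samples were simple random samples, and it uses Welch's unequal-variance form of the test (a choice made here because the two standard deviations differ; the notes only say "a standard t-test"). The function name `welch_t` is illustrative.

```python
import math

# Summary statistics copied from the table of head widths (mm).
hoverers = {"n": 50, "mean": 4.92, "sd": 0.15}
patrollers = {"n": 100, "mean": 5.14, "sd": 0.29}


def welch_t(a, b):
    """Return (t, df) for the difference in means b - a using Welch's
    unequal-variance t-test, computed from summary statistics only."""
    var_a = a["sd"] ** 2 / a["n"]  # squared standard error of mean a
    var_b = b["sd"] ** 2 / b["n"]  # squared standard error of mean b
    se_diff = math.sqrt(var_a + var_b)
    t = (b["mean"] - a["mean"]) / se_diff
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (a["n"] - 1) + var_b ** 2 / (b["n"] - 1)
    )
    return t, df


t_stat, df = welch_t(hoverers, patrollers)
print(f"difference = {patrollers['mean'] - hoverers['mean']:.2f} mm")
print(f"t = {t_stat:.2f} on about {df:.0f} degrees of freedom")
```

A t-statistic of this kind only measures the size of the difference relative to chance error. It says nothing about selection bias, which is the concern raised in the text.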

Why make all this fuss over a technical possibility? It is important to do so because lack of attention to such possibilities has led to some colossal errors in the past. Nowhere are they more obvious than in the field of election prediction. Most of us never find out the real nature of the population that we are sampling. Hence, we never know the true size of our errors. By contrast, pollsters’ errors are often painfully obvious. After the election, the actual percentages are available for everyone to see.

Lessons from Opinion Polling

In the 1930’s, political opinion polling was in its formative years. The pioneers in this endeavor were training themselves on the job. Of the inevitable errors, two were so spectacular as to make international headlines.

In 1936, an American magazine with a large circulation, The Literary Digest, attempted to poll an enormous segment of the American voting public in order to predict the outcome of the presidential election that autumn. Roosevelt, the Democratic candidate, promised to develop programs designed to increase opportunities for the disadvantaged; Landon, the candidate for the Republican Party, appealed more to the wealthier segments of American society. The Literary Digest mailed out questionnaires to about ten million people whose names appeared in such places as subscription lists, club directories, etc. They received over 2.5 million responses, on the basis of which they predicted a comfortable victory for Landon. The election returns soon showed the massive size of their prediction error.

The cumbersome design of this highly publicized survey provided a young, wily pollster with the chance of a lifetime. Between the time that the Digest announced its plans and released its predictions, George Gallup planned and executed a remarkable coup. By polling only a small fraction of these individuals, and a relatively small number of other voters, he correctly predicted not only the outcome of the election, but also the enormous size of the error about to be committed by The Literary Digest.

Obviously, the enormous sample obtained by the Digest was not very representative of the population. The selection procedure was heavily biased in favor of Republican voters. The most obvious source of bias is the method used to generate the list of names and addresses of the people that they contacted. In 1936, only the relatively affluent could afford magazines, telephones, etc., and the more conservative policies of the Republican Party appealed to a greater proportion of this segment of the American public. The Digest’s sample selection procedure was therefore biased in favor of the Republican candidate.

The Literary Digest was guilty of taking a sample of convenience. Samples of convenience are typically prone to bias. Any researcher who, either by choice or necessity, uses such a sample, has to be prepared to defend his findings against possible charges of bias. As this example shows, such bias can have catastrophic consequences.
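The mechanism behind the Digest's failure can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population size, the share of affluent voters, the support rates, and the listing probabilities below are all hypothetical numbers chosen for illustration, not historical figures. It compares a very large sample drawn from a frame that over-represents affluent voters with a much smaller simple random sample.

```python
import random

rng = random.Random(1936)  # fixed seed so the sketch is reproducible

# Hypothetical population: all numbers below are illustrative assumptions.
N = 200_000
P_AFFLUENT = 0.30                      # share of affluent voters
P_VOTE_A = {True: 0.70, False: 0.35}   # support for candidate A by group
P_LISTED = {True: 0.50, False: 0.10}   # chance of appearing on a mailing list

population = []
for _ in range(N):
    affluent = rng.random() < P_AFFLUENT
    votes_a = rng.random() < P_VOTE_A[affluent]
    population.append((affluent, votes_a))

true_share = sum(v for _, v in population) / N

# Convenience sample: everyone who happens to appear on a list.
convenience = [v for affluent, v in population
               if rng.random() < P_LISTED[affluent]]
convenience_share = sum(convenience) / len(convenience)

# Simple random sample without replacement, far smaller.
srs = rng.sample(population, 1_000)
srs_share = sum(v for _, v in srs) / len(srs)

print(f"true share for A:            {true_share:.3f}")
print(f"convenience (n={len(convenience)}):  {convenience_share:.3f}")
print(f"SRS (n={len(srs)}):                {srs_share:.3f}")
```

Under these assumptions the convenience sample is tens of times larger than the random sample, yet its estimate is much farther from the truth. A large sample size does nothing to remove selection bias.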

How did Gallup obtain his more representative sample? He did not use randomization. Randomization is often criticized on the grounds that once in a while, it can produce absurdly unrepresentative samples. When faced with a sample that obviously contains far too few economically disadvantaged voters, it is small consolation to know that next time around, the error will likely not be repeated. Gallup used a procedure that virtually guaranteed that his sample would be representative with respect to such obvious features as age, race, etc. He did so by assigning quotas which his interviewers were to fill. One interviewer might, e.g., be assigned to interview 5 adult males with specified characteristics in a tough, inner-city neighborhood. The quotas were devised so as to make the sample mimic known features of the population.

This quota sampling technique suited Gallup’s needs spectacularly well in 1936 even though he underestimated the support for the Democratic candidate by about 6%. His subsequent polls contained the same systematic error. In 1948, the error finally caught up with him. He predicted a narrow victory for the Republican candidate, Dewey. A newspaper editor was so confident of the prediction that he authorized the printing of a headline proclaiming the victory before the official results were available. It turned out that the Democrat, Truman, won by a narrow margin.

What was wrong with Gallup’s sampling technique? He gave his interviewers the final decision as to who would be interviewed. In a tough inner-city neighborhood, an interviewer had the option of passing by a house with several motorcycles parked out in front and sounds of a raucous party coming from within. In the resulting sample, the more conservative (Republican) voters were systematically over-represented.

Gallup learned from his mistakes. His subsequent surveys replaced interviewer discretion with an objective, randomized scheme at the final stage of sample selection. With the dominant source of systematic error removed, his election predictions became even more reliable.

Implications for Biological Surveys

The bias in samples of convenience can be enormous. It can be surprisingly large even in what appear to be carefully designed surveys. It can easily exceed the typical size of the chance error terms. To completely remove the possibility of bias in the selection of a sample, randomization must be employed. Sometimes this is simply not possible, as for example, appears to be the case in the study on bees. When this happens and the investigators wish to use the results of a nonrandomized sample, then the final report should discuss the possibility of selection bias and its potential impact on the conclusions.

Furthermore, when reading a report containing the results of a survey, it is important to carefully evaluate the survey design, and to consider the potential impact of sample selection bias on the conclusions.

Should Farley Mowat really have been content to take his samples by tossing Raunkiaer’s Circle to the winds? Definitely not, for at least two reasons. First, he had to trust that by tossing the circle, he was generating an unbiased sample. It is not at all certain that some types of vegetation would not be selected with a higher probability than others. For example, the higher shrubs would tend to intercept the hoop earlier in its descent than would the smaller herbs. Second, he had no guarantee that his sample would be representative with respect to the major habitat types. Leaving aside potential bias, it is possible that the circle could, by chance, land repeatedly in a snowbed community. It seems indeed foolish to use a sampling scheme which admits the possibility of including only snowbed communities when tundra bogs and fellfields may be equally abundant in the population. In subsequent chapters, we shall look into ways of taking more thoroughly randomized surveys, and into schemes for combining judgment with randomization for eliminating both selection bias and the potential for grossly unrepresentative samples. There are also circumstances in which a systematic sample (e.g., taking transects every 200 meters along a rocky shore line) may be justifiable, but this subject is not discussed in these notes.

3.1.6 Model vs. Design based sampling

Model-based sampling starts by assuming some sort of statistical model for the data in the population, and the goal is to select data to estimate the parameters of this distribution. For example, you may be willing to assume that the values in the population are log-normally distributed. The data collected from the survey are then used along with a likelihood function to estimate the parameters of the distribution.
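As a rough illustration of the model-based approach, the sketch below fits a log-normal model by maximum likelihood. The data are simulated, and the parameter values `MU` and `SIGMA` are hypothetical choices made for this example only. For the log-normal distribution, the maximum likelihood estimates are the mean and the divide-by-n standard deviation of the logged observations.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical survey data drawn from a log-normal distribution;
# MU and SIGMA are illustrative assumptions, not values from the notes.
MU, SIGMA = 1.0, 0.5
sample = [rng.lognormvariate(MU, SIGMA) for _ in range(5_000)]

# Under the log-normal model, the maximum likelihood estimates are the
# mean and (divide-by-n) standard deviation of the logged observations.
logs = [math.log(y) for y in sample]
n = len(logs)
mu_hat = sum(logs) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in logs) / n)

# Model-based estimate of the population mean: exp(mu + sigma^2 / 2).
model_mean = math.exp(mu_hat + sigma_hat ** 2 / 2)
sample_mean = sum(sample) / n

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print(f"model-based mean = {model_mean:.3f}, plain sample mean = {sample_mean:.3f}")
```

Here the model is correct by construction, so the model-based mean and the plain sample mean agree closely. The model-based estimate is only as trustworthy as the log-normal assumption itself.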

Model-based sampling is very powerful because you are willing to make a lot of assumptions about the data process. However, if your model is wrong, there are big problems. For example, what if you assume log-normality but the data are not log-normally distributed? In these cases, the estimates of the parameters can be extremely biased and inefficient.

Design-based sampling makes no assumptions about the distribution of data values in the population. Rather it relies upon the randomization procedure to select representative elements of the population. Estimates from design-based methods are unbiased regardless of the distribution of values in the population, but in “strange” populations can also be inefficient. For example, if a population is highly clustered, a random sample of quadrats will end up with mostly zero observations and a few large values and the resulting estimates will have a large standard error.
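The clustered-population example can be sketched with a short simulation. The two populations of quadrat counts below are invented for illustration and share the same mean. The code repeatedly draws simple random samples from each and compares the average and the spread of the resulting sample means.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Two hypothetical populations of 1,000 quadrats with the same mean count (5).
populations = {
    "clustered": [100] * 50 + [0] * 950,  # a few dense patches, mostly empty
    "even": [4, 5, 6, 5] * 250,           # organisms spread out evenly
}


def srs_means(pop, n, reps):
    """Sample means from `reps` simple random samples (without replacement)."""
    return [statistics.mean(rng.sample(pop, n)) for _ in range(reps)]


results = {name: srs_means(pop, n=50, reps=2_000)
           for name, pop in populations.items()}

for name, means in results.items():
    print(
        f"{name:9s} true mean = {statistics.mean(populations[name]):.2f}  "
        f"average estimate = {statistics.mean(means):.2f}  "
        f"SD of estimates = {statistics.stdev(means):.2f}"
    )
```

In both populations the estimates average out close to the true mean, which is the unbiasedness described above. In the clustered population the individual estimates vary far more from sample to sample, which is what a large standard error means in practice.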

Most of the results in this chapter on survey sampling are design-based, i.e. we don’t need to make any assumptions about normality in the population for the results to be valid.

3.1.7 Software

For a review of packages that can be used to analyze survey data please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/survey-soft.html.

CAUTIONS IN USING STANDARD STATISTICAL SOFTWARE PACKAGES

Standard statistical software packages generally do not take into account four common characteristics of sample survey data: (1) unequal probability selection of observations, (2) clustering of observations, (3) stratification and (4) nonresponse and other adjustments. Point estimates of population parameters are impacted by the value of the analysis weight for each observation. These weights depend upon the selection probabilities and other survey design features such as stratification and clustering. Hence, standard packages will yield biased point estimates if the weights are ignored. The estimated standard errors based on sample survey data are impacted by clustering, stratification and the weights. By ignoring these aspects, standard packages generally underestimate the standard error, sometimes substantially so.

Most standard statistical packages can perform weighted analyses, usually via a WEIGHT statement added to the program code. Use of standard statistical packages with a weighting variable may yield the same point estimates for population parameters as sample survey software packages. However, the estimated standard error often is not correct and can be substantially wrong, depending upon the particular program within the standard software package.
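The effect of ignoring the analysis weights on a point estimate can be seen in a small worked example. The two-stratum survey below is entirely hypothetical: the stratum sizes and observed values are invented for illustration. Each observation's weight is taken to be the inverse of its selection probability, here the stratum population size divided by the stratum sample size. The sketch covers the point estimate only, not the standard error.

```python
# Hypothetical two-stratum survey; all numbers are illustrative assumptions.
# Stratum A: 900 units in the population, 9 sampled  -> weight 900/9  = 100
# Stratum B: 100 units in the population, 10 sampled -> weight 100/10 = 10
strata = {
    "A": {"N": 900, "y": [8, 12, 10, 9, 11, 10, 10, 9, 11]},
    "B": {"N": 100, "y": [45, 55, 50, 48, 52, 50, 49, 51, 50, 50]},
}

values, weights = [], []
for s in strata.values():
    w = s["N"] / len(s["y"])  # analysis weight = 1 / selection probability
    values.extend(s["y"])
    weights.extend([w] * len(s["y"]))

# Ignoring the design: every observation counts equally.
unweighted_mean = sum(values) / len(values)

# Using the design: each observation represents `w` population units.
weighted_mean = sum(w * y for w, y in zip(weights, values)) / sum(weights)

print(f"unweighted mean: {unweighted_mean:.2f}")
print(f"weighted mean:   {weighted_mean:.2f}")
```

Stratum B is heavily over-sampled relative to its size, so the unweighted mean is pulled toward the larger values in that stratum. The weighted mean restores each stratum to its share of the population.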

For further information about the problems of using standard statistical software packages in survey sampling please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

Fortunately, for simple surveys, we can often do the analysis using standard software as will be shown in these notes. Many software packages also have specialized software and, if available, these will be demonstrated.

SAS includes many survey design procedures as shown in these notes.

3.2 Overview of Sampling Methods

3.2.1 Simple Random Sampling

This is the basic method of selecting survey units. Each unit in the population is selected with equal probability and all possible samples are equally likely to be chosen. This is commonly done by listing all the members in the population (the set of sampling units) and then choosing units using a random number table.

An example of a simple random sample would be a vegetation survey in a large forest stand. The stand is divided into 480 one-hectare plots, and a random sample of 24 plots is selected and analyzed using aerial photos. The map of the units selected might look like:

Units are usually chosen without replacement, i.e., each unit in the population can only be chosen once. In some cases (particularly for multi-stage designs), there are advantages to selecting units with replacement, i.e. a unit in the population may potentially be selected more than once. The analysis of a simple random sample is straightforward. The mean of the sample is an estimate of the population mean. An estimate of the population total is obtained by multiplying the sample mean by the number of units in the population. The sampling fraction, the proportion of units chosen from the entire population, is typically small. If it exceeds 5%, an adjustment (the finite population correction) will result in better estimates of precision (a reduction in the standard error) to account for the fact that a substantial fraction of the population was surveyed.
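As a minimal sketch of the selection step, the 24 plots of the vegetation survey could be drawn from the 480 numbered plots with a pseudo-random number generator in place of a random number table. The seed and the numbering of plots from 1 to 480 are assumptions made only for illustration.

```python
import random

N_PLOTS = 480      # plots in the stand (the sampling frame)
SAMPLE_SIZE = 24   # plots to survey

rng = random.Random(2019)  # arbitrary fixed seed so the selection can be reproduced

# random.sample draws without replacement: no plot can be selected twice
selected_plots = sorted(rng.sample(range(1, N_PLOTS + 1), SAMPLE_SIZE))

sampling_fraction = SAMPLE_SIZE / N_PLOTS
print(selected_plots)
print(f"sampling fraction f = {sampling_fraction:.3f}")
```

Here the sampling fraction is 24/480 = 0.05, right at the 5% point where the finite population correction starts to matter.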

A simple random sample design is often ‘hidden’ in the details of many other survey designs. For example, many surveys of vegetation are conducted using strip transects where the initial starting point of the transect is randomly chosen, and then every plot along the transect is measured. Here the strips are the sampling unit, and are a simple random sample from all possible strips. The individual plots are subsamples from each strip and cannot be regarded as independent samples. For example, suppose a rectangular stand is surveyed using aerial overflights. In many cases, random starting points along one edge are selected, and the aircraft then surveys the entire length of the stand starting at the chosen point. The strips are typically analyzed section by section, but it would be incorrect to treat the smaller parts as a simple random sample from the entire stand.

Note that a crucial element of simple random samples is that every sampling unit is chosen independently of every other sampling unit. For example, in strip transects plots along the same transect are not chosen independently - when a particular transect is chosen, all plots along the transect are sampled and so the selected plots are not a simple random sample of all possible plots. Strip-transects are actually examples of cluster-samples. Cluster samples are discussed in greater detail later in this chapter.

3.2.2 Systematic Surveys

In some cases, it is logistically inconvenient to randomly select sample units from the population. An alternative is to take a systematic sample where every kth unit is selected (after a random starting point); k is chosen to give the required sample size. For example, if a stream is 2 km long, and 20 samples are required, then k = 100 and samples are chosen every 100 m along the stream after a random starting point. A common alternative when the population does not naturally divide into discrete units is grid-sampling. Here sampling points are located using a grid that is randomly located in the area. All sampling points are a fixed distance apart.
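The stream example can be sketched as follows. The random start is assumed to be drawn uniformly within the first 100 m interval; the seed is arbitrary.

```python
import random

STREAM_LENGTH_M = 2000   # 2 km stream
N_SAMPLES = 20

k = STREAM_LENGTH_M // N_SAMPLES   # sampling interval in metres (100 m)
rng = random.Random(42)            # arbitrary seed, for reproducibility only
start = rng.random() * k           # random starting point in [0, k)

# One sampling location every k metres after the random start
locations = [start + j * k for j in range(N_SAMPLES)]
print([round(x, 1) for x in locations])
```

Only the starting point is random; once it is chosen, every other location is fixed.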

An example of a systematic sample would be a vegetation survey in a large forest stand. The stand is divided into 480 one-hectare plots. As a total sample size of 24 is required, this implies that we need to sample every 480/24 = 20th plot. We pick a random starting point (the 9th plot) in the first row, and then select every 20th plot reading across rows. The final plan could look like:

If a known trend is present in the sample, this can be incorporated into the analysis (Cochran, 1977, Chapter 8). For example, suppose that the systematic sample follows an elevation gradient that is known to directly influence the response variable. A regression-type correction can be incorporated into the analysis. However, note that this trend must be known from external sources - it cannot be deduced from the survey.

Pitfall: A systematic sample is typically analyzed in the same fashion as a simple random sample. However, the true precision of an estimator from a systematic sample can be either worse or better than a simple random sample of the same size, depending on whether units within the systematic sample are positively or negatively correlated among themselves. For example, if a systematic sample’s sampling interval happens to match a cyclic pattern in the population, values within the systematic sample are highly positively correlated (the sampled units may all hit the ‘peaks’ of the cyclic trend), and the true sampling precision is worse than an SRS of the same size. What is even more unfortunate is that because the units are positively correlated within the sample, the sample variance will underestimate the true variation in the population, and if the estimated precision is computed using the formula for an SRS, a double dose of bias in the estimated precision occurs (Krebs, 1989, p. 227). On the other hand, if the systematic sample is arranged ‘perpendicular’ to a known trend to try to incorporate additional variability in the sample, the units within a sample are now negatively correlated, the true precision is now better than an SRS of the same size, but the sample variance now overestimates the population variance, and the formula for precision from an SRS will overstate the sampling error. While logistically simpler, a systematic sample is only ‘equivalent’ to a simple random sample of the same size if the population units are ‘in random order’ to begin with (Krebs, 1989, p. 227). Even worse, there is no information in the systematic sample that allows the manager to check for hidden trends and cycles.
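The cyclic case can be illustrated with a small constructed example. The population below is entirely hypothetical: 480 units following a pure cycle whose period is deliberately set equal to the sampling interval, which is the worst case described above.

```python
import math
import statistics

N = 480        # population size
k = 20         # sampling interval, giving n = N / k = 24
PERIOD = 20    # hypothetical cycle length, deliberately equal to k

# Hypothetical population: a pure cycle around a mean of 10
population = [10 + 5 * math.sin(2 * math.pi * i / PERIOD) for i in range(N)]

sample_means = []
srs_formula_se = []
for start in range(k):                 # each of the k possible systematic samples
    sample = population[start::k]
    n = len(sample)
    s2 = statistics.variance(sample)   # near zero: all sampled units sit at the same phase
    sample_means.append(statistics.mean(sample))
    srs_formula_se.append(math.sqrt(s2 / n * (1 - n / N)))

# Actual variability of the estimate over all possible random starts
true_se = statistics.pstdev(sample_means)

print(f"largest se from the SRS formula: {max(srs_formula_se):.6f}")
print(f"true se of the systematic sample mean: {true_se:.3f}")
```

Every possible sample reports a standard error of essentially zero from the SRS formula, while the sample mean actually varies from start to start with a standard deviation of about 3.5. Nothing inside any single sample reveals the problem.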

Nevertheless, systematic samples do offer some practical advantages over SRS if some correction can be made to the bias in the estimated precision:

• it is easier to relocate plots for long term monitoring

• mapping can be carried out concurrently with the sampling effort because the ground is systematically traversed. This is less of an issue now with GPS as the exact position can easily be recorded and the plots revisited later.

• it avoids the problem of poorly distributed sampling units which can occur with a SRS [but this can also be avoided by judicious stratification.]

Solution: Because of the necessity for a strong assumption of ‘randomness’ in the original population, systematic samples are discouraged and statistical advice should be sought before starting such a scheme. If there are no other feasible designs, a slight variation in the systematic sample provides some protection from the above problems. Instead of taking a single systematic sample every kth unit, take 2 or 3 independent systematic samples of every 2kth or 3kth unit, each with a different starting point. For example, rather than taking a single systematic sample every 100 m along the stream, two independent systematic samples can be taken, each selecting units every 200 m along the stream starting at two random starting points. The total sample effort is still the same, but now some measure of the large scale spatial structure can be estimated. This technique is known as replicated sub-sampling (Kish, 1965, p. 127).
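A rough sketch of replicated sub-sampling for the stream example follows. The `measure` function is a hypothetical stand-in for field data, and estimating the standard error from the spread of the replicate means is an assumption about how the replicates would be analyzed, not a method stated in these notes. With only two replicates this standard error is itself very imprecise.

```python
import math
import random
import statistics

STREAM_LENGTH_M = 2000
N_REPLICATES = 2
INTERVAL_M = 200       # each replicate takes a unit every 200 m (10 units per replicate)

rng = random.Random(7)  # arbitrary seed


def measure(location_m):
    """Hypothetical response at a point along the stream (placeholder for field data)."""
    return 10 + 0.002 * location_m + math.sin(location_m / 150)


replicate_means = []
for _ in range(N_REPLICATES):
    start = rng.random() * INTERVAL_M          # independent random start per replicate
    locations = [start + j * INTERVAL_M for j in range(STREAM_LENGTH_M // INTERVAL_M)]
    replicate_means.append(statistics.mean([measure(x) for x in locations]))

# Each replicate mean is treated as one independent estimate of the population mean
estimate = statistics.mean(replicate_means)
se = statistics.stdev(replicate_means) / math.sqrt(N_REPLICATES)

print(f"replicate means: {[round(m, 3) for m in replicate_means]}")
print(f"estimate = {estimate:.3f}, se = {se:.3f}")
```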

3.2.3 Cluster sampling

In some cases, units in a population occur naturally in groups or clusters. For example, some animals congregate in herds or family units. It is often convenient to select a random sample of herds and then measure every animal in the herd. This is not the same as a simple random sample of animals because individual animals are not randomly selected; the herds are the sampling unit. The strip-transect example in the section on simple random sampling is also a cluster sample; all plots along a randomly selected transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling units. Another example is circular plot sampling; all trees within a specified radius of a randomly selected point are measured. The sampling unit is the circular plot while trees within the plot are sub-samples.

The reason cluster samples are used is that costs can be reduced compared to a simple random sample giving the same precision. Because units within a cluster are close together, travel costs among units are reduced. Consequently, more clusters (and more total units) can be surveyed for the same cost as a comparable simple random sample.

For example, consider the vegetation survey of previous sections. The 480 plots can be divided into 60 clusters of size 8. A total sample size of 24 is obtained by randomly selecting three clusters from the 60 clusters present in the map, and then surveying ALL eight members of the selected clusters. A map of the design might look like:

Alternatively, clusters are often formed when a transect sample is taken. For example, suppose that the vegetation survey picked an initial starting point on the left margin, and then flew completely across the landscape in a straight line measuring all plots along the route. A map of the design might look like:

In this case, there are three clusters chosen from a possible 30 clusters and the clusters are of unequal size (the middle cluster only has 12 plots measured compared to the 18 plots measured on the other two transects).

Pitfall: A cluster sample is often mistakenly analyzed using methods for simple random surveys. This is not valid because units within a cluster are typically positively correlated. The effect of this erroneous analysis is to come up with an estimate that appears to be more precise than it really is, i.e. the estimated standard error is too small and does not fully reflect the actual imprecision in the estimate.

Solution: In order to be confident that the reported standard error really reflects the uncertainty of the estimate, it is important that the analytical methods are appropriate for the survey design. The proper analysis treats the clusters as a random sample from the population of clusters. The methods of simple random samples are applied to the cluster summary statistics (Thompson, 1992, Chapter 12).
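A minimal sketch of this analysis for the design with 3 of 60 equal-sized clusters is given below. The plot measurements are hypothetical, and the cluster totals are used as the summary statistic (other summaries, such as a ratio estimator, may be preferable when cluster sizes differ). The naive analysis that treats the 24 plots as a simple random sample is included only for comparison.

```python
import math
import statistics

M = 60   # clusters in the population
# Hypothetical measurements on all 8 plots in each of the 3 selected clusters
clusters = [
    [4, 5, 6, 5, 4, 6, 5, 5],
    [9, 8, 10, 9, 11, 9, 8, 8],
    [2, 3, 2, 1, 2, 3, 2, 1],
]
m = len(clusters)

# Proper analysis: one summary value (the total) per cluster, then SRS formulas on those
cluster_totals = [sum(c) for c in clusters]
total_hat = M * statistics.mean(cluster_totals)
se_total = M * math.sqrt(statistics.variance(cluster_totals) / m * (1 - m / M))

# Incorrect analysis: pretend the 24 plots are a simple random sample of the 480 plots
plots = [y for c in clusters for y in c]
N, n = 480, len(plots)
naive_total = N * statistics.mean(plots)
naive_se = N * math.sqrt(statistics.variance(plots) / n * (1 - n / N))

print(f"cluster analysis: total = {total_hat:.0f}, se = {se_total:.0f}")
print(f"naive analysis:   total = {naive_total:.0f}, se = {naive_se:.0f}")
```

With equal cluster sizes both analyses give the same estimated total, but the naive standard error is far smaller because it ignores the similarity of plots within a cluster.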

3.2.4 Multi-stage sampling

In many situations, there are natural divisions of the population into several different sizes of units. For example, a forest management unit consists of several stands, each stand has several cutblocks, and each cutblock can be divided into plots. These divisions can be easily accommodated in a survey through the use of multi-stage methods. Selection of units is done in stages. For example, several stands could be selected from a management area; then several cutblocks are selected in each of the chosen stands; then several plots are selected in each of the chosen cutblocks. Note that in a multi-stage design, units at any stage are selected at random only from those larger units selected in previous stages.

Again consider the vegetation survey of previous sections. The population is again divided into 60 clusters of size 8. However, rather than surveying all units within a cluster, we decide to survey only two units within each cluster. Hence, we now sample at the first stage, a total of 12 clusters out of the 60. In each cluster, we randomly sample 2 of the 8 units. A sample plan might look like the following where the rectangles indicate the clusters selected, and the checks indicate the sub-sample taken from each cluster:

The advantage of multi-stage designs is that costs can be reduced compared to a simple random sample of the same size, primarily through improved logistics. The precision of the results is worse than an equivalent simple random sample, but because costs are less, a larger multi-stage survey can often be done for the same costs as a smaller simple random sample. This often results in a more precise estimate for the same cost. However, due to the misuse of data from complex designs, simple designs are often highly preferred and end up being more cost efficient when costs associated with incorrect decisions are incorporated.

Pitfall: Although random selections are made at each stage, a common error is to analyze these types of surveys as if they arose from a simple random sample. The plots were not independently selected; if a particular cutblock was not chosen, then none of the plots within that cutblock can be chosen. As in cluster samples, the consequences of this erroneous analysis are that the estimated standard errors are too small and do not fully reflect the actual imprecision in the estimates. A manager will be more confident in the estimate than is justified by the survey.

Solution: Again, it is important that the analytical methods are suitable for the sampling design. The proper analysis of multi-stage designs takes into account that random sampling takes place at each stage (Thompson, 1992, Chapter 13). In many cases, the precision of the estimates is determined essentially by the number of first stage units selected. Little is gained by extensive sampling at lower stages.
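As a hedged sketch, the two-stage design with 12 of 60 clusters and 2 of 8 plots per cluster could be analyzed as below. The data are hypothetical, and the variance formula is the standard textbook one for simple random sampling at both stages; it is assumed here rather than taken from these notes. The two variance components are kept separate to show which stage dominates.

```python
import math
import statistics

N_CLUSTERS = 60      # first-stage units in the population
CLUSTER_SIZE = 8     # plots per cluster
# Hypothetical data: 2 plots measured in each of 12 selected clusters
subsamples = [
    (4, 6), (9, 8), (2, 1), (5, 5), (7, 9), (3, 4),
    (6, 6), (10, 8), (1, 2), (5, 7), (4, 3), (8, 9),
]
n_clusters = len(subsamples)

# Stage 2: estimate each selected cluster's total from its subsample
est_cluster_totals = [CLUSTER_SIZE * statistics.mean(s) for s in subsamples]

# Stage 1: expand the estimated cluster totals to the whole population
total_hat = N_CLUSTERS * statistics.mean(est_cluster_totals)

# Variance = between-cluster component + within-cluster (subsampling) component
between = (N_CLUSTERS**2 * (1 - n_clusters / N_CLUSTERS)
           * statistics.variance(est_cluster_totals) / n_clusters)
within = (N_CLUSTERS / n_clusters) * sum(
    CLUSTER_SIZE**2 * (1 - len(s) / CLUSTER_SIZE) * statistics.variance(s) / len(s)
    for s in subsamples
)
se_total = math.sqrt(between + within)

print(f"estimated total = {total_hat:.0f}, se = {se_total:.0f}")
print(f"between-cluster share of variance = {between / (between + within):.3f}")
```

In this made-up data set nearly all of the variance comes from the first stage, which is the pattern described above: adding clusters helps far more than adding plots within clusters.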

3.2.5 Multi-phase designs

In some surveys, multiple surveys of the same survey units are performed. In the first phase, a sample of units is selected (usually by a simple random sample). Every unit is measured on some variable. Then in subsequent phases, samples are selected ONLY from those units selected in the first phase, not from the entire population.

For example, refer back to the vegetation survey. An initial sample of 24 plots is chosen in a simple random survey. Aerial flights are used to quickly measure some characteristic of the plots. A second phase sample of 6 units (circled below) is then measured using ground based methods.

Multiphase designs are commonly used in two situations. First, it is sometimes difficult to stratify a population in advance because the values of the stratification variables are not known. The first phase is used to measure the stratification variable on a random sample of units. The selected units are then stratified, and further samples are taken from each stratum as needed to measure a second variable. This avoids having to measure the second variable on every unit when the strata differ in importance. For example, in the first phase, plots are selected and measured for the amount of insect damage. The plots are then stratified by the amount of damage, and second phase allocation of units concentrates on plots with low insect damage to measure total usable volume of wood. It would be wasteful to measure the volume of wood on plots with heavy insect damage.

The second common occurrence is when it is relatively easy to measure a surrogate variable (related to the real variable of interest) on selected units, and then in the second phase, the real variable of interest is measured on a subset of the units. The relationship between the surrogate and desired variable in the smaller sample is used to adjust the estimate based on the surrogate variable in the larger sample. For example, managers need to estimate the volume of wood removed from a harvesting area. A large sample of logging trucks is weighed (which is easy to do), and weight will serve as a surrogate variable for volume. A smaller sample of trucks (selected from those weighed) is scaled for volume and the relationship between volume and weight from the second phase sample is used to predict volume based on weight only for the first phase sample. Another example is the count plot method of estimating volume of timber in a stand. A selection of plots is chosen and the basal area determined. Then a sub-selection of plots is chosen in the second phase, and volume measurements are made on the second phase plots. The relationship between volume and area in the second phase is used to predict volume from the area measurements made in the first phase.
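The logging-truck example might be sketched as follows. All numbers are hypothetical, and a ratio through the origin is assumed as the weight-volume relationship purely for simplicity; a fitted regression line is an equally plausible choice. No standard error is computed here.

```python
import statistics

# Hypothetical phase 1: weights (tonnes) of a large sample of logging trucks
phase1_weights = [30.1, 28.4, 32.0, 29.5, 31.2, 27.8, 30.6, 29.9, 33.1, 28.7]

# Hypothetical phase 2: a subset of the weighed trucks is also scaled for volume (m^3)
phase2 = [(30.1, 36.0), (27.8, 33.5), (33.1, 39.9)]   # (weight, volume) pairs

# Assumed relationship: volume is proportional to weight (ratio through the origin)
ratio = sum(v for _, v in phase2) / sum(w for w, _ in phase2)

# Predict volume for every truck weighed in phase 1, then average
predicted_volumes = [ratio * w for w in phase1_weights]
mean_volume_hat = statistics.mean(predicted_volumes)

print(f"volume per tonne = {ratio:.3f}")
print(f"estimated mean volume per truck = {mean_volume_hat:.2f}")
```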

3.2.6 Panel design - suitable for long-term monitoring

One common objective of long-term studies is to investigate changes over time of a particular population. There are three common designs.

First, separate independent surveys can be conducted at each time point. This is the simplest design to analyze because all observations are independent over time. For example, independent surveys can be conducted at five year intervals to assess regeneration of cutblocks. However, precision of the estimated change may be poor because of the additional variability introduced by having new units sampled at each time point.

At the other extreme, units are selected in the first survey, permanent monitoring stations are established and the same units are remeasured over time. For example, permanent study plots can be established that are remeasured for regeneration over time. Ceteris paribus (all else being equal), this design is more efficient (i.e. has higher power) than the previous design. The advantage of permanent study plots occurs because in comparisons over time, the effects of that particular monitoring site tend to cancel out and so estimates of variability are free of additional variability introduced by new units being measured at every time point. One possible problem is that survey units may become ‘damaged’ over time, and the sample size will tend to decline over time resulting in a loss of power. Additionally, an analysis of these types of designs is more complex because of the need to account for the correlation over time of measurements on the same sample plot and the need to account for possible missing values when units become ‘damaged’ and are dropped from the study.

A compromise between these two designs is a partial replacement design or panel design. In these designs, a portion of the survey units are replaced with new units at each time point. For example, 1/5 of the units could be replaced by new units at each time point - units would normally stay in the study for a maximum of 5 time periods. This design combines the advantages of repeatedly measuring semi-permanent monitoring stations with the ability to replace (or refresh) the sample if units become damaged or are lost. The analysis of these designs is non-trivial, but manageable with modern software.
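The 1/5 replacement scheme can be laid out as a simple rotation schedule. The panel labels and the number of time periods below are hypothetical; panels with negative labels stand for the groups of units already in place when the study starts.

```python
PANEL_LIFETIME = 5   # each panel (1/5 of the units) stays in the study for 5 time periods
N_PERIODS = 8        # hypothetical length of the study

# Panel p enters the study at time p and is measured at times p, p+1, ..., p+4.
schedule = {
    t: list(range(t - PANEL_LIFETIME + 1, t + 1))
    for t in range(N_PERIODS)
}

for t, panels in schedule.items():
    print(f"time {t}: panels measured = {panels}")
```

At every time point five panels are measured, four of which were also measured at the previous time point, so 1/5 of the units are new each time.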

3.2.7 Sampling non-discrete objects

In some cases, the population does not have natural discrete sampling units. For example, a large section of land may be arbitrarily divided into 1 m² plots, or 10 m² plots. A natural question to ask is what is the ‘best size’ of unit. This has no simple answer and depends upon several factors which must be addressed for each survey:

• Cost. All else being equal, sampling many small plots may be more expensive than sampling fewer larger plots. The primary difference in cost is the overhead in traveling and setup to measure the unit.

• Size of unit. An intuitive feeling is that more smaller plots are better than few large plots because the sample size is larger. This will be true if the characteristic of interest is ‘patchy’, but surprisingly, makes no difference if the characteristic is randomly scattered throughout the area (Krebs, 1989, p. 64). Indeed if the characteristic shows ‘avoidance’, then larger plots are better. For example, competition among trees implies they are spread out more than expected if they were randomly located. Logistic considerations often influence the plot size. For example, if trampling the soil affects the response, then sample plots must be small enough to measure without trampling the soil.

• Edge effects. Because the population does not have natural boundaries, decisions often have to be made about objects that lie on the edge of the sample plot. In general larger square or circular plots are better because of smaller edge-to-area ratio. [A large narrow rectangular plot can have more edge than a similar area square plot.]

• Size of object being measured. Clearly a 1 m² plot is not appropriate when counting mature Douglas-fir, but may be appropriate for a lichen survey.
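The edge-effect point in the list above is simple arithmetic. The sketch below compares the edge-to-area ratio of a few plot shapes; the plot dimensions are hypothetical.

```python
import math


def edge_to_area(perimeter_m, area_m2):
    """Metres of plot edge per square metre of plot area."""
    return perimeter_m / area_m2


AREA = 100.0  # m^2, the same area for the first three shapes

square = edge_to_area(4 * math.sqrt(AREA), AREA)         # 10 m x 10 m square
rectangle = edge_to_area(2 * (50.0 + 2.0), AREA)         # 50 m x 2 m strip
radius = math.sqrt(AREA / math.pi)
circle = edge_to_area(2 * math.pi * radius, AREA)        # circle of equal area
large_square = edge_to_area(4 * 20.0, 400.0)             # 20 m x 20 m square

print(f"circle:        {circle:.3f}")
print(f"square:        {square:.3f}")
print(f"narrow strip:  {rectangle:.3f}")
print(f"larger square: {large_square:.3f}")
```

For equal area the circle has the least edge and the narrow strip the most, and enlarging a square plot lowers the ratio further.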

A pilot study should be carried out prior to a large scale survey to investigate factors that influence the choice of sampling unit size.

3.2.8 Key considerations when designing or analyzing a survey

Key considerations when designing a survey are

• What are the objectives of the survey?

• What is the sampling unit? This should be carefully distinguished from the observational unit. For example, you may sample boats returning from fishing, but interview the individual anglers on the boat.

• What frame is available (if any) for the sampling units? If a frame is available, then direct sampling can be used where the units can be numbered and the randomization used to select the sampling units. If no frame is available, then you will need to figure out how to identify the units and how to select them on the fly. For example, there is no frame of boats returning to an access point, so perhaps a systematic survey of every 5th boat could be used.

• Are all the sampling units the same size? If so, then a simple random sample (or variant thereof) is likely a suitable design. If the units vary considerably in size, then an unequal probability design may be more suitable. For example, if your survey units are forest polygons (as displayed on a GIS), these polygons vary considerably in size with many smaller polygons and fewer larger polygons. A design that selects polygons with a probability proportional to the size of the polygon may be more suited than a simple random sample of polygons.

• Decide upon the sampling design used (i.e. simple random sample, or cluster sample, or multi-stage design, etc.). The availability of the frame and the existence of different sized sampling units will often dictate the type of design used.

• What precision is required for the estimate? This (along with the variability in the response) will determine the sample size needed.

• If you are not stratifying your design, then why not? Stratification is a low-cost or no-cost way to improve your survey.

When analyzing a survey, the key steps are:

• Recognize the design that was used to collect the data. Key pointers to help recognize various designs are:

– How were the units selected? A true simple random sample makes a list of all possible items and then chooses from that list.

– Is there more than one size of sampling unit? For example, were transects selected at random, and then quadrats within the selected transects selected at random? This is usually a multi-stage design.

– Is there a cluster? For example, transects are selected, and these are divided into a series of quadrats - all of which are measured.

Any analysis of the data must use a method that matches the design used to collect the data!

• Are there missing values? How did they occur? If the missingness is MCAR (missing completely at random), then life is easy and the analysis proceeds with a reduced sample size. If the missingness is MAR (missing at random), then some reweighting of the observed data will be required. If the missingness is IM (informative missing), seek help - this is a difficult problem.

• Use a suitable package to analyze the results (avoid Excel except for the simplest designs!).

• Report both the estimate and the measure of precision (the standard error).

3.3 Notation

Unfortunately, sampling theory has developed its own notation that is different from that used for design of experiments or other areas of statistics even though the same concepts are used in both. It would be nice to adopt a general convention for all of statistics - maybe in 100 years this will happen.

Even among sampling textbooks, there is no agreement on notation! (sigh).

In the table below, I’ve summarized the “usual” notation used in sampling theory. In general, upper-case letters refer to population values, while lower-case letters refer to sample values.

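| Characteristic | Population value | Sample value |
| --- | --- | --- |
| number of elements | $N$ | $n$ |
| units | $Y_i$ | $y_i$ |
| total | $\tau = \sum_{i=1}^{N} Y_i$ | $y = \sum_{i=1}^{n} y_i$ |
| mean | $\mu = \frac{1}{N}\sum_{i=1}^{N} Y_i$ | $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ |
| proportion | $\pi = \frac{\tau}{N}$ | $p = \frac{y}{n}$ |
| variance | $S^2 = \sum_{i=1}^{N} \frac{(Y_i-\mu)^2}{N-1}$ | $s^2 = \sum_{i=1}^{n} \frac{(y_i-\bar{y})^2}{n-1}$ |
| variance of a prop | $S^2 = \frac{N}{N-1}\pi(1-\pi)$ | $s^2 = \frac{n\,p(1-p)}{n-1}$ |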

Note:

• The population mean is denoted as $\bar{Y}$ in many books.

• The population total is denoted as $Y$ in many books.

• Again note the distinction between the population quantity (e.g. the population mean $\mu$) and the corresponding sample quantity (e.g. the sample mean $\bar{y}$).

3.4 Simple Random Sampling Without Replacement (SRSWOR)

This forms the basis of many other more complex sampling plans and is the ‘gold standard’ against which all other sampling plans are compared. It often happens that more complex sampling plans consist of a series of simple random samples that are combined in a complex fashion.

In this design, once the frame of units has been enumerated, a sample of size n is selected without replacement from the N population units.

Refer to the previous sections for an illustration of how the units will be selected.

3.4.1 Summary of main results

It turns out that for a simple random sample, the sample mean ($\bar{y}$) is the best estimator for the population mean ($\mu$). The population total is estimated by multiplying the sample mean by the POPULATION size. A proportion is estimated by simply coding results as 0 or 1 depending on whether the sampled unit belongs to the class of interest, and taking the mean of these 0,1 values. (Yes, this really does work - refer to a later section for more details).

As with every estimate, a measure of precision is required. We saw in an earlier chapter that the standard error (se) is such a measure. Recall that the standard error measures how variable the results of our survey would be if the survey were to be repeated. The standard error for the sample mean looks very similar to that for a sample mean from a completely randomized design (refer to later chapters) with a common correction of a finite population factor (the (1 − f) term).

The standard error for the population total estimate is found by multiplying the standard error for the mean by the POPULATION SIZE.

The standard error for a proportion is found, again, by treating each data value as 0 or 1 and applying the same formula as the standard error for a mean.

The following table summarizes the main results:

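| Parameter | Population value | Estimator | Estimated se |
| --- | --- | --- | --- |
| Mean | $\mu$ | $\hat{\mu} = \bar{y}$ | $\sqrt{\frac{s^2}{n}(1-f)}$ |
| Total | $\tau$ | $\hat{\tau} = N \times \hat{\mu} = N\bar{y}$ | $N \times se(\hat{\mu}) = N\sqrt{\frac{s^2}{n}(1-f)}$ |
| Proportion | $\pi$ | $\hat{\pi} = p = \bar{y}_{0/1} = \frac{y}{n}$ | $\sqrt{\frac{p(1-p)}{n-1}(1-f)}$ |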

Notes:

• Inflation factor The term N/n is called the inflation factor and the estimator for the total is sometimes called the expansion estimator or the simple inflation estimator.

• Sampling weight Many statistical packages that analyze survey data will require the specification of a sampling weight. A sampling weight represents how many units in the population are represented by this unit in the sample. In the case of a simple random sample, the sampling weight is equal to N/n. For example, if you select 10 units at random from 150 units in the population, the sampling weight for each observation is 15, i.e. each unit in the sample represents 15 units in the population. The sampling weights are computed differently for various designs and so won't always be equal to N/n.

• Sampling fraction The term n/N is called the sampling fraction and is denoted as f.

• Finite population correction (fpc) The term (1 − f) is called the finite population correction factor and reflects that if you sample a substantial part of the population, the standard error of the estimator is smaller than what would be expected from experimental design results. If f is less than 5%, the correction is often ignored. In most ecological studies the sampling fraction is usually small enough that all of the fpc terms can be ignored.
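The summary results and notes above can be sketched in a few lines of Python. This is only an illustration and not part of the course software: the function name `srswor_summary` and the ten data values are made up for the example, and the population size of 150 is borrowed from the sampling-weight note.

```python
import math

def srswor_summary(values, N):
    """Mean, total, and their estimated se for a SRSWOR of len(values) units from N."""
    n = len(values)
    f = n / N                                            # sampling fraction
    ybar = sum(values) / n                               # estimator of the mean
    s2 = sum((y - ybar) ** 2 for y in values) / (n - 1)  # sample variance
    se_mean = math.sqrt(s2 / n * (1 - f))                # se including the fpc
    return {
        "weight": N / n,          # sampling weight (inflation factor)
        "mean": ybar,
        "se_mean": se_mean,
        "total": N * ybar,        # expansion estimator of the total
        "se_total": N * se_mean,
    }

# Hypothetical measurements on 10 units drawn from a population of 150 units.
sample = [4, 7, 5, 6, 3, 8, 5, 4, 6, 2]
result = srswor_summary(sample, N=150)
print(result)
```

For a proportion, the same function could be applied to the 0/1 coded values, since the proportion is just the mean of the coded data.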

3.4.2 Estimating the Population Mean

The first line of the above table shows the “basic” results and all the remaining lines in the table can be derived from this line as will be shown later.

The population mean (µ) is estimated by the sample mean ($\bar{y}$). The estimated se of the sample mean is

$$se(\bar{y}) = \sqrt{\frac{s^2}{n}(1-f)} = \frac{s}{\sqrt{n}}\sqrt{(1-f)}$$

Note that if the sampling fraction (f) is small, then the standard error of the sample mean can be approximated by:

$$se(\bar{y}) \approx \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}$$

which is the familiar form seen previously. In general, the standard error formula changes depending upon the sampling method used to collect the data and the estimator used on the data. Every different sampling design has its own way of computing the estimator and se.

Confidence intervals for parameters are computed in the usual fashion, i.e. an approximate 95% confidence interval would be found as: estimator ± 2 se. Some textbooks use a t-distribution for smaller sample sizes, but most surveys are sufficiently large that this makes little difference.

3.4.3 Estimating the Population Total

Many students find this part confusing, because of the term population total. This does NOT refer to the total number of units in the population, but rather the sum of the individual values over the units. For example, if you are interested in estimating total timber volume in an inventory unit, the trees are the sampling units. A sample of trees is selected to estimate the mean volume per tree. The total timber volume over all trees in the inventory unit is of interest, not the total number of trees in the inventory unit.

As the population total is found by Nµ (total population size times the population mean), a natural estimator is formed by the product of the population size and the sample mean, i.e. $\hat{\tau} = N\bar{y}$. Note that you must multiply by the population size, not the sample size.

Its estimated se is found by multiplying the estimated se for the sample mean by the population size as well, i.e.,

$$se(\hat{\tau}) = N\sqrt{\frac{s^2}{n}(1-f)}$$

In general, estimates for population totals in most sampling designs are found by multiplying estimates of population means by the population size.

Confidence intervals are found in the usual fashion.

3.4.4 Estimating Population Proportions

A “standard trick” used in survey sampling when estimating a population proportion is to replace the response variable by a 0/1 code and then treat this coded data in the same way as ordinary data.

For example, suppose you were interested in the proportion of fish in a catch that was of a particular species. A sample of 10 fish was selected (of course in the real world, a larger sample would be taken), and the following data were observed (S=sockeye, C=chum):

S C C S S S S C S S

Of the 10 fish sampled, 3 were chum so that the sample proportion of fish that were chum is 3/10 = 0.30.

If the data are recoded using 1=Chum, 0=Sockeye, the sample values would be:

0 1 1 0 0 0 0 1 0 0

The sample average of these numbers gives $\bar{y}$ = 3/10 = 0.30 which is exactly the proportion seen.

It is not surprising then that by recoding the sample using 0/1 variables, the first line in the summary table reduces to the last line in the summary table. In particular, $s^2$ reduces to $np(1-p)/(n-1)$, resulting in the se seen above.
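The reduction of the sample variance for 0/1 data can be checked numerically with the ten fish listed above. The following Python lines are a small sketch; the variable names are chosen for illustration, and the fpc is left out because the size of the whole catch is not given in the example.

```python
import math

fish = "S C C S S S S C S S".split()                      # the ten fish from the text
coded = [1 if species == "C" else 0 for species in fish]  # 1 = chum, 0 = sockeye

n = len(coded)
p = sum(coded) / n                                        # sample proportion of chum
s2_direct = sum((y - p) ** 2 for y in coded) / (n - 1)    # usual sample variance
s2_shortcut = n * p * (1 - p) / (n - 1)                   # form quoted in the text

# se of the proportion without the fpc (population size not stated in the example)
se_p = math.sqrt(s2_direct / n)

print(p, s2_direct, s2_shortcut, se_p)
```

The two variance computations agree, so the se from the "mean" formula and the se from the "proportion" formula are the same number.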

Confidence intervals are computed in the usual fashion.

3.4.5 Example - estimating total catch of fish in a recreational fishery

This section illustrates the concepts of the previous sections using a very small example.

For management purposes, it is important to estimate the total catch by recreational fishers. Unfortunately, there is no central registry of fishers, nor is there a central reporting station. Consequently, surveys are often used to estimate the total catch.

There are two common survey designs used in these types of surveys (generically called creel surveys). In access surveys, observers are stationed at access points to the fishery. For example, if fishers go out in boats to catch the fish, the access points are the marinas where the boats are launched and are returned. From these access points, a sample of fishers is selected and interviews conducted to measure the number of fish captured and other attributes. Roving surveys are commonly used when there is no common access point and you can move among the fishers. In this case, the observer moves about the fishery and questions anglers as they are encountered. Note that in this last design, the chances of encountering an angler are no longer equal - there is a greater chance of encountering an angler who has a longer fishing episode. And, you typically don't encounter the angler at the end of the episode but somewhere in the middle of the episode. The analysis of roving surveys is more complex - seek help.

The following example is based on a real life example from British Columbia. The actual survey is much larger, involving several thousand anglers and sample sizes in the low hundreds, but the basic idea is the same.

An access survey was conducted to estimate the total catch at a lake in British Columbia. Fortunately, access to the lake takes place at a single landing site and most anglers use boats in the fishery. An observer was stationed at the landing site and, because of time constraints, could only interview a portion of the anglers returning, but was able to get a total count of the number of fishing parties on that day. A total of 168 fishing parties arrived at the landing during the day, of which 30 were sampled. The decision to sample a fishing party was made using a random number table as the boat returned to the dock.

The objectives are to estimate the total number of anglers and their catch and to estimate the proportion of boat trips (fishing parties) that had sufficient life-jackets for the members on the trip. Here are the raw data; each line is the result for one fishing party.

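| Number of Anglers | Party Catch | Sufficient Life Jackets? |
|---|---|---|
| 1 | 1 | yes |
| 3 | 1 | yes |
| 1 | 2 | yes |
| 1 | 2 | no |
| 3 | 2 | no |
| 3 | 1 | yes |
| 1 | 0 | no |
| 1 | 0 | no |
| 1 | 1 | yes |
| 1 | 0 | yes |
| 2 | 0 | yes |
| 1 | 1 | yes |
| 2 | 0 | yes |
| 1 | 2 | yes |
| 3 | 3 | yes |
| 1 | 0 | no |
| 1 | 0 | yes |
| 2 | 0 | yes |
| 3 | 1 | yes |
| 1 | 0 | yes |
| 2 | 0 | yes |
| 1 | 1 | yes |
| 1 | 0 | yes |
| 1 | 0 | yes |
| 1 | 0 | no |
| 2 | 0 | yes |
| 2 | 1 | no |
| 1 | 1 | no |
| 1 | 0 | yes |
| 1 | 0 | yes |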

What is the population of interest?

The population of interest is NOT the fish in the lake. The Fisheries Department is not interested in estimating the characteristics of the fish, such as mean fish weight or the number of fish in the lake. Rather, the focus is on the anglers and fishing parties. Refer to the FAQ at the end of the chapter for more details.

It would be tempting to conclude that the anglers on the lake are the population of interest. However, note that information is NOT gathered on individual anglers. For example, the number of fish captured by each angler in the party is not recorded - only the total fish caught by the party. Similarly, it is impossible to say if each angler had an individual life jacket - if there were 3 anglers in the boat and only two life jackets, which angler was without?¹

¹ If data were collected on individual anglers, then the anglers could be taken as the population of interest. However, in this case, the design is NOT a simple random sample of anglers. Rather, as you will see later in the course, the design is a cluster sample where a simple random sample of clusters (boats) was taken and all members of the cluster (the anglers) were interviewed. As you will see later in the course, a cluster sample can be viewed as a simple random sample if you define the population in terms of clusters.

For this reason, the population of interest is taken to be the set of boats fishing at this lake. The fisheries agency doesn't really care about the individual anglers because if a boat with 3 anglers catches one fish, the actual person who caught the fish is not recorded. Similarly, if there are only two life jackets, does it matter which angler didn't have the jacket?

Under this interpretation, the design is a simple random sample of boats returning to the landing.

What is the frame?

The frame for a simple random sample is a listing of ALL the units in the population. This list is then used to randomly select which units will be measured. In this case, there is no physical list and the frame is conceptual. A random number table was used to decide which fishing parties to interview.

What is the sampling design and sampling unit?

The sampling design will be treated as if it were a simple random sample from all boats (fishing parties) returning, but in actual fact was likely a systematic sample or variant. As you will see later, this may or may not be a problem.

In many cases, special attention should be paid to identify the correct sampling unit. Here the sampling unit is a fishing party or boat, i.e. the boats were selected, not individual anglers. The mistake of treating the individual anglers as the sampling units is often made when the data are presented on an individual basis rather than on a sampling unit basis. As you will see in later chapters, this is an example of pseudo-replication.

Excel analysis

As mentioned earlier, Excel should be used with caution in statistical analysis. However, for very simple surveys, it is an adequate tool.

A copy of a sample Excel worksheet called creel is available in the AllofData workbook in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

Here is a condensed view of the spreadsheet within the workbook:

The analysis proceeds in a series of logical steps as illustrated for the number of anglers in each party variable.

Enter the data on the spreadsheet

The metadata (information about the survey) is entered at the top of the spreadsheet.

The actual data is entered in the middle of the sheet. One row is used to list the variables recorded for each angling party.

Obtain the required summary statistics.

At the bottom of the data, the summary statistics needed are computed using the Excel built-in functions. This includes the sample size, the sample mean, and the sample standard deviation.

Obtain estimates of the population quantity

Because the sample mean is the estimator for the population mean if the design is a simple random sample, no further computations are needed.

In order to estimate the total number of anglers, we multiply the average number of anglers in each fishing party (1.533 anglers/party) by the POPULATION SIZE (the number of fishing parties for the entire day = 168) to get the estimated total number of anglers (257.6).

Obtain estimates of precision - standard errors

The se for the sample mean is computed using the formula presented earlier. The estimated standard error OF THE MEAN is 0.128 anglers/party.

Because we found the estimated total by multiplying the estimate of the mean number of anglers/boat trip by the number of boat trips (168), the estimated standard error of the POPULATION TOTAL is found by multiplying the standard error of the sample mean by the same value, 0.128 × 168 = 21.5 anglers.

Hence, a 95% confidence interval for the total number of anglers fishing this day is found as 257.6 ± 2(21.5).

Estimating total catch

A similar procedure is followed in the next column to estimate the total catch.

Estimating proportion of parties with sufficient life-jackets

First, the character values yes/no are translated into 0,1 variables using the IF statement of Excel.

Then the EXACT same formula as used for estimating the total number of anglers or the total catch is applied to the 0,1 data!

We estimate that 73.3% of boats have sufficient life-jackets with a se of 7.4 percentage points.
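The spreadsheet steps can also be reproduced outside Excel. The Python sketch below retypes the 30 parties from the raw data listing and applies the SRSWOR formulae; the helper name `mean_and_se` is made up for this illustration and is not part of the Sample Program Library.

```python
import math

N = 168   # fishing parties returning to the landing that day
anglers = [1, 3, 1, 1, 3, 3, 1, 1, 1, 1,
           2, 1, 2, 1, 3, 1, 1, 2, 3, 1,
           2, 1, 1, 1, 1, 2, 2, 1, 1, 1]
catch   = [1, 1, 2, 2, 2, 1, 0, 0, 1, 0,
           0, 1, 0, 2, 3, 0, 0, 0, 1, 0,
           0, 1, 0, 0, 0, 0, 1, 1, 0, 0]
jackets = ["yes", "yes", "yes", "no", "no", "yes", "no", "no", "yes", "yes",
           "yes", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes",
           "yes", "yes", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]

def mean_and_se(values, N):
    """Sample mean and its estimated se (with fpc) for a SRSWOR from N units."""
    n = len(values)
    ybar = sum(values) / n
    s2 = sum((y - ybar) ** 2 for y in values) / (n - 1)
    return ybar, math.sqrt(s2 / n * (1 - n / N))

enough = [1 if j == "yes" else 0 for j in jackets]   # same recoding as the Excel IF

for name, values in [("anglers", anglers), ("catch", catch), ("enough", enough)]:
    mean, se = mean_and_se(values, N)
    print(f"{name}: mean={mean:.3f} se={se:.3f} "
          f"total={N * mean:.1f} se_total={N * se:.1f}")
```

Carried at full precision, the se of the total number of anglers comes out slightly higher than the 21.5 quoted above, because the text multiplies the rounded value 0.128 by 168.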

SAS analysis

SAS (Version 8 or higher) has procedures for analyzing survey data. Copies of the sample SAS program called creel.sas and the output called creel.lst are available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

The program starts with the Data step that reads in the data and creates the metadata so that the purpose of the program and how the data were collected etc. are not lost.

    data creel;   /* read in the survey data */
       infile 'creel.csv' dlm=',' dsd missover;
       input angler catch lifej $;
       enough = 0;
       if lifej = 'yes' then enough = 1;
    run;

The first section of code reads the data and computes the 0,1 variable from the life-jacket information. A Proc Print (not shown) lists the data so that it can be verified that it was read correctly.

Most programs for dealing with survey data require that sampling weights be available for each observation.

    data creel;
       set creel;
       sampweight = 168/30;
    run;

A sampling weight is the weighting factor representing how many units in the population this observation represents. In this case, each of the 30 parties represents 168/30 = 5.6 parties in the population. The sampling weights need not be specified if totals are not being estimated (only means); SAS then assigns equal weight to each observation, which is appropriate for a simple random sample.
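To see what the weights do, the weighted sum can be compared with the expansion estimator N times the sample mean. The lines below are a sketch in Python rather than SAS, using the number of anglers per party retyped from the raw data; the variable names are illustrative only.

```python
N, n = 168, 30
weight = N / n            # each sampled party represents 5.6 parties

# Number of anglers in each of the 30 sampled parties (from the raw data listing).
anglers = [1, 3, 1, 1, 3, 3, 1, 1, 1, 1,
           2, 1, 2, 1, 3, 1, 1, 2, 3, 1,
           2, 1, 1, 1, 1, 2, 2, 1, 1, 1]

weighted_total = sum(weight * y for y in anglers)   # sum of weight x value
expansion_total = N * sum(anglers) / n              # N times the sample mean

print(weighted_total, expansion_total)
```

Both routes give the same estimated total, which is why equal weights of N/n are appropriate for a simple random sample.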

Finally, Proc SurveyMeans is used to estimate the quantities of interest.

    proc surveymeans data=creel
          total=168     /* total population size */
          mean clm      /* find estimates of mean, its se, and a 95% confidence interval */
          sum clsum     /* find estimates of total, its se, and a 95% confidence interval */
          ;
       var angler catch lifej;   /* estimate mean and total for numeric variables, proportions for char variables */
       weight sampweight;
       /* Note that it is not necessary to use the coded 0/1 variables in this procedure */
       ods output statistics=creelresults;
    run;

It is not necessary to code any formula as these are built into the SAS procedure. So how does the SAS program know this is a simple random sample? This is the default analysis - more complex designs require additional statements (e.g. a CLUSTER statement) to indicate a more complex design. As well, equal sampling weights indicate that all items were selected with equal probability.

Here are portions of the SAS output

| Variable Name | Variable Level | Mean | SE Mean | LCL Mean | UCL Mean | Sum | SE Sum | LCL Sum | UCL Sum |
|---|---|---|---|---|---|---|---|---|---|
| angler | | 1.533 | 0.128 | 1.272 | 1.795 | 257.600 | 21.496 | 213.635 | 301.565 |
| catch | | 0.667 | 0.139 | 0.382 | 0.951 | 112.000 | 23.382 | 64.177 | 159.823 |
| lifej | Suff.Jac | 0.032 | 0.029 | 0.000 | 0.092 | 5.600 | 5.057 | 0.000 | 15.928 |
| lifej | no | 0.258 | 0.072 | 0.111 | 0.405 | 44.800 | 12.524 | 19.223 | 70.377 |
| lifej | yes | 0.710 | 0.075 | 0.557 | 0.863 | 123.200 | 12.992 | 96.667 | 149.733 |

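The results for the angler and catch variables match those from the Excel spreadsheet. The lifej rows do not quite match (0.710 rather than 0.733 for yes): the extra level Suff.Jac suggests that the header line of creel.csv appears to have been read as a data line, so that the proportions are based on 31 lines rather than 30. Adding the firstobs=2 option to the infile statement should skip the header line.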

3.5 Sample size determination for a simple random sample

I cannot emphasize too strongly the importance of planning in advance of the survey.

There are many surveys where the results are disappointing. For example, a survey of anglers may show that the mean catch per angler is 1.3 fish but that the standard error is .9 fish. In other words, a 95% confidence interval stretches from 0 to over 3 fish per angler, something that is known with near certainty even before the survey was conducted. In many cases, a back of the envelope calculation would have shown that the precision obtained from a survey would be inadequate at the proposed sample size even before the survey was started.

In order to determine the appropriate sample size, you will need to first specify some measure of precision that is required to be obtained. For example, a policy decision may require that the results be accurate to within 5% of the population value.

This precision requirement usually occurs in one of two formats:

• an absolute precision, i.e. you wish to be 95% confident that the sample mean will not vary from the population mean by more than a pre-specified amount. For example, a 95% confidence interval for the total number of fish captured should be ± 1,000 fish.

• a relative precision, i.e. you wish to be 95% confident that the sample mean will be within 10% of the population mean.

The latter is more common than the former, but both are equivalent and interchangeable. For example, if the actual estimate is around 200, with a se of about 50, then the 95% confidence interval is ± 100 and the relative precision is within 50% of the population answer (± 100 / 200). Conversely, a 95% confidence interval that is within ± 40% of the estimate of 200 turns out to be ± 80 (40% of 200), and consequently, the se is around 40 (= 80/2).

A common question is:

What is the difference between se/est and 2se/est? When is the relative standard error divided by 2? Does se/est have anything to do with a 95% ci?

Precision requirements are stated in different ways (replace blah below by mean/total/proportion etc).

| Expression | Mathematics |
|---|---|
| within xxx of the blah | se = xxx |
| margin of error of xxx | 2se = xxx |
| within xxx of the population value 19 times out of 20 | 2se = xxx |
| within xxx of the population value 95% of the time | 2se = xxx |
| the width of the 95% confidence interval is xxx | 4se = xxx |
| within 10% of the blah | se/est = .10 |
| a rse of 10% | se/est = .10 |
| a relative error of 10% | se/est = .10 |
| within 10% of the blah 95% of the time | 2se/est = .10 |
| within 10% of the blah 19 times out of 20 | 2se/est = .10 |
| margin of error of 10% | 2se/est = .10 |
| width of 95% confidence interval = 10% of the blah | 4se/est = .10 |

As a rough rule of thumb, the following are often used as survey precision guidelines:

• For preliminary surveys, the 95% confidence interval should be ± 50% of the estimate. This implies that the target rse is 25%.

• For management surveys, the 95% confidence interval should be ± 25% of the estimate. This implies that the target rse is 12.5%.

• For scientific work, the 95% confidence interval should be ± 10% of the estimate. This implies that the target rse is 5%.

Next, some preliminary guess for the standard deviation of individual items in the population (S) needs to be made, along with an estimate of the population size (N) and possibly the population mean (µ) or population total (τ). The precise values are not too crucial and can be obtained by:

• taking a pilot study.

• previous sampling of similar populations

• expert opinion

A very rough estimate of the standard deviation can be found by taking the usual range of the data/4. If the population proportion is unknown, the value of 0.5 is often used as this leads to the largest sample size requirement as a conservative guess.

These are then used with the formulae for the confidence interval to determine the relevant sample size. Many textbooks have complicated formulae to do this. It is much easier these days to simply code the formulae in a spreadsheet (see examples) and use either trial and error to find an appropriate sample size, or use the Goal Seek feature of the spreadsheet to find the appropriate sample size. This will be illustrated in the example.

As an approximate answer, recall that the se usually varies inversely with $\sqrt{n}$. Suppose that the present rse is .075. A rse of 5% is smaller by a factor of .075/.05 = 1.5, which will require an increase in the sample size by a factor of $1.5^2 = 2.25$.
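The scaling argument can be written out as a short derivation. This is a sketch that assumes the fpc is negligible, so that the se is proportional to $1/\sqrt{n}$; the subscripts old and new are introduced here only for illustration.

```latex
\frac{rse_{new}}{rse_{old}} = \sqrt{\frac{n_{old}}{n_{new}}}
\quad\Longrightarrow\quad
\frac{n_{new}}{n_{old}} = \left(\frac{rse_{old}}{rse_{new}}\right)^{2}
= \left(\frac{0.075}{0.05}\right)^{2} = 1.5^{2} = 2.25
```

When the sampling fraction is large the fpc also changes with n, so this rule of thumb will overstate the increase needed.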

If the raw data are available, you can also do a “bootstrap” selection (with replacement) to investigate the effect of sample size upon the se. For each different bootstrap sample size, estimate the parameter and the rse, and then increase the sample size until the required rse is obtained. This is relatively easy to do in SAS using Proc SurveySelect, which can select samples of arbitrary size. In some packages, such as JMP, sampling is without replacement so a direct sampling of 3x the observed sample size is not possible. In this case, create a pseudo-data set by pasting 19 copies of the raw data after the original data. Then use the Table→Subset→Random Sample Size to get the approximate bootstrap sample. Again compute the estimate and its rse, and increase the sample size until the required precision is obtained.
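The bootstrap idea can be sketched in Python using the party catch values from the creel survey. This is an illustration under stated assumptions, not the SAS or JMP procedure described above: the function name `bootstrap_rse`, the number of replicates, and the seed are arbitrary choices, and the rse is estimated from the spread of the resampled means, so the fpc is ignored.

```python
import random
import statistics

# Party catch for the 30 sampled parties (from the raw data listing).
catch = [1, 1, 2, 2, 2, 1, 0, 0, 1, 0,
         0, 1, 0, 2, 3, 0, 0, 0, 1, 0,
         0, 1, 0, 0, 0, 0, 1, 1, 0, 0]

def bootstrap_rse(data, size, reps=500, seed=1):
    """Relative se of the mean for resamples (with replacement) of a given size."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.choices(data, k=size)) for _ in range(reps)]
    return statistics.stdev(means) / statistics.mean(means)

# Increase the resample size until the rse reaches the required level.
for size in (30, 60, 120, 240):
    print(size, round(bootstrap_rse(catch, size), 3))
```

Quadrupling the resample size should roughly halve the rse, which matches the square-root rule given earlier.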

The final sample size is not to be treated as the exact sample size but more as a guide to the amount of effort that needs to be expended. Remember that “guesses” are being made for the standard deviation, the required precision, the approximate value of the estimate etc. Consequently, there really isn't a defensible difference between a required sample size of 30 and 40. What really is of interest is the order of magnitude of effort required. For example, if your budget allows for a sample size of 20, and the sample size computations show that a sample size of 200 is required, then doing the survey with a sample size of 20 is a waste of time and money. If the required sample size is about 30, then you may be ok with an actual sample size of 20.

If more than one item is being surveyed, these calculations must be done for each item. The largest sample size needed is then chosen. This may lead to conflict, in which case some response items must be dropped or a different sampling method must be used for this other response variable.

Precision essentially depends only on the absolute sample size, not the relative fraction of the population sampled. For example, a sample of 1000 people taken from Canada (population of 33,000,000) is just as precise as a sample of 1000 people taken from the US (population of 330,000,000)! This is highly counter-intuitive and will be explored more in class.

3.5.1 Example - How many angling-parties to survey

We wish to repeat the angler creel survey next year.

• How many angling-parties should be interviewed to be 95% confident of being within 10% of the population mean catch?

• What sample size would be needed to estimate the proportion of boats with sufficient life-jackets to within 3 percentage points 19 times out of 20? In this case we are asking that the 95% confidence interval be ± 0.03 or that the se = 0.015.

The sample size spreadsheet is available in an Excel workbook called SurveySampleSize.xls which can be downloaded from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

A SAS program to compute sample size is also available, but in my opinion, is neither user-friendly nor as flexible for the general user. The code and output are also available in the Sample Program Library referred to above.

Here is a condensed view of the spreadsheet:

First note that the computations for sample size require some PRIOR information about population size, the population mean, or the population proportion. We will use information from the previous survey to help plan future studies.

For example, about 168 boats returned to the landing last year (of which 30 were interviewed). The mean catch per angling party was about .667 fish/boat. The standard deviation of the catch per party was .844. These values are entered in the spreadsheet in column C.

A preliminary sample size of 40 (in green in Column C) was tried. This led to a 95% confidence interval of ± 35% which did not meet the precision requirements.

Now vary the sample size (in green) in column C until the 95% confidence interval (in yellow) is below ± 10%. You will find that you will need to interview almost 135 parties - a very high sampling fraction indeed. The problem for this variable is the very high variation of individual data points.

If you are familiar with Excel, you can use the Goal Seek feature to speed the search.

Similarly, the proportion of boats with sufficient life-jackets last year was around 73%. Enter this in the blue areas of Column E. The initial sample size of 20 is too small as the 95% confidence interval is ± .186 (18.6 percentage points). Now vary the sample size (in green) until the 95% confidence interval is ± .03. Note that you need to be careful in dealing with percentages - confidence limits are often specified in terms of percentage points rather than percents to avoid problems where percents are taken of percents. This will be explained further in class.

Try using the spreadsheet to compare the precision of a poll of 1000 people taken from Canada (population 33,000,000) and 1000 people taken from the US (population 330,000,000) if both polls have about 40% in favor of some issue.
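The poll comparison can also be sketched directly in Python instead of the spreadsheet. The function name `se_proportion` is made up for this illustration; it applies the se formula for a proportion from the summary table, including the fpc.

```python
import math

def se_proportion(p, n, N):
    """Estimated se of a sample proportion from a SRSWOR of n units out of N."""
    return math.sqrt(p * (1 - p) / (n - 1) * (1 - n / N))

p, n = 0.40, 1000
se_canada = se_proportion(p, n, N=33_000_000)
se_us = se_proportion(p, n, N=330_000_000)

print(f"Canada: {se_canada:.6f}   US: {se_us:.6f}")
```

The two standard errors differ only far beyond the decimal places that would ever be reported, because the fpc is essentially 1 in both cases.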

Technical notes

If you really want to know how the sample size numbers are determined, here is the lowdown.

Suppose that you wish to be 95% sure that the sample mean is within 10% of the population mean.

We must solve

$$z \frac{S}{\sqrt{n}} \sqrt{\frac{N-n}{N}} \le \varepsilon\mu$$

for n, where z is the term representing the multiplier for a particular confidence level (for a 95% c.i. use z = 2) and ε is the ‘closeness’ factor (in this case ε = 0.10).

Rearranging this equation gives

$$n = \frac{N}{1 + N\left(\frac{\varepsilon\mu}{zS}\right)^2}$$
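The rearranged formula can be coded in a few lines. The Python sketch below uses last year's creel values as the prior guesses; the function name `sample_size_srs` is an assumption of this example, and the proportion case takes S = sqrt(p(1 − p)) with an absolute margin of 0.03, which is one reasonable reading of the second question rather than the spreadsheet's exact computation.

```python
import math

def sample_size_srs(N, S, margin, z=2.0):
    """Real-valued n solving z * S/sqrt(n) * sqrt((N - n)/N) = margin; round up in practice."""
    d = (margin / (z * S)) ** 2
    return N / (1 + N * d)

# Mean catch: within 10% of the mean, so the margin is eps * mu.
N, S, mu, eps = 168, 0.844, 0.667, 0.10
n_mean = sample_size_srs(N, S, margin=eps * mu)

# Proportion with sufficient life-jackets: within 3 percentage points, p about 0.73.
p = 0.73
n_prop = sample_size_srs(N, math.sqrt(p * (1 - p)), margin=0.03)

print(math.ceil(n_mean), math.ceil(n_prop))
```

For the mean catch this gives a sample size in the low 130s, in line with the "almost 135" parties found by trial and error in the spreadsheet.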

3.6 Systematic sampling

Sometimes, logistical considerations make a true simple random sample not very convenient to administer. For example, in the previous creel survey, a true random sample would require that a random number be generated for each boat returning to the marina. In such cases, a systematic sample could be used to select elements. For example, every 5th angler could be selected after a random starting point.

3.6.1 Advantages of systematic sampling

The main advantages of systematic sampling are:

• it is easier to draw units because only one random number is chosen

• it can be used when a sampling frame is not available but there is a convenient method of selecting items, e.g. the creel survey where every 5th angler is chosen.

• easier instructions for untrained staff

• if the population is in random order relative to the variable being measured, the method is equivalent to a SRS. For example, it is unlikely that the number of anglers in each boat changes dramatically over the period of the day. This is an important assumption that should be investigated carefully in any real life situation!

• it distributes the sample more evenly over the population. Consequently if there is a trend, you will get items selected from all parts of the trend.

3.6.2 Disadvantages of systematic sampling

The primary disadvantages of systematic sampling are:

• Hidden periodicities or trends may cause biased results. In such cases, estimates of mean and standard errors may be severely biased! See Section 4.2.2 for a detailed discussion.

• Without making an assumption about the distribution of population units, there is no estimate of the standard error. This is an important disadvantage of a systematic sample! Many studies very casually make the assumption that the systematic sample is equivalent to a simple random sample without much justification for this.

3.6.3 How to select a systematic sample

There are several methods, depending on whether you know the population size, etc. Suppose we need to choose every kth record, where k is chosen to meet sample size requirements (an example of choosing k will be given in class). All of the following methods are equivalent if k divides N exactly. These are the two most common methods.

• Method 1 Choose a random number j from 1 · · · k. Then choose the j, j+k, j+2k, · · · records. One problem is that different samples may be of different size - an example will be given in class where k doesn't divide N exactly. This causes problems in sampling theory, but is not too much of a problem if n is large.

• Method 2 Choose a random number from 1 · · · N. Choose every kth item and continue in a circle when you reach the end until you have selected n items. This will always give you the same sized sample; however, it requires knowledge of N.
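The two selection methods can be sketched as follows. This is an illustration only: the function names, the seed, and the sizes N = 23, k = 5, n = 5 are hypothetical, chosen so that k does not divide N exactly. Records are numbered 1 to N.

```python
import random

def systematic_linear(N, k, rng):
    """Method 1: random start j in 1..k, then every k-th record up to N."""
    j = rng.randint(1, k)
    return list(range(j, N + 1, k))

def systematic_circular(N, k, n, rng):
    """Method 2: random start in 1..N, every k-th record, wrapping around past N.

    Assumes (n - 1) * k < N so that no record is selected twice.
    """
    start = rng.randint(1, N)
    return [(start - 1 + i * k) % N + 1 for i in range(n)]

rng = random.Random(2019)        # arbitrary seed so the example is repeatable
N, k, n = 23, 5, 5               # hypothetical sizes where k does not divide N

print(systematic_linear(N, k, rng))        # 4 or 5 records, depending on the start
print(systematic_circular(N, k, n, rng))   # always n records
```

With these sizes Method 1 returns 5 records for starts 1 to 3 and only 4 records for starts 4 and 5, which is the unequal sample size problem mentioned above.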

3.6.4 Analyzing a systematic sample

Most surveys casually assume that the population has been sorted in random order when the systematic sample was selected and so treat the results as if they had come from a SRSWOR. This is theoretically not correct and if your assumption is false, the results may be biased, and there is no way of examining the biases from the data at hand.

Before implementing a systematic survey or analyzing a systematic survey, please consult with an expert in sampling theory to avoid problems. This is a case where an hour or two of consultation before spending lots of money could potentially turn a survey where nothing can be estimated into a survey that has justifiable results.

3.6.5 Technical notes - Repeated systematic sampling

To avoid many of the potential problems with systematic sampling, a common device is to use repeated systematic samples on the same population.

For example, rather than taking a single systematic sample of size 100 from a population, you can take 4 systematic samples (with different starting points) of size 25.

An empirical method of obtaining a standard error from a systematic sample is to use repeated systematic sampling. Rather than choosing one systematic sample of every kth unit, choose m independent systematic subsamples, each of size n/m. Then estimate the mean of each systematic subsample. Treat these means as a simple random sample from the population of possible systematic samples and use the usual sampling theory. The variation of the estimate among the systematic subsamples provides an estimate of the standard error (after an appropriate adjustment). This will be illustrated in an example.
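A minimal sketch of how the replicated systematic samples might be drawn is given below. The function name is hypothetical, units are assumed to be numbered 1 to N, and the size of each subsample is assumed to divide N exactly, so that the possible systematic samples are the k = N/(subsample size) possible starting points, from which m are chosen without replacement.

```python
import random

def replicated_systematic(N, m, per_set, rng=random):
    """Draw m systematic subsamples, each of `per_set` units, from units 1..N.
    Assumes per_set divides N exactly."""
    k = N // per_set                          # sampling interval = number of possible sets
    starts = rng.sample(range(1, k + 1), m)   # m distinct random starts (SRSWOR of sets)
    return [list(range(j, N + 1, k)) for j in sorted(starts)]

# e.g. 10 sets of 10 units each from a population of 1000 units
sets = replicated_systematic(N=1000, m=10, per_set=10)
for s in sets:
    print(s)
```

Because the starting points are a simple random sample from the k possible starts, the set means can be treated as a SRSWOR from the population of possible systematic samples, as described above.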

Example of replicated subsampling within a systematic sample

A yearly survey has been conducted in the Prairie Provinces to estimate the number of breeding pairs of ducks. One breeding area has been divided into approximately 1000 transects of a certain width, i.e. the breeding area was divided into 1000 strips.

What is the population of interest? As noted in class, the definition of a population depends, in part, upon the interest of the researcher. Two possible definitions are:

• The population is the set of individual ducks on the study area. However, no frame exists for the individual birds. But a frame can be constructed based on the 1000 strips that cover the study area. In this case, the design is a cluster sample, with the clusters being strips.

• The population consists of the 1000 strips that cover the study area and the number of ducks in each strip is the response variable. The design is then a simple random sample of the strips.

In either case, the analysis is exactly the same and the final estimates are exactly the same.

Approximately 100 of the transects are flown by an aircraft and spotters on the aircraft count the number of breeding pairs visible from the aircraft.

For administrative convenience, it is easier to conduct systematic sampling. However, there is structure to the data; it is well known that ducks do not spread themselves randomly throughout the breeding area. After discussions with our Statistical Consulting Service, the researchers flew 10 sets of replicated systematic samples; each set consisted of 10 transects. As each transect is flown, the scientists also classify each transect as ‘prime’ or ‘non-prime’ breeding habitat.

Here is the raw data reporting the number of nests in each set of 10 transects:


Set        Prime habitat     Non-prime habitat   All       Prime     Non-prime   Diff
           Total     n       Total     n         Total     mean      mean
           (b)                                   (a)       (c)       (d)         (e)
1          123       3       345       7         468       41.0      49.3        -8.3
2          57        2       36        8         93        28.5      4.5         24.0
3          85        5       46        5         131       17.0      9.2         7.8
4          97        2       131       8         228       48.5      16.4        32.1
5          34        5       43        5         77        6.8       8.6         -1.8
6          85        3       67        7         152       28.3      9.6         18.8
7          56        7       64        3         120       8.0       21.3        -13.3
8          46        2       65        8         111       23.0      8.1         14.9
9          37        4       43        6         80        9.3       7.2         2.1
10         93        2       104       8         197       46.5      13.0        33.5

Avg        71.3                                  165.7                           10.97
s          29.5                                  117.0                           16.38
n          10                                    10                              10

Est total  7130                                  16570               Est mean    10.97
Est se     885                                   3510                Est se      4.91

Several different estimates can be formed.

1. Total number of nests in the breeding area (refer to column (a) above). The total number of nests in the breeding area for all types of habitat is of interest. Column (a) in the above table is the data that will be used. It represents the total number of nests in the 10 transects of each set.

The principle behind the estimator is that the 1000 total transects can be divided into 100 sets of 10 transects, of which a random sample of size 10 was chosen. The sampling unit is the set of transects; the individual transects are essentially ignored.

Note that this method assumes that the systematic samples are all of the same size. If the systematic samples had been of different sizes (e.g. some sets had 15 transects, other sets had 5 transects), then a ratio-estimator (see later sections) would have been a better estimator.

• Compute the total number of nests for each set. This is found in column (a).

• Then the sets selected are treated as a SRSWOR sample of size 10 from the 100 possible sets. An estimate of the mean number of nests per set of 10 transects is found as

$$\hat{\mu} = \frac{468 + 93 + \cdots + 197}{10} = 165.7$$

with an estimated se of

$$se(\hat{\mu}) = \sqrt{\frac{s^2}{n}\left(1 - \frac{n}{100}\right)} = \sqrt{\frac{117.0^2}{10}\left(1 - \frac{10}{100}\right)} = 35.1$$

• The average number of nests per set is expanded to cover all 100 sets: $\hat{\tau} = 100\hat{\mu} = 16570$ and $se(\hat{\tau}) = 100\,se(\hat{\mu}) = 3510$.

2. Total number of nests in the prime habitat only (refer to column (b) above). This is formed in exactly the same way as the previous estimate. This is technically known as estimation in a domain. The number of elements in the domain in the whole population (i.e. how many of the 1000 transects are in prime habitat) is unknown but is not needed. All that you need is the total number of nests in prime habitat in each set; you essentially ignore the non-prime habitat transects within each set.

The average number of nests per set in prime habitats is found as before:

$$\hat{\mu} = \frac{123 + \cdots + 93}{10} = 71.3$$

with an estimated se of

$$se(\hat{\mu}) = \sqrt{\frac{s^2}{n}\left(1 - \frac{n}{100}\right)} = \sqrt{\frac{29.5^2}{10}\left(1 - \frac{10}{100}\right)} = 8.85$$

• Because there are 100 sets of transects in total, the estimate of the population total number of nests in prime habitat and its estimated se is $\hat{\tau} = 100\hat{\mu} = 7130$ with $se(\hat{\tau}) = 100\,se(\hat{\mu}) = 885$.

• Note that the total number of transects of prime habitat is not known for the population and so an estimate of the density of nests in prime habitat cannot be computed from this estimated total. However, a ratio-estimator (see later in the notes) could be used to estimate the density.

3. Difference in mean density between prime and non-prime habitats. The scientists suspect that the density of nests is higher in prime habitat than in non-prime habitat. Is there evidence of this in the data? (Refer to columns (c)-(e) above.) Here everything must be transformed to the density of nests per transect (assuming that the transects were all the same size). Also, pairing (refer to the section on experimental design) is taking place, so a difference must be computed for each set and the differences analyzed, rather than trying to treat the prime and non-prime habitats as independent samples.

Again, this is an example of what is known as domain-estimation.

• Compute the domain means for type of habitat for each set (columns (c) and (d)). Note that the totals are divided by the number of transects of each type in each set.

• Compute the difference in the means for each set (column (e)).

• Treat this difference as a simple random sample of size 10 taken from the 100 possible sets of transects. What does the final estimated mean difference and se imply?
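The three sets of computations above can be reproduced with a short script. This is a sketch only; the helper name `srs_mean_se` is hypothetical, and the data are the per-set totals and transect counts from the table above.

```python
from math import sqrt
from statistics import mean, stdev

# (prime total, prime n, non-prime total, non-prime n) for each of the 10 sets
sets = [(123, 3, 345, 7), (57, 2, 36, 8), (85, 5, 46, 5), (97, 2, 131, 8),
        (34, 5, 43, 5), (85, 3, 67, 7), (56, 7, 64, 3), (46, 2, 65, 8),
        (37, 4, 43, 6), (93, 2, 104, 8)]
N_SETS = 100  # number of possible sets of 10 transects

def srs_mean_se(values, N):
    """Mean and se for a SRSWOR of len(values) units from N units."""
    n = len(values)
    se = sqrt(stdev(values) ** 2 / n * (1 - n / N))
    return mean(values), se

# 1. all habitat, 2. prime habitat only, 3. difference in per-transect means
all_mean, all_se = srs_mean_se([p + q for p, _, q, _ in sets], N_SETS)
prime_mean, prime_se = srs_mean_se([p for p, _, _, _ in sets], N_SETS)
diff_mean, diff_se = srs_mean_se([p / pn - q / qn for p, pn, q, qn in sets], N_SETS)

print("All nests:   total", N_SETS * all_mean, "se", N_SETS * all_se)
print("Prime nests: total", N_SETS * prime_mean, "se", N_SETS * prime_se)
print("Mean difference in density:", diff_mean, "se", diff_se)
```

In every case the sampling unit is the set of 10 transects, so the same SRSWOR formulas are applied to one number per set.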

3.7 Stratified simple random sampling

A simple modification to a simple random sample can often lead to dramatic improvements in precision. This is known as stratification. All survey methods can potentially benefit from stratification (also known as blocking in the experimental design literature).

Stratification will be beneficial whenever variability in the response variable among the survey units can be anticipated and strata can be formed that are more homogeneous than the original set of survey units.

All stratified designs will have the same basic steps as listed below regardless of the underlying design.

• Creation of strata. Stratification begins by grouping the survey units into homogeneous groups (strata) where survey units within strata should be similar and strata should be different. For example, suppose you wished to estimate the density of animals. The survey region is divided into a large number of quadrats based on aerial photographs. The quadrats can be stratified into high and low quality habitat because it is thought that the density within the high quality quadrats may be similar but different from the density in the low quality habitats. The strata do not have to be physically contiguous; for example, the high quality habitats could be scattered throughout the survey region and can be grouped into one single stratum.


• Determine total sample size. Use the methods in previous sections to determine the total sample size (number of survey units) to select. At this stage, some sort of “average” standard deviation will be used to determine the sample size.

• Allocate effort among the strata. There are several ways to allocate the total effort among the strata.

– Equal allocation. In equal allocation, the total effort is split equally among all strata. Equal allocation is preferred when equally precise estimates are required for each stratum (see footnote 2).

– Proportional allocation. In proportional allocation, the total effort is allocated to the strata in proportion to stratum importance. Stratum importance could be related to stratum size (e.g. when allocating effort among the U.S. and Canada, then because the U.S. is 10 times larger than Canada, more effort should be allocated to surveying the U.S.). But if density is your measure of importance, allocate more effort to higher density strata. Proportional allocation is preferred when more precise estimates are required in more important strata.

– Neyman allocation. Neyman determined that if you also have information on the variability within each stratum, then more effort should be allocated to strata that are more important and more variable to give you the most precise overall estimate for a given sample size. This is rarely performed in ecology because information on intra-stratum variability is often unknown (see footnote 3).

– Cost allocation. In general, effort should be allocated to more important strata, more variable strata, or strata where sampling is cheaper to give the best overall precision for the entire survey. As in the previous allocation method, ecologists rarely have sufficiently detailed cost information to do this allocation method.

• Conduct separate surveys in each stratum. Separate independent surveys are conducted in each stratum. It is not necessary to use the same survey method in all strata. For example, low density quadrats could be surveyed using aerial methods, while high density strata may require ground based methods. Some strata may use simple random samples, while other strata may use cluster samples. Many textbooks show examples where the same survey method is used in all strata, but this is NOT required.

The ability to use different sampling methods in the different strata often leads to substantial cost savings and is a very good reason to use stratified sampling!

• Obtain stratum specific estimates. Use the appropriate estimators to estimate stratum means and the se for EACH stratum. Then expand the estimated mean to get the estimated total (and se) in the usual way.

• Rollup. The individual stratum estimates of the TOTAL are then combined to give an overall Grand Total value for the entire survey region. The se of the Grand Total is found as:

$$se(GT) = \sqrt{se(\hat{\tau}_1)^2 + se(\hat{\tau}_2)^2 + \cdots}$$

Finally, if you want the overall grand average, simply divide the grand total (and its se) by the appropriate divisor.

Stratification is normally carried out prior to the survey (pre-stratification), but can also be done after the survey (post-stratification); refer to a later section for details. Stratification can be used with any type of sampling design. The concepts introduced here deal with stratification applied to simple random samples but are easily extended to more complex designs.
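The first three allocation rules can be sketched as follows. This is an illustration under stated assumptions: stratum size is taken as the measure of importance, Neyman allocation is taken in its usual form (effort proportional to stratum size times stratum standard deviation), the function name is hypothetical, and the stratum sizes and standard deviations are made-up numbers, not data from any survey.

```python
def allocate(n, sizes, sds=None):
    """Split a total sample size n among strata.
    sds=None : proportional allocation, n_h proportional to N_h
    sds given: Neyman allocation, n_h proportional to N_h * S_h
    Returns unrounded allocations; rounding is left to the analyst."""
    weights = sizes if sds is None else [N * S for N, S in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata: sizes N_h and guessed standard deviations S_h
sizes = [600, 300, 100]
sds = [5, 10, 30]
n = 60

equal = [n / len(sizes)] * len(sizes)   # equal allocation
proportional = allocate(n, sizes)       # larger strata get more effort
neyman = allocate(n, sizes, sds)        # larger AND more variable strata get more effort

print(equal, proportional, neyman)
```

In this made-up example the small stratum is so variable that Neyman allocation gives it as much effort as the large stratum, whereas proportional allocation gives it only a tenth of the total.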

Footnote 2: Recall from previous sections that the absolute sample size is one of the drivers for precision.

Footnote 3: However, in many cases, higher means per survey unit are accompanied by greater variances among survey units, so allocations based on stratum means often capture this variation as well.

The advantages of stratification are:

• standard errors of the mean or of the total will be smaller (i.e. estimates are more precise) when compared to the respective standard errors from an unstratified design if the units within strata are more homogeneous (i.e., less variable) compared to the variability over the entire unstratified population.

• different sampling methods may be used in each stratum for cost or convenience reasons. [In the detail below we assume that each stratum has the same sampling method used, but this is only for simplification.] This can often lead to reductions in cost as the most appropriate and cost effective sampling method can be used in each stratum.

• because randomization occurs independently in each stratum, corruption of the survey design due to problems experienced in the field may be confined.

• separate estimates for each stratum with a given precision can be obtained.

• it may be more convenient to take a stratified random sample for administrative reasons. For example, the strata may refer to different district offices.

3.7.1 A visual comparison of a simple random sample vs. a stratified simple random sample

You may find it useful to compare a simple random sample of 24 vs. a stratified random sample of 24 using the following visual plans:

Select a sample of 24 in each case.


Simple Random Sampling

Describe how the sample was taken.


Stratified Simple Random Sampling

Suppose that there is a gradient in habitat quality across the population. Then a more efficient (i.e. leading to smaller standard errors) sampling design is a stratified design.

Three strata are defined, consisting of the first 3 rows, the next 5 rows, and finally, the last two rows. In many cases, the same sample design is used in all strata. For example, suppose it was decided to conduct a simple random sample within each stratum, with sample sizes of 8, 10, and 6 in the three strata respectively. [The decision process on allocating samples to strata will be covered later.]


Stratified Sampling with a different method in each stratum

It is quite possible, and often desirable, to use different methods in the different strata. For example, it may be more efficient to survey desert areas using a fixed-wing aircraft, while ground surveys need to be used in heavily forested areas.

For example, consider the following design. In the first (top most) stratum, a simple random sample was taken; in the second stratum a cluster sample was taken; in the third stratum a cluster sample (via transects) was also taken.


3.7.2 Notation

Common notation is to use h as a stratum index and i or j as unit indices within each stratum.

Characteristic        Population quantities                                       Sample quantities
Number of strata      $H$                                                         $H$
Stratum sizes         $N_1, N_2, \ldots, N_H$                                     $n_1, n_2, \ldots, n_H$
Population units      $Y_{hj}$, $h = 1, \ldots, H$, $j = 1, \ldots, N_h$          $y_{hj}$, $h = 1, \ldots, H$, $j = 1, \ldots, n_h$
Stratum totals        $\tau_h$                                                    $y_h$
Stratum means         $\mu_h$                                                     $\bar{y}_h$
Population total      $\tau = N \sum_{h=1}^{H} W_h \mu_h$ where $W_h = N_h/N$
Population mean       $\mu = \sum_{h=1}^{H} W_h \mu_h$
Variance              $S_h^2$                                                     $s_h^2$
Standard deviation    $S_h$                                                       $s_h$

3.7.3 Summary of main results

It is assumed that from each stratum, a SRSWOR of size $n_h$ is selected independently of ALL OTHER STRATA!

The results below summarize the computations, which can be more easily thought of as occurring in four steps:

1. Compute the estimated mean and its se for each stratum. In this chapter, we use a SRS design in each stratum, but it is not necessary to use this design in a stratum and each stratum could have a different design. In the case of an SRS, the estimate of the mean for each stratum is found as:

$$\hat{\mu}_h = \bar{y}_h$$

with associated standard error:

$$se(\hat{\mu}_h) = \sqrt{\frac{s_h^2}{n_h}(1 - f_h)}$$

where the subscript h refers to each stratum and $f_h = n_h/N_h$ is the sampling fraction in stratum h.

2. Compute the estimated total and its se for each stratum. In many cases this is simply the estimated mean for the stratum multiplied by the STRATUM POPULATION size. In the case of an SRS in each stratum this gives:

$$\hat{\tau}_h = N_h \times \hat{\mu}_h = N_h \times \bar{y}_h$$

$$se(\hat{\tau}_h) = N_h \times se(\hat{\mu}_h) = N_h \times \sqrt{\frac{s_h^2}{n_h}(1 - f_h)}$$

3. Compute the grand total and its se over all strata. This is the sum of the individual totals. The se is computed in a special way.

$$\hat{\tau} = \hat{\tau}_1 + \hat{\tau}_2 + \cdots$$

$$se(\hat{\tau}) = \sqrt{se(\hat{\tau}_1)^2 + se(\hat{\tau}_2)^2 + \cdots}$$


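4. Occasionally, the grand mean over all strata is needed. This is found by dividing the estimated grand total by the total POPULATION size (the sum of the stratum population sizes):

$$\hat{\mu} = \frac{\hat{\tau}}{N_1 + N_2 + \cdots}$$

$$se(\hat{\mu}) = \frac{se(\hat{\tau})}{N_1 + N_2 + \cdots}$$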

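This can be summarized in a succinct form as follows. Note that the stratum weights $W_h$ are formed as $N_h/N$ and are often used to derive weighted means etc.:

Quantity: Mean
   Population value:  $\mu = \sum_{h=1}^{H} W_h \mu_h$
   Estimator:         $\hat{\mu}_{str} = \sum_{h=1}^{H} W_h \bar{y}_h$
   se:                $\sqrt{\sum_{h=1}^{H} W_h^2\, se^2(\bar{y}_h)} = \sqrt{\sum_{h=1}^{H} W_h^2 \frac{s_h^2}{n_h}(1 - f_h)}$

Quantity: Total
   Population value:  $\tau = N \sum_{h=1}^{H} W_h \mu_h$ or $\tau = \sum_{h=1}^{H} \tau_h$ or $\tau = \sum_{h=1}^{H} N_h \mu_h$
   Estimator:         $\hat{\tau}_{str} = N \sum_{h=1}^{H} W_h \bar{y}_h$ or $\hat{\tau}_{str} = \sum_{h=1}^{H} N_h \bar{y}_h$
   se:                $\sqrt{\sum_{h=1}^{H} N_h^2\, se^2(\bar{y}_h)}$ or $\sqrt{\sum_{h=1}^{H} N_h^2 \frac{s_h^2}{n_h}(1 - f_h)}$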

Notes

• The estimator for the grand population mean is a weighted average of the individual stratum means using the POPULATION weights rather than the sample weights. This is NOT the same as the simple unweighted average of all the sample observations unless the $n_h/n$ equal the $N_h/N$; such a design is known as proportional allocation in stratified sampling.

• The estimated standard error for the grand total is found as $\sqrt{se_1^2 + se_2^2 + \cdots + se_H^2}$, i.e. the square root of the sum of the individual $se^2$ of the strata TOTALS.

• The estimators for a proportion are IDENTICAL to that of the mean except replace the variable of interest by 0/1 where 1=character of interest and 0=character not of interest.

• Confidence intervals. Once the se has been determined, the usual ±2se will give approximate 95% confidence intervals if the sample sizes are relatively large in each stratum. If the sample sizes are small in each stratum, some authors suggest using a t-distribution with degrees of freedom determined using a Satterthwaite approximation; this will not be covered in this course.
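The remark about proportional allocation in the first note can be checked with a short derivation. Here $n = n_1 + \cdots + n_H$ denotes the total sample size and $y_{hj}$ the individual observations, as in the notation table; the only assumption is that $n_h/n = N_h/N = W_h$ in every stratum.

```latex
\hat{\mu}_{str}
  = \sum_{h=1}^{H} W_h \bar{y}_h
  = \sum_{h=1}^{H} \frac{n_h}{n} \cdot \frac{1}{n_h} \sum_{j=1}^{n_h} y_{hj}
  = \frac{1}{n} \sum_{h=1}^{H} \sum_{j=1}^{n_h} y_{hj}
```

That is, under proportional allocation the stratified estimator reduces to the ordinary mean of all n observations. With any other allocation the two differ, and the population weights $W_h$ must be used.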

3.7.4 Example - sampling organic matter from a lake

[With thanks to Dr. Rick Routledge for this example].

Suppose that you were asked to estimate the total amount of organic matter suspended in a lake just after a storm. The first scheme that might occur to you could be to cruise around the lake in a haphazard fashion and collect a few sample vials of water which you could then take back to the lab. If you knew the total volume of water in the lake, then you could obtain an estimate of the total amount of organic matter by taking the product of the average concentration in your sample and the total volume of the lake.

The accuracy of your estimate of course depends critically on the extent to which your sample is representative of the entire lake. If you used the haphazard scheme outlined above, you have no way of objectively evaluating the accuracy of the sample. It would be more sensible to take a properly randomized sample. (How might you go about doing this?)

Nonetheless, taking a randomized sample from the entire lake would still not be a totally sensible approach to the problem. Suppose that the lake were to be fed by a single stream, and that most of the organic matter were concentrated close to the mouth of the stream. If the sample were indeed representative, then most of the vials would contain relatively low concentrations of organic matter, whereas the few taken from around the mouth of the stream would contain much higher concentration levels. That is, there is a real potential for outliers in the sample. Hence, confidence limits based on the normal distribution would not be trustworthy.

Furthermore, the sample mean is not as reliable as it might be. Its value will depend critically on the number of vials sampled from the region close to the stream mouth. This source of variation ought to be controlled.

Finally, it might be useful to estimate not just the total amount of organic matter in the entire lake, but the extent to which this total is concentrated near the mouth of the stream.

You can simultaneously overcome all three deficiencies by taking what is called a stratified random sample. This involves dividing the lake into two or more parts called strata. (These are not the horizontal strata that naturally form in most lakes, although these natural strata might be used in a more complex sampling scheme than the one considered here.) In this instance, the lake could be divided into two parts, one consisting roughly of the area of high concentration close to the stream outlet, the other comprising the remainder of the lake.

Then if a simple random sample of fixed size were to be taken from within each of these “strata”, the results could be used to estimate the total amount of organic matter within each stratum. These subtotals could then be added to produce an estimate of the overall total for the lake.

This procedure, because it involves constructing separate estimates for each stratum, permits us to assess the extent to which the organic matter is concentrated near the stream mouth. It also permits the investigator to control the number of vials sampled from each of the two parts of the lake. Hence, the chance variation in the estimated total ought to be sharply reduced. Finally, we shall soon see that the confidence limits that one can construct are free of the outlier problem that invalidated the confidence limits based on a simple random sampling scheme.

A randomized sample is to be drawn independently from within each stratum.

How can we use the results of a stratified random sample to estimate the overall total? The simplest way is to construct an estimate of the totals within each of the strata, and then to sum these estimates. A sensible estimate of the average within the h’th stratum is $\bar{y}_h$. Hence, a sensible estimate of the total within the h’th stratum is $\hat{\tau}_h = N_h \bar{y}_h$, and the overall total can be estimated by

$$\hat{\tau} = \sum_{h=1}^{H} \hat{\tau}_h = \sum_{h=1}^{H} N_h \bar{y}_h.$$

If we prefer to estimate the overall average, we can merely divide the estimate of the overall total by the size of the population, $N$. The resulting estimator is called the stratified random sampling estimator of the population average, and is given by

$$\hat{\mu} = \sum_{h=1}^{H} N_h \bar{y}_h / N.$$


This can be expressed as a fancy average if we adjust the order of operations in the above expression. If, instead of dividing the sum by $N$, we divide each term by $N$ and then sum the results, we shall obtain the same result. Hence,

$$\hat{\mu}_{stratified} = \sum_{h=1}^{H} (N_h/N)\,\bar{y}_h = \sum_{h=1}^{H} W_h \bar{y}_h,$$

where $W_h = N_h/N$. These $W_h$-values can be thought of as weighting factors, and $\hat{\mu}_{stratified}$ can then be viewed as a weighted average of the within-stratum sample averages.

The estimated standard error is found as:

$$se(\hat{\mu}_{stratified}) = se\left\{\sum_{h=1}^{H} W_h \bar{y}_h\right\} = \sqrt{\sum_{h=1}^{H} W_h^2\,[se(\bar{y}_h)]^2},$$

where the estimated $se(\bar{y}_h)$ is given by the formulas for simple random sampling: $se(\bar{y}_h) = \sqrt{\frac{s_h^2}{n_h}(1 - f_h)}$.
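The form of this standard error follows from the independence of the samples in the different strata. A brief sketch of the reasoning, using only standard rules for the variance of a sum (the weights $W_h$ are known constants):

```latex
\mathrm{Var}\left(\sum_{h=1}^{H} W_h \bar{y}_h\right)
  = \sum_{h=1}^{H} W_h^2 \,\mathrm{Var}(\bar{y}_h)
    + \sum_{h \neq h'} W_h W_{h'} \,\mathrm{Cov}(\bar{y}_h, \bar{y}_{h'})
  = \sum_{h=1}^{H} W_h^2 \,\mathrm{Var}(\bar{y}_h)
```

The covariance terms vanish because the strata are sampled independently. Replacing each $\mathrm{Var}(\bar{y}_h)$ by its estimate $[se(\bar{y}_h)]^2$ and taking the square root gives the expression above. This is why the se values are combined by summing squares rather than by adding the se values directly.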

A Numerical Example

Suppose that for the lake sampling example discussed earlier the lake were subdivided into two strata, and that the following results were obtained. (All readings are in mg per litre.)

Stratum   $N_h$                 $n_h$   Sample observations             $\bar{y}_h$   $s_h$
1         $7.5 \times 10^8$     5       37.2  46.6  45.3  38.1  40.4    41.52         4.23
2         $2.5 \times 10^7$     5       365   344   388   347   403     369.4         25.7

We begin by computing the estimated mean for each stratum and its associated standard error. The sampling fraction $n_h/N_h$ is so close to 0 it can be safely ignored. For example, the standard error of the mean for stratum 1 is found as:

$$se(\hat{\mu}_1) = \sqrt{\frac{s_1^2}{n_1}(1 - f_1)} \approx \sqrt{\frac{4.23^2}{5}} = 1.89$$

This gives the summary table:

Stratum   $n_h$   $\hat{\mu}_h$   $se(\hat{\mu}_h)$
1         5       41.52           1.8935
2         5       369.4           11.492

Next, we estimate the total organic matter in each stratum. This is found by multiplying the mean concentration and se of each stratum by the total volume:

$$\hat{\tau}_h = N_h \times \hat{\mu}_h$$

$$se(\hat{\tau}_h) = N_h \times se(\hat{\mu}_h)$$


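For example, the estimated total organic matter in stratum 1 is found as:

$$\hat{\tau}_1 = N_1 \times \hat{\mu}_1 = 7.5 \times 10^8 \times 41.52 = 311.4 \times 10^8$$

$$se(\hat{\tau}_1) = N_1 \times se(\hat{\mu}_1) = 7.5 \times 10^8 \times 1.89 = 14.175 \times 10^8$$

This gives the summary table:

Stratum   $n_h$   $\hat{\mu}_h$   $se(\hat{\mu}_h)$   $\hat{\tau}_h$          $se(\hat{\tau}_h)$
1         5       41.52           1.8935              $311.4 \times 10^8$     $14.175 \times 10^8$
2         5       369.4           11.492              $92.3 \times 10^8$      $2.873 \times 10^8$

Next, we total the organic content of the two strata and find the se of the grand total as $\sqrt{14.175^2 + 2.873^2} \times 10^8$ to give the summary table:

Stratum   $n_h$   $\hat{\mu}_h$   $se(\hat{\mu}_h)$   $\hat{\tau}_h$          $se(\hat{\tau}_h)$
1         5       41.52           1.8935              $311.4 \times 10^8$     $14.175 \times 10^8$
2         5       369.4           11.492              $92.3 \times 10^8$      $2.873 \times 10^8$
Total                                                 $403.7 \times 10^8$     $14.46 \times 10^8$

Finally, the overall grand mean is found by dividing by the total volume of the lake, $7.75 \times 10^8$, to give:

$$\hat{\mu} = \frac{403.7 \times 10^8}{7.75 \times 10^8} = 52.09 \text{ mg/L}$$

$$se(\hat{\mu}) = \frac{14.46 \times 10^8}{7.75 \times 10^8} = 1.87 \text{ mg/L}$$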

The calculations required to compute the stratified estimate can also be done using the method of weighted averages as shown in the following table:

Stratum   $N_h$                 $W_h = N_h/N$   $\bar{y}_h$   $W_h \bar{y}_h$   $se(\bar{y}_h)$   $W_h^2 [se(\bar{y}_h)]^2$
1         $7.5 \times 10^8$     0.9677          41.52         40.180            1.8935            3.3578
2         $2.5 \times 10^7$     0.0323          369.4         11.916            11.492            0.1374
Totals    $7.75 \times 10^8$    1.0000                        52.097                              3.4952

Hence the estimate of the overall average is 52.097 mg/L, and the associated estimated standard error is $\sqrt{3.4952} = 1.870$ mg/L, and an approximate 95% confidence interval is then found in the usual fashion. As expected, these match the previous results.
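The whole calculation can be reproduced from the raw observations with a short script. This is a sketch only; as in the text, the finite population correction is ignored because the sampling fractions are essentially zero, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Stratum volume N_h and the observed concentrations (mg per litre)
strata = {
    1: {"N": 7.5e8, "y": [37.2, 46.6, 45.3, 38.1, 40.4]},
    2: {"N": 2.5e7, "y": [365, 344, 388, 347, 403]},
}

total = 0.0       # estimated grand total
var_total = 0.0   # sum of squared se's of the stratum totals
for h, s in strata.items():
    n = len(s["y"])
    ybar = mean(s["y"])
    se_mean = sqrt(stdev(s["y"]) ** 2 / n)   # fpc (1 - f_h) ignored
    total += s["N"] * ybar
    var_total += (s["N"] * se_mean) ** 2
    print(f"stratum {h}: mean {ybar:.2f}, se {se_mean:.4f}")

N = sum(s["N"] for s in strata.values())
grand_mean = total / N
se_grand_mean = sqrt(var_total) / N
print(f"overall mean {grand_mean:.3f} mg/L, se {se_grand_mean:.3f} mg/L")
```

Small differences from the hand calculation in the last digit arise only because intermediate values are not rounded here.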

This discussion swept a number of practical difficulties under the carpet. These include (a) estimating the volume of each of the two portions of the lake, (b) taking properly randomized samples from within each stratum, (c) selecting the appropriate size of each water sample, (d) measuring the concentration for each water sample, and (e) choosing the appropriate number of water samples from each stratum. None of these difficulties is simple to resolve. Estimating the volume of a portion of a lake, for example, typically involves taking numerous depth readings and then applying a formula for approximating integrals. This problem is beyond the scope of these notes.


The standard error of the estimator of the overall average is markedly reduced in this example by the stratification. The standard error was just estimated for the stratified estimator to be around 2. This result was for a sample of total size 10. By contrast, for an estimator based on a simple random sample of the same size, the standard error can be found to be about 20. [This involves methods not covered in this class.] Stratification has reduced the standard error by an order of magnitude.
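The notes do not show how the figure of about 20 is obtained. One plausible route, offered here only as a rough sketch and not necessarily the method the author had in mind, is to approximate the variance of the whole lake by a within-stratum part plus a between-stratum part, plugging in the sample means and standard deviations from the numerical example, and then to apply the usual SRS formula with n = 10 (finite population correction ignored).

```python
from math import sqrt

# Plug-in values from the numerical example (assumed to stand in for the
# unknown population quantities)
N_h = [7.5e8, 2.5e7]
ybar = [41.52, 369.4]
s = [4.23, 25.7]
n = 10  # total sample size

N = sum(N_h)
W = [Nh / N for Nh in N_h]
mu = sum(w * y for w, y in zip(W, ybar))

# Approximate overall variance = within-stratum part + between-stratum part
within = sum(w * sd ** 2 for w, sd in zip(W, s))
between = sum(w * (y - mu) ** 2 for w, y in zip(W, ybar))

se_srs = sqrt((within + between) / n)
print(f"approximate se for a simple random sample of {n}: {se_srs:.1f} mg/L")
```

The between-stratum part dominates, which is exactly the source of variation that stratification removes.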

It is also possible that we could reduce the standard error even further without increasing our sampling effort by somehow allocating this effort more efficiently. Perhaps we should take fewer water samples from the region far from the outlet, and take more from the other stratum. This will be covered later in this course.

One can also read in more comprehensive accounts how to construct estimates from samples that are stratified after the sample is selected. This is known as post-stratification. These methods are useful if, e.g., you are sampling a population with a known sex ratio. If you observe that your sample is biased in favor of one sex, you can use this information to build an improved estimate of the quantity of interest through stratifying the sample by sex after it is collected. It is not necessary that you start out with a plan for sampling some specified number of individuals from each sex (stratum).

Nonetheless, in any survey work, it is crucial that you begin with a plan. There are many examples of surveys that produced virtually useless results because the researchers failed to develop an appropriate plan. This should include a statement of your main objective, and detailed descriptions of how you plan to generate the sample, collect the data, enter them into a computer file, and analyze the results. The plan should contain discussion of how you propose to check for and correct errors at each stage. It should be tested with a pilot survey, and modified accordingly. Major, ongoing surveys should be reassessed continually for possible improvements. There is no reason to expect that the survey design will be perfect the first time that it is tried, nor that flaws will all be discovered in the first round. On the other hand, one should expect that after many years’ experience, the researchers will have honed the survey into a solid instrument. George Gallup’s early surveys were seriously biased. Although it took over a decade for the flaws to come to light, once they did, he corrected his survey design promptly, and continued to build a strong reputation.

One should also be cautious in implementing stratified survey designs for long-term studies. An efficient stratification of the Fraser Delta in 1994, e.g., might be hopelessly out of date 50 years from now, with a substantially altered configuration of channels and islands. You should anticipate the need to revise your stratification periodically.

3.7.5 Example - estimating the total catch of salmon

DFO needs to monitor the catch of sockeye salmon as the season progresses so that stocks are not overfished.

The season in one statistical sub-area in a year was a total of 2 days (!) and 250 vessels participated in the fishery in these 2 days. A census of the catch of each vessel at the end of each day is logistically difficult.

In this particular year, observers were randomly placed on selected vessels and at the end of each day the observers contacted DFO managers with a count of the number of sockeye caught on that day.

Here are the raw data - each line corresponds to the observer's count for that vessel for that day. On the second day, a new random sample of vessels was selected. On both days, 250 vessels participated in the fishery.

c©2019 Carl James Schwarz 153 2019-11-03


Date        Sockeye
29-Jul-98       337
29-Jul-98       730
29-Jul-98       458
29-Jul-98        98
29-Jul-98        82
29-Jul-98        28
29-Jul-98       544
29-Jul-98       415
29-Jul-98       285
29-Jul-98       235
29-Jul-98       571
29-Jul-98       225
29-Jul-98        19
29-Jul-98       623
29-Jul-98       180
30-Jul-98        97
30-Jul-98       311
30-Jul-98        45
30-Jul-98        58
30-Jul-98        33
30-Jul-98       200
30-Jul-98       389
30-Jul-98       330
30-Jul-98       225
30-Jul-98       182
30-Jul-98       270
30-Jul-98       138
30-Jul-98        86
30-Jul-98       496
30-Jul-98       215

What is the population of interest?

The population of interest is the set of vessels participating in the fishery on the two days. [The fact that each vessel likely participated in both days is not really relevant.] The population of interest is NOT the salmon captured - this is the response variable for each boat whose total is of interest.

What is the sampling frame?

It is not clear how the list of fishing boats was generated. It seems unlikely that the aerial survey actually had a picture of the boats on the water from which DFO selected some boats. More likely, the observers were taken onto the water in some systematic fashion, and then the observer selected a boat at random from those seen at this point. Hence the sampling frame is the set of locations chosen to drop off the observers and the set of boats visible from these points.


What is the sampling design?

The sampling unit is a boat on a day. The strata are the two days. On each day, a random sample was selected from the boats participating in the fishery.

This is a stratified design with a simple random sample selected each day.

Note that in this survey, it is logistically impossible to do a simple random sample over both days, as the number of vessels participating really isn't known for any day until the fishery starts. Here, stratification takes the form of administrative convenience.

Excel analysis

A copy of an Excel spreadsheet is available in the sockeye tab of the AllofData workbook available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

A summary of the page appears below:


Number of sockeye caught - example of stratified simple random sampling
raw data from the survey

Obs  date    sockeye  sampweight
  1  29-Jul      337     16.6667
  2  29-Jul      730     16.6667
  3  29-Jul      458     16.6667
  4  29-Jul       98     16.6667
  5  29-Jul       82     16.6667
  6  29-Jul       28     16.6667
  7  29-Jul      544     16.6667
  8  29-Jul      415     16.6667
  9  29-Jul      285     16.6667
 10  29-Jul      235     16.6667
 11  29-Jul      571     16.6667
 12  29-Jul      225     16.6667
 13  29-Jul       19     16.6667
 14  29-Jul      623     16.6667
 15  29-Jul      180     16.6667
 16  30-Jul       97     16.6667
 17  30-Jul      311     16.6667
 18  30-Jul       45     16.6667
 19  30-Jul       58     16.6667
 20  30-Jul       33     16.6667
 21  30-Jul      200     16.6667
 22  30-Jul      389     16.6667
 23  30-Jul      330     16.6667
 24  30-Jul      225     16.6667
 25  30-Jul      182     16.6667
 26  30-Jul      270     16.6667
 27  30-Jul      138     16.6667
 28  30-Jul       86     16.6667
 29  30-Jul      496     16.6667
 30  30-Jul      215     16.6667

The data are listed on the spreadsheet on the left.

Summary statistics

The Excel built-in functions are used to compute the summary statistics (sample size, sample mean, and sample standard deviation) for each stratum. Some caution needs to be exercised that the range of each function covers only the data for that stratum. 4

4 If you are proficient with Excel, Pivot-Tables are an ideal way to compute the summary statistics for each stratum. An application of Pivot-Tables is demonstrated in the analysis of a cluster sample where the cluster totals are needed for the summary statistics.


You will also need to specify the stratum size (the total number of sampling units in each stratum), i.e. 250 vessels on each day.

Find estimates of the mean catch for each stratum

Because the sampling design in each stratum is a simple random sample, the same formulae as in the previous section can be used.

The mean and its estimated se for each day of the opening are reported in the spreadsheet.

Find the estimates of the total catch for each stratum

The estimated total catch is found by multiplying the average catch per boat by the total number of boats participating in the fishery. The estimated standard error for the total for that day is found by multiplying the standard error for the mean by the stratum size as in the previous section.

For example, in the first stratum (29 July), the estimated total catch is found by multiplying the estimated mean catch per boat (322) by the number of boats participating (250) to give an estimated total catch of 80,500 salmon for the day. The se for the total catch is found by multiplying the se of the mean (57) by the number of boats participating (250) to give the se of the total catch for the day of 14,200 salmon.
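For readers who prefer to check the spreadsheet arithmetic in code, the stratum computations can be sketched in Python as follows. This is only an illustration and not part of the Excel or SAS material; the function name srs_estimates is my own, and the data are the observer counts listed above.

```python
from math import sqrt
from statistics import mean, stdev

def srs_estimates(y, N):
    """Mean, se(mean), total and se(total) for a simple random sample
    of len(y) units drawn without replacement from N units."""
    n = len(y)
    f = n / N                                   # sampling fraction
    ybar = mean(y)
    se_mean = sqrt(stdev(y) ** 2 / n * (1 - f))  # includes the fpc
    return ybar, se_mean, N * ybar, N * se_mean

catch = {
    "29-Jul": [337, 730, 458, 98, 82, 28, 544, 415, 285, 235, 571, 225, 19, 623, 180],
    "30-Jul": [97, 311, 45, 58, 33, 200, 389, 330, 225, 182, 270, 138, 86, 496, 215],
}

# 250 vessels participated on each day.
results = {day: srs_estimates(y, N=250) for day, y in catch.items()}
for day, (ybar, se_mean, total, se_total) in results.items():
    print(f"{day}: mean {ybar:.1f} (se {se_mean:.1f}), "
          f"total {total:.0f} (se {se_total:.0f})")
```

Each stratum is treated exactly as a simple random sample, so nothing new is needed beyond the formulae of the previous section.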

Find estimate of grand total

Once an estimated total is found for each stratum, the estimated grand total is found by summing the individual stratum estimated totals. The estimated standard error of the grand total is found as the square root of the sum of the squares of the standard errors in each stratum - the Excel function sumsq is useful for this computation.

Estimates of the overall grand mean

This was not done in the spreadsheet, but is easily computed by dividing the total catch by the total number of boat-days in the fishery (250+250=500). The se is found by dividing the se of the total catch also by 500.

Note this is interpreted as the mean number of fish captured per day per boat.
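The same two steps can be written as a short Python sketch (again only an illustration; the variable names are my own). The stratum totals and their standard errors are the values reported for the two days in this example.

```python
from math import sqrt

# Stratum totals and their standard errors for the two days.
totals = {"29-Jul": 80500, "30-Jul": 51250}
se_totals = {"29-Jul": 14195, "30-Jul": 8492}
N_total = 250 + 250                      # boat-days in the fishery

# Grand total: add the stratum totals; combine the se's in quadrature
# (the same computation as Excel's sumsq followed by a square root).
grand_total = sum(totals.values())
se_grand_total = sqrt(sum(se ** 2 for se in se_totals.values()))

# Grand mean: divide both the total and its se by the population size.
grand_mean = grand_total / N_total
se_grand_mean = se_grand_total / N_total

print(f"grand total {grand_total} (se {se_grand_total:.0f})")
print(f"grand mean  {grand_mean:.1f} (se {se_grand_mean:.1f})")
```

The standard errors can be combined in quadrature because the samples on the two days were selected independently.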

SAS analysis

As noted earlier, some care must be used when standard statistical packages are used to analyze survey data as many packages ignore the design used to select the data.

A sample SAS program for the analysis of the sockeye example called sockeye.sas and its output called sockeye.lst is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

The program starts with reading in the raw data and the computation of the sampling weights.

data sockeye;   /* read in the data */
   infile 'sockeye.csv' dlm=',' dsd missover firstobs=2;
   length date $8.;
   input date $ sockeye;
   /* compute the sampling weight. In general,
      these will be different for each stratum */
   if date = '29-Jul' then sampweight = 250/15;
   if date = '30-Jul' then sampweight = 250/15;
run;

Because the population size and sample size are the same for each stratum, the sampling weights are common to all boats. In general, this is not true, and a separate sampling weight computation is required for each stratum.
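To see what the sampling weights do, here is a minimal Python sketch (an illustration only, not the SAS program). Each sampled boat "represents" N_h/n_h boats in its stratum, so the weighted sum of the observed catches reproduces the estimated grand total.

```python
# Stratum sizes (vessels fishing each day) and the observed catches.
N_h = {"29-Jul": 250, "30-Jul": 250}
catch = {
    "29-Jul": [337, 730, 458, 98, 82, 28, 544, 415, 285, 235, 571, 225, 19, 623, 180],
    "30-Jul": [97, 311, 45, 58, 33, 200, 389, 330, 225, 182, 270, 138, 86, 496, 215],
}

# Sampling weight = stratum size / stratum sample size,
# computed separately for each stratum.
weights = {day: N_h[day] / len(y) for day, y in catch.items()}

# The weighted sum of the observations estimates the grand total.
weighted_total = sum(weights[day] * yi for day, y in catch.items() for yi in y)

print(weights)
print(round(weighted_total))
```

Computing the weight per stratum, as done here, carries over unchanged to designs where the stratum sizes or sample sizes differ.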

A separate file is also constructed with the population sizes for each stratum so that estimates of the population total can be constructed.

data n_boats;   /* you need to specify the stratum sizes if you want stratum totals */
   length date $8.;
   date = '29-Jul'; _total_=250; output;   /* the stratum sizes must be variable _total_ */
   date = '30-Jul'; _total_=250; output;
run;

Proc SurveyMeans then uses the STRATA statement to identify that this is a stratified design. The default design and analysis for each stratum is again a simple random sample. It is not necessary that a simple random sample be done in each stratum, nor that the same design be used in each stratum - in these cases the analysis will be more complex.

proc surveymeans data=sockeye
      N = n_boats   /* dataset with the stratum population sizes present */
      mean clm      /* average catch/boat along with standard error */
      sum clsum ;   /* request estimates of total */
   strata date / list;   /* identify the stratum variable */
   var sockeye;          /* which variable to get estimates for */
   weight sampweight;
   ods output stratainfo=stratainfo;
   ods output statistics=sockeyeresults;
run;

The SAS output consists of two parts. First is information about each stratum:

Stratum          Number of Observations  Population  Sampling  Variable
Index    date    in a Stratum            Totals      Rate      Name       N
1        29-Jul  15                      250         6.00%     sockeye   15
2        30-Jul  15                      250         6.00%     sockeye   15


Then come the actual results:

Variable          SE     LCL    UCL                      LCL     UCL
Name      Mean    Mean   Mean   Mean   Sum      SE sum   Sum     Sum
sockeye   263.5   33.1   195.7  331.3  131750   16541    97867   165633

which match the results from Excel (as they must).

The only thing of “interest” is to note that by default, SAS labels the precision of the estimated grand mean as a standard error while it labels the precision of the estimated total as a standard deviation! Both are correct - a standard error is a standard deviation - not of individual units in the population - but of the estimates over repeated sampling from the same population. I think it is clearer to label both as standard errors to avoid any confusion, as I did using the label statements in the Proc Print that created the output (see code for details).

If separate analyses are wanted for each stratum, the SurveyMeans procedure has to be run twice; the second run uses a BY statement to estimate the means and totals in each stratum.

proc surveymeans data=sockeye
      N = n_boats   /* dataset with the stratum population sizes present */
      mean clm      /* average catch/boat along with standard error */
      sum clsum ;   /* request estimates of total */
   by date;
   var sockeye;     /* which variable to get estimates for */
   weight sampweight;
   ods output stratainfo=stratainfosep;
   ods output statistics=sockeyeresultssep;
run;

This gives separate results for each stratum

        Variable          SE     LCL    UCL                     LCL     UCL
date    Name      Mean    Mean   Mean   Mean   Sum     SE sum   Sum     Sum
29-Jul  sockeye   322.0   56.8   200.2  443.8  80500   14195    50055   110945
30-Jul  sockeye   205.0   34.0   132.1  277.9  51250    8492    33036    69464

Again, it is likely easiest to do planning for future experiments in an Excel spreadsheet rather than using SAS.

When should the various estimates be used?

In a stratified sample, there are many estimates that are obtained with different standard errors. It can sometimes be confusing as to which estimate is used for which purpose.

Here is a brief review of the four possible estimates and the level of interest in each estimate.


Parameter: Stratum mean
   Estimator: $\hat{\mu}_h = \bar{Y}_h$
   se: $\sqrt{\frac{s_h^2}{n_h}(1-f_h)}$
   Example and interpretation: Stratum 1. Estimate is 322; se of 56.8 (not shown). The estimated average catch per boat was 322 fish (se 56.8 fish) on 29 July.
   Who would be interested in this quantity? A fisher who wishes to fish ONLY the first day of the season and wants to know if it will meet expenses.

Parameter: Stratum total
   Estimator: $\hat{\tau}_h = N_h \hat{\mu}_h = N_h \bar{Y}_h$
   se: $N_h \, se(\hat{\mu}_h) = N_h \sqrt{\frac{s_h^2}{n_h}(1-f_h)}$
   Example and interpretation: Stratum 1. Estimate is 80,500 = 250 x 322; se of 14,195 = 250 x 56.8. The estimated total catch over all boats on 29 July was 80,500 (se 14,195).
   Who would be interested in this quantity? DFO, who wishes to estimate TOTAL catch over ALL boats on this single day so that the quota for the next day can be set.

Parameter: Grand total
   Estimator: $\hat{\tau} = \hat{\tau}_1 + \hat{\tau}_2$
   se: $\sqrt{se(\hat{\tau}_1)^2 + se(\hat{\tau}_2)^2}$
   Example and interpretation: Estimate is 131,750 = 80,500 + 51,250; se is $\sqrt{14195^2 + 8492^2} = 16541$. The estimated total catch over all boats over all days is 132,000 fish (se 17,000 fish).
   Who would be interested in this quantity? DFO, who wishes to know the total catch over the entire fishing season so that impacts on the stock can be examined.

Parameter: Grand average
   Estimator: $\hat{\mu} = \frac{\hat{\tau}}{N}$
   se: $\frac{se(\hat{\tau})}{N}$
   Example and interpretation: Grand mean (not shown). N = 500 vessel-days. Estimate is 131,750/500 = 263.5; se is 16541/500 = 33.1. The estimated catch per boat per day over the entire season was 263 fish (se 33 fish).
   Who would be interested in this quantity? A fisher who wants to know the average catch per boat per day for the entire season to see if it will meet expenses.


3.7.6 Sample Size for Stratified Designs

As before, the question arises as to how many units should be selected in stratified designs.

There are two questions that need to be answered. First, what is the total sample size required? Second, how should this total be allocated among the strata?

The total sample size can be determined using the same methods as for a simple random sample. I would suggest that you initially ignore the fact that the design will be stratified when finding the initial required total sample size. If stratification proves to be useful, then your final estimate will be more precise than you anticipated (always a nice thing to happen!) but seeing as you are making guesses as to the standard deviations and necessary precision required, I wouldn't worry about the extra cost in sampling too much.

If you must, it is possible to derive formulae for the overall sample sizes when accounting for stratification, but these are relatively complex. It is likely easier to build a general spreadsheet where a single cell holds the total sample size and all other cells in the formulae depend upon this quantity according to the allocation used. Then the total sample size can be manipulated to obtain the desired precision. The following information will be required:

• The sizes (or relative sizes) of each stratum (i.e. the Nh or Wh).

• The standard deviation of measurements in each stratum. This can be obtained from past surveys, a literature search, or expert opinion.

• The desired precision – overall – and if needed, for each stratum.

Again refer to the sockeye worksheet.


The standard deviations from this survey will be used as ‘guesses’ for what might happen next year. As in this year's survey, the total sample size will be allocated evenly between the two days.

In this case, the total sample size must be allocated to the two strata. You will see several methods in a later section to do this, but for now, assume that the total sample will be allocated equally among both strata. Hence the proposed sample size of 75 is split in half to give a proposed sample size of 37.5 in each stratum. Don't worry about the fractional sample size - this is only a planning exercise. We create one cell that has the total sample size, and then use the formulae to allocate the total sample size equally to the two strata. The total and the se of the overall total are found as before, and the relative precision (denoted as the relative standard error (rse), and, unfortunately, in some books as the coefficient of variation (cv)) is found as the estimated standard error/estimated total.

Again, this portion of the spreadsheet is set up so that changes in the total sample size are propagated throughout the sheet. If you change the total sample size from 75 to some other number, this is automatically split among the two strata, which then affects the estimated standard error for each stratum, which then affects the estimated standard error for the total, which then affects the relative standard error. Again, the proposed total sample size can be varied using trial and error, or the Excel Goal-Seek option can be used.

Here is what happens when a sample size of 75 is used. Don't be alarmed by the fractional sample sizes in each stratum - the goal is again to get a rough feel for the required effort for a certain precision.

Total n=75

                                            Est      se of
Stratum    n      Mean   std dev  vessels   total    Est total
29-Jul     37.5   322    226.8    250       80500    8537
30-Jul     37.5   205    135.7    250       51250    5107
Total                                       131750   9948
rse                                                  7.6%

A sample size of 75 is too small. Try increasing the sample size until the rse is 5% or less. Alternatively, one could use the GOAL SEEK feature of Excel to find the sample size that gives a relative standard error of 5% or less as shown below:

Total n=145

                                            Est      se of
Stratum    n      Mean   std dev  vessels   total    Est total
29-Jul     72.5   322    226.8    250       80500    5611
30-Jul     72.5   205    135.7    250       51250    3357
Total                                       131750   6539
rse                                                  5.0%
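The trial-and-error search in the spreadsheet can be mimicked with a few lines of Python. This is a rough sketch rather than a reproduction of the Excel sheet: the function names rse_of_total and smallest_n are my own, the planning values are the means and standard deviations from the tables above, and the allocation is assumed to be equal between the two days.

```python
from math import sqrt

# Planning values from this year's survey (used as guesses for next year).
strata = {
    "29-Jul": {"N": 250, "mean": 322, "sd": 226.8},
    "30-Jul": {"N": 250, "mean": 205, "sd": 135.7},
}

def rse_of_total(n_total):
    """Relative standard error of the estimated grand total when n_total
    is split equally among the strata (fractional sizes are allowed)."""
    n_h = n_total / len(strata)
    total = sum(s["N"] * s["mean"] for s in strata.values())
    var = sum((s["N"] * s["sd"]) ** 2 / n_h * (1 - n_h / s["N"])
              for s in strata.values())
    return sqrt(var) / total

def smallest_n(target_rse, n_max=500):
    """Smallest whole total sample size with rse at or below target_rse
    (None if no size up to n_max achieves it)."""
    for n in range(2, n_max + 1):
        if rse_of_total(n) <= target_rse:
            return n
    return None

print(f"rse at n=75:  {rse_of_total(75):.3f}")
print(f"rse at n=145: {rse_of_total(145):.3f}")
print("smallest n with rse <= 5%:", smallest_n(0.05))
```

The integer search should land very close to the total of 145 shown above; any small difference comes from how the spreadsheet value was rounded.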

3.7.7 Allocating samples among strata

There are a number of ways of allocating a sample of size n among the various strata. For example,

1. Equal allocation. Under an equal allocation scheme, all strata get the same sample size, i.e. $n_h = n/H$, where $H$ is the number of strata. This allocation is best if variances of strata are roughly equal, equally precise estimates are required for each stratum, and you wish to test for differences in means among strata (i.e. an analytical survey discussed in previous sections).

2. Proportional allocation. Under proportional allocation, sample sizes are allocated to be proportional to the number of sampling units in the strata, i.e. $n_i = n \times \frac{N_i}{N} = n \times \frac{N_i}{\sum N_h} = n \times \frac{N_i}{N_1 + N_2 + \cdots + N_H} = n \times W_i$. This allocation is simple to plan and intuitively appealing. However, it is not the best design. This design may waste effort because large strata get large sample sizes, but precision is determined by sample size, not by the ratio of sample size to population size. For example, if one stratum is 10 times larger than any other stratum, it is not necessary to allocate 10 times the sampling effort to get the same precision in that stratum.

3. Neyman allocation. In Neyman allocation (named after the statistician Neyman), the sample is allocated to minimize the overall standard error for a given total sample size. Tedious algebra gives that the sample should be allocated proportional to the product of the stratum size and the stratum standard deviation, i.e. $n_i = n \times \frac{W_i S_i}{\sum W_h S_h} = n \times \frac{N_i S_i}{\sum N_h S_h} = n \times \frac{N_i S_i}{N_1 S_1 + N_2 S_2 + \cdots + N_H S_H}$. Intuitively, the strata that have the most sampling units should be weighted more heavily; strata with larger standard deviations must have more samples allocated to them to get the se of the sample mean within the stratum down to a reasonable level. A key assumption of this allocation is that the cost to sample a unit is the same in all strata.

4. Optimal allocation when costs are involved. In some cases, the costs of sampling differ among the strata. Suppose that it costs $C_i$ to sample each unit in stratum $i$. Then the total cost of the survey is $C = \sum n_h C_h$. The allocation rule is that sample sizes should be proportional to the product of stratum sizes, stratum standard deviations, and the inverse of the square root of the cost of sampling, i.e. $n_i = n \times \frac{W_i S_i / \sqrt{C_i}}{\sum \left( W_h S_h / \sqrt{C_h} \right)} = n \times \frac{N_i S_i / \sqrt{C_i}}{\sum \left( N_h S_h / \sqrt{C_h} \right)} = n \times \frac{N_i S_i / \sqrt{C_i}}{N_1 S_1/\sqrt{C_1} + N_2 S_2/\sqrt{C_2} + \cdots + N_H S_H/\sqrt{C_H}}$. This implies that large samples are found in strata that are larger, more variable, or cheaper to sample.

In practice, most of the gain in precision occurs from moving from equal to proportional allocation, while often only small improvements in precision are gained from moving from proportional allocation to Neyman allocation. Similarly, unless cost differences are enormous, there isn't much of an improvement in precision to moving to an allocation based on costs.
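The four allocation rules above differ only in what the stratum sample sizes are made proportional to, so they can be sketched with one small Python function. The function name allocate and the three-stratum population below are hypothetical and serve only to illustrate the formulae.

```python
from math import sqrt

def allocate(n, N, S=None, C=None):
    """Split a total sample size n among strata.

    N : stratum sizes N_h
    S : stratum standard deviations S_h (omit for proportional allocation)
    C : per-unit sampling costs C_h (omit when costs are equal)
    Sample sizes are proportional to N_h * S_h / sqrt(C_h).
    """
    S = S if S is not None else [1.0] * len(N)
    C = C if C is not None else [1.0] * len(N)
    score = [Nh * Sh / sqrt(Ch) for Nh, Sh, Ch in zip(N, S, C)]
    return [n * x / sum(score) for x in score]

# Hypothetical three-stratum population, used only for illustration.
N = [100, 200, 700]       # stratum sizes
S = [30.0, 10.0, 5.0]     # stratum standard deviations
C = [1.0, 4.0, 1.0]       # cost of sampling one unit in each stratum
n = 100

equal = [n / len(N)] * len(N)
proportional = allocate(n, N)
neyman = allocate(n, N, S)
cost_optimal = allocate(n, N, S, C)

for name, alloc in [("equal", equal), ("proportional", proportional),
                    ("Neyman", neyman), ("cost-optimal", cost_optimal)]:
    print(f"{name:>13}: " + ", ".join(f"{x:.1f}" for x in alloc))
```

In this made-up population the small but highly variable first stratum receives far more effort under Neyman allocation than under proportional allocation, and the expensive second stratum loses effort once costs are taken into account.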

Example - estimating the size of a caribou herd

This section is based on the paper:

Siniff, D.B. and Skoog, R.O. (1964). Aerial Censusing of Caribou Using Stratified Random Sampling. The Journal of Wildlife Management, 28, 391-401. http://dx.doi.org/10.2307/3798104

Some of the values have been modified slightly for illustration purposes.

The authors wished to estimate the size of a caribou herd. The density of caribou differs dramatically based on the habitat type. The survey area was divided into six strata based on habitat type. The survey design is to divide each stratum into 4 km2 quadrats that will be randomly selected. The number of caribou in the quadrats will be counted from an aerial photograph.

The computations are available in the caribou tab in the Excel workbook ALLofData.xls available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets. The key point in examining different allocations is to make a single cell represent the total sample size and then make the formula in each of the stratum sample sizes a function of the total.

The total sample size can be found by varying the sample total until the desired precision is found.

Results from previous year's survey: Here are the summary statistics from the survey in a previous year:


           Map-squares  Sampled
Stratum    Nh           nh       y       s       Est total  se(total)
1          400          98       24.1    74.7    9640       2621
2          40           10       25.6    63.7    1024       698
3          100          37       267.6   589.5   26760      7693
4          40           6        179     151.0   7160       2273
5          70           39       293.7   351.5   20559      2622
6          120          21       33.2    99.0    3984       2354
Total      770          211                      69127      9172

The estimated size of the herd is 69,127 animals with an estimated se of 9,172 animals.
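The totals in the last two columns can be checked with a short Python sketch (an illustration only; the spreadsheet remains the reference, and the function name is my own). Because the standard deviations in the table are rounded to one decimal place, the computed se agrees with the tabled value only up to rounding.

```python
from math import sqrt

def se_stratified_total(N, n, s):
    """se of the estimated total under stratified simple random sampling:
    sqrt( sum_h N_h^2 * s_h^2 / n_h * (1 - n_h / N_h) )."""
    return sqrt(sum(Nh ** 2 * sh ** 2 / nh * (1 - nh / Nh)
                    for Nh, nh, sh in zip(N, n, s)))

# Summary statistics from the previous year's survey (table above).
N    = [400, 40, 100, 40, 70, 120]                 # map-squares per stratum
n    = [98, 10, 37, 6, 39, 21]                     # map-squares sampled
ybar = [24.1, 25.6, 267.6, 179.0, 293.7, 33.2]     # mean caribou per square
s    = [74.7, 63.7, 589.5, 151.0, 351.5, 99.0]     # std dev per square

total = sum(Nh * yh for Nh, yh in zip(N, ybar))
se_total = se_stratified_total(N, n, s)
print(f"estimated herd size {total:.0f} (se {se_total:.0f})")
```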

Equal allocation

What would happen if an equal allocation were used? We now split the 211 total sample size equally among the 6 strata. In this case, the sample sizes are ‘fractional’, but this is OK as we are interested only in planning to see what would have happened. Notice that the estimate of the overall population would NOT change, but the se changes.

Stratum    Nh     nh      y       s       Est total  se(total)
1          400    35.2    24.1    74.7    9640       4810
2          40     35.2    25.6    63.7    1024       149
3          100    35.2    267.6   589.5   26760      8005
4          40     35.2    179     151.0   7160       354
5          70     35.2    293.7   351.5   20559      2927
6          120    35.2    33.2    99.0    3984       1684
Total      770    211                     69127      9938

An equal allocation gives rise to worse precision than the original survey. Examining the table in more detail, you see that far too many samples are allocated in an equal allocation to strata 2 and 4 and not enough to strata 1 and 3.

Proportional allocation

What about proportional allocation? Now the sample size is proportional to the stratum population sizes. For example, the sample size for stratum 1 is found as 211 × 400/770. The following results are obtained:

Stratum    Nh     nh      y       s       Est total  se(total)
1          400    109.6   24.1    74.7    9640       2431
2          40     11.0    25.6    63.7    1024       656
3          100    27.4    267.6   589.5   26760      9596
4          40     11.0    179     151.0   7160       1554
5          70     19.2    293.7   351.5   20559      4787
6          120    32.9    33.2    99.0    3984       1765
Total      770    211                     69127      11263

This has an even worse standard error! It looks like not enough samples are placed in strata 3 and 5.


Optimal allocation

What if both the stratum sizes and the stratum variances are to be used in allocating the sample? We create a new column (at the extreme right) which is equal to NhSh. Now the sample sizes are proportional to these values, i.e. the sample size for the first stratum is now found as 211 × 29866.4/133893.8. Again the estimate of the total doesn't change but the se is reduced.

Stratum    Nh     nh      y       s       Est total  se(total)  NhSh
1          400    47.1    24.1    74.7    9640       4089       29866.4
2          40     4.0     25.6    63.7    1024       1206       2550.0
3          100    92.9    267.6   589.5   26760      1629       58953.9
4          40     9.5     179     151.0   7160       1709       6039.6
5          70     38.8    293.7   351.5   20559      2639       24607.6
6          120    18.7    33.2    99.0    3984       2522       11876.4
Total      770    211                     69127      6089       133893.8
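The "single cell drives everything" idea behind the spreadsheet translates directly into code. The following Python sketch (my own illustration, using the rounded standard deviations from the tables, so the results agree with the tabled se's only up to rounding) recomputes the se of the estimated herd size under the three allocations of the same 211 map-squares.

```python
from math import sqrt

N = [400, 40, 100, 40, 70, 120]                # stratum sizes
s = [74.7, 63.7, 589.5, 151.0, 351.5, 99.0]    # stratum std devs (rounded)
n_total = 211                                  # the "single cell"

def se_total(n):
    """se of the estimated total for stratum sample sizes n
    (fractional sizes are fine for planning)."""
    return sqrt(sum(Nh ** 2 * sh ** 2 / nh * (1 - nh / Nh)
                    for Nh, nh, sh in zip(N, n, s)))

def spread(weights):
    """Allocate n_total in proportion to the given weights."""
    return [n_total * w / sum(weights) for w in weights]

allocations = {
    "equal": spread([1] * len(N)),
    "proportional": spread(N),
    "Neyman": spread([Nh * sh for Nh, sh in zip(N, s)]),
}
se_by_allocation = {name: se_total(n) for name, n in allocations.items()}

for name, se in se_by_allocation.items():
    sizes = ", ".join(f"{x:.1f}" for x in allocations[name])
    print(f"{name:>12}: se = {se:6.0f}   n_h = {sizes}")
```

Changing n_total and rerunning plays the same role as changing the total-sample-size cell in the spreadsheet.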

3.7.8 Example: Estimating the number of tundra swans.

The Tundra Swan Cygnus columbianus, formerly known as the Whistling Swan, is a large bird with white plumage and black legs, feet, and beak. 5 The USFWS is responsible for conserving and protecting tundra swans as a migratory bird under the Migratory Bird Treaty Act and the Fish and Wildlife Conservation Act of 1980. As part of these responsibilities, it conducts regular aerial surveys at one of their prime breeding areas in Bristol Bay, Alaska. The Bristol Bay population of tundra swans is of particular interest because suitable habitat for nesting is available earlier than in most other nesting areas. This example is based on one such survey. 6

Tundra swans are highly visible on their nesting grounds, making them easy to monitor during aerial surveys.

The Bristol Bay refuge has been divided into 186 survey units, each being a quarter section. These survey units have been divided into three strata based on density, and previous years' data provide the following information about the strata:

Density    Total          Past      Past
Stratum    Survey Units   Density   Std Dev
High       60             20        10
Medium     68             10        6
Low        58             2         3
Total      186

Based on past years' results and budget considerations, approximately 30 survey units can be sampled.

The three strata are all approximately the same total area (number of survey units) so allocations based on stratum area will be approximately equal across strata. However, that would place about 1/3 of the effort into the low density strata which typically have fewer birds.

5 Additional information about the tundra swan is available at http://www.hww.ca/hww2.asp?id=78&cid=7
6 Doster, J. (2002). Tundra Swan Population Survey in Bristol Bay, Northern Alaska Peninsula, June 2002.


It is felt that stratum density is a suitable measure of stratum importance (notice the close relationship between stratum density and stratum standard deviations, which is often found in biological surveys). Consequently, an allocation based on stratum density was used. The sum of the density values is 20 + 10 + 2 = 32. An allocation proportional to density would then place about $30 \times \frac{20}{32} \approx 18$ units in the high density stratum; about $30 \times \frac{10}{32} \approx 9$ units in the medium density stratum; and the remainder (3 units) in the low density stratum.
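The allocation arithmetic can be sketched in Python as follows. The exact shares are not whole numbers; the rounding rule used here (round the high and medium shares down and give the remainder to the low stratum) is my assumption about how the figures of 18, 9, and 3 units were reached.

```python
from math import floor

density = {"High": 20, "Medium": 10, "Low": 2}   # past densities by stratum
n_total = 30                                     # survey units that can be flown

# Exact shares proportional to past density.
shares = {k: n_total * d / sum(density.values()) for k, d in density.items()}

# Assumed rounding rule: round High and Medium down, remainder goes to Low.
planned = {k: floor(shares[k]) for k in ("High", "Medium")}
planned["Low"] = n_total - sum(planned.values())

print(shares)
print(planned)
```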

The survey was conducted with the following results:

Survey                Area    Swans in   Single            All
Unit        Stratum   (km2)   flocks     birds     Pairs   birds
dilai2      h         148                12        6       24
naka41      h         137                13        15      43
naka43      h         137                6         16      38
naka51      h         16      10         3         2       17
nakb32      h         137                10        10      30
nakb44      h         135     6          18        12      48
nakc42      h         83      4          5         6       21
nakc44      h         109                17        15      47
nakd33      h         134                11        11      33
ugac34      h         65                 2         10      22
ugac44      h         138                28        15      58
ugad5/63    h         159                9         20      49
dugad56/4   m         102                7         4       15
guad43      m         137                6         4       14
ugad42      m         137     5          11        15      46
low1        l         143                2                 2
low3        l         138                1                 1

The first thing to notice from the table above is that not all survey units could be surveyed because of poor weather. As always with missing data, it is important to determine if the data are Missing Completely at Random (MCAR). In this case, it seems reasonable that swans did not adjust their behavior knowing that certain survey units would be sampled on the poor weather days and so there is no impact of the missing data other than a loss of precision compared to a survey with a full 30 survey units chosen.

Also notice that “blanks” in the table (missing values) represent zeros and not really missing data.

Finally, not all of the survey units are the same area. This could introduce additional variation into our data which may affect our final standard errors. Even though the survey units are of different areas, the survey units were chosen as a simple random sample so ignoring the area will NOT introduce bias into the estimates (why?). You will see in later sections how to compute a ratio estimator which could take the area of each survey unit into account and potentially lead to more precise estimates.

The data are read into SAS in the usual fashion with the code fragment:

data swans;
   infile 'tundra.csv' dlm=',' dsd missover firstobs=2;
   length survey_unit $10 stratum $1;
   input survey_unit $ stratum $ area num_flocks num_single num_pairs;
   num_swans = num_flocks + num_single + 2*num_pairs;
run;

The total number of survey units in each stratum is also read into SAS using the code fragment.

data total_survey_units;
   length stratum $1.;
   input stratum $ _total_;   /* must use _total_ as variable name */
   datalines;
h 60
m 68
l 58
;;;;

Notice that the variable that has the number of stratum units must be called _total_ as required by the SurveyMeans procedure.

Next the data are sorted by stratum (not shown), and the number of survey units actually surveyed in each stratum is found using Proc Means:

    proc means data=swans noprint;
       by stratum;
       var num_swans;
       output out=n_units n=n;
    run;

Most survey procedures in SAS require the use of sampling weights. These are the reciprocal of the probability of selection. In this case, this is simply the number of units in the stratum divided by the number sampled in each stratum:

    data swans;
       merge swans total_survey_units n_units;
       by stratum;
       sampling_weight = _total_ / n;
    run;

Now the individual stratum estimates are obtained using the code fragment:

    /* first estimate the numbers in each stratum */
    proc surveymeans data=swans
          total=total_survey_units   /* inflation factors */
          sum clsum mean clm;
       by stratum;                   /* separate estimates by stratum */
       var num_swans;
       weight sampling_weight;
       ods output statistics=tundraresultssep;
       ods output stratinfo =tundrastratainfo;
    run;

This gives the output:

    stratum  VariableName   Mean  SEMean  LCLMean  UCLMean   Sum  SEsum  LCLSum  UCLSum
    h        num_swans      35.8     3.4     28.3     43.4  2150    206    1697    2603
    l        num_swans       1.5     0.5     -4.7      7.7    87     28    -275     449
    m        num_swans      25.0    10.3    -19.2     69.2  1700    698   -1305    4705

The estimates in the L and M strata are not very precise because of the small number of survey units selected. SAS has incorporated the finite population correction factor when estimating the se for the individual stratum estimates.

We estimate that roughly 2000 swans are present in each of the H and M strata, but just under 100 in the L stratum. The grand total is found by adding the estimated totals from the strata, 2150 + 87 + 1700 = 3937, and the standard error of the grand total is found in the usual way:

$$\sqrt{206^2 + 28^2 + 698^2} \approx 729.$$
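The roll-up across strata is simple enough to check by hand. The following is a minimal Python sketch (not part of the SAS workflow used in these notes) that adds the stratum totals and combines the stratum standard errors in quadrature, using the rounded values from the output above. Because the inputs are rounded, the se comes out at about 728 rather than the 729 that SAS reports from unrounded values.

```python
import math

# Stratum totals and standard errors, copied (rounded) from the
# by-stratum SurveyMeans output above.
stratum_totals = {"h": 2150, "l": 87, "m": 1700}
stratum_se = {"h": 206, "l": 28, "m": 698}

# Strata are sampled independently, so totals add and variances add.
grand_total = sum(stratum_totals.values())
grand_se = math.sqrt(sum(se ** 2 for se in stratum_se.values()))

print(f"grand total = {grand_total}, se = {grand_se:.0f}")
```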

Proc SurveyMeans can be used to estimate the grand total number of swans over all strata using the code fragment:

    /* now to estimate the grand total */
    proc surveymeans data=swans
          total=total_survey_units     /* inflation factors for each stratum */
          sum clsum mean clm;          /* want to estimate grand totals */
       title2 'Estimate total number of swans';
       strata stratum /list;           /* which variable define the strata */
       var num_swans;                  /* which variable to analyze */
       weight sampling_weight;         /* sampling weight for each obs */
       ods output statistics=tundraresults;
    run;

This gives the output:

    VariableName   Mean  SEMean  LCLMean  UCLMean   Sum  SEsum  LCLSum  UCLSum
    num_swans      21.2     3.9     12.8     29.6  3937    729    2374    5500

The standard error is larger than desired, mostly because of the very small sample size in the M stratum where only 3 of the 9 proposed survey units could be surveyed.

3.7.9 Post-stratification

In some cases, it is inconvenient or impossible to stratify the population elements into strata before sampling because the value of a variable used for stratification is only available after the unit is sampled. For example,

• we wish to stratify a sample of baby births by birth weight to estimate the proportion of birth defects;

• we wish to stratify by family size when looking at day care costs;

• we wish to stratify by soil moisture but this can only be measured when the plot is actually visited.

We don’t know the birth weight, the family-size, or the soil moisture until after the data are collected.

There is nothing formally wrong with post-stratification and it can lead to substantial improvements in precision.

How would post-stratification work in practice? Suppose that 20 quadrats (each 1 m²) were sampled out of a 100 m² survey area using a simple random sample, and the number of insect grubs was counted in each quadrat. When the units were sampled, the soil was classified into high or low quality habitat for these grubs:

Grubs Post-strat

10 h

2 l

3 l

8 h

1 l

3 l

11 h

2 l

2 l

11 h

17 h

1 l

0 l

11 h

15 h

2 l

2 l

4 l

2 l

1 l

The data are available in the post-stratify.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets. They are imported into SAS in the usual way:

    data grubs;
       infile 'post-stratify.csv' dlm=',' dsd missover firstobs=2;
       length post_stratum $1.;
       input grubs post_stratum;
    run;

The first few lines of the raw data are shown below:

Obs post_stratum grubs

1 h 10

2 l 2

3 l 3

4 h 8

5 l 1

6 l 3

7 h 11

8 l 2

9 l 2

10 h 11

If stratification is ignored, then the usual analysis using Proc SurveyMeans can be used after creating the appropriate survey weights (not shown, but see code):

    proc surveymeans data=grubs
          total=100            /* inflation factors */
          sum clsum mean clm;
       var grubs;
       weight sampling_weight_overall;
       ods output statistics=poststratifyresultssimple;
    run;

which gives:

    VariableName   Mean  SEMean  LCLMean  UCLMean   Sum  SEsum  LCLSum  UCLSum
    grubs          5.40    1.05     3.21     7.59   540    105     321     759

The overall mean density is estimated to be 5.40 insects/m² with a se of 1.17 insects/m² ignoring any fpc (1.05 insects/m² if the fpc is used, as in the output above). The estimated total number of insects over all 100 m² of the study area is 100 × 5.40 = 540 insects with a se of 100 × 1.17 = 117 insects (105 insects if the fpc is used).
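To see where the two sets of standard errors come from, here is a minimal Python sketch (an illustration only, not the SAS program from the notes) that treats the 20 quadrats as a simple random sample from N = 100 quadrats and computes the se of the mean and of the total with and without the finite population correction.

```python
import math
import statistics

# Grub counts for the 20 sampled quadrats, in the order listed in the table.
grubs = [10, 2, 3, 8, 1, 3, 11, 2, 2, 11, 17, 1, 0, 11, 15, 2, 2, 4, 2, 1]
N = 100          # number of 1 m^2 quadrats in the survey area
n = len(grubs)

mean = statistics.mean(grubs)
s = statistics.stdev(grubs)                 # sample standard deviation (n - 1 divisor)

se_no_fpc = s / math.sqrt(n)                # ignores the finite population correction
se_fpc = se_no_fpc * math.sqrt(1 - n / N)   # applies the fpc with f = n/N

total = N * mean
print(f"mean  = {mean:.2f}, se = {se_no_fpc:.2f} (no fpc), {se_fpc:.2f} (fpc)")
print(f"total = {total:.0f}, se = {N * se_no_fpc:.0f} (no fpc), {N * se_fpc:.0f} (fpc)")
```

The fpc version corresponds to the se reported by Proc SurveyMeans when the total= option is supplied.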


Now suppose we look at the summary statistics by the post-stratification variable. Proc Tabulate:

    proc tabulate data=grubs;
       class post_stratum;
       var grubs;
       table post_stratum, grubs*(n*f=5.0 mean*f=5.2 std*f=5.2);
    run;

gives some simple summary statistics about each stratum:

                        grubs
    post_stratum     N   Mean    Std
    h                7  11.86   3.08
    l               13   1.92   1.04

If the areas of the post-strata are known (and this is NOT always possible), you can use the standard rollup for a stratified design. Suppose that there were 30 m² of high quality habitat and 70 m² of low quality habitat. Then the roll-up proceeds as before and is summarized as:

The usual stratified analysis can then be done:

    proc surveymeans data=grubs
          total=total_survey_units          /* inflation factors for each stratum */
          sum clsum mean clm;               /* want to estimate grand totals */
       strata post_stratum /list;           /* which variable define the strata */
       var grubs;                           /* which variable to analyze */
       weight sampling_weight_post_strata;  /* sampling weight for each obs */
       ods output statistics=poststratifyresults;
    run;

giving:

    VariableName   Mean  SEMean  LCLMean  UCLMean   Sum  SEsum  LCLSum  UCLSum
    grubs           4.9     0.4      4.2      5.7   490     36     416     565
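The stratified roll-up behind this output can be sketched in a few lines of Python. This is an illustration under the assumptions stated in the text (30 m² of high quality and 70 m² of low quality habitat, each quadrat 1 m²), not the SAS program itself. Each post-stratum mean is expanded by the stratum area, and the stratum variances are added.

```python
import math
import statistics

# (grubs, post-stratum) for the 20 sampled quadrats, as listed in the table.
data = [(10, "h"), (2, "l"), (3, "l"), (8, "h"), (1, "l"), (3, "l"), (11, "h"),
        (2, "l"), (2, "l"), (11, "h"), (17, "h"), (1, "l"), (0, "l"), (11, "h"),
        (15, "h"), (2, "l"), (2, "l"), (4, "l"), (2, "l"), (1, "l")]
area = {"h": 30, "l": 70}   # m^2 of each habitat type, as supposed in the text

total = 0.0
var_no_fpc = 0.0
var_fpc = 0.0
for stratum, N_h in area.items():
    y = [g for g, s in data if s == stratum]
    n_h = len(y)
    var_mean = statistics.variance(y) / n_h        # variance of the stratum mean
    total += N_h * statistics.mean(y)              # expand the mean to a stratum total
    var_no_fpc += N_h ** 2 * var_mean
    var_fpc += N_h ** 2 * var_mean * (1 - n_h / N_h)

se_no_fpc = math.sqrt(var_no_fpc)
se_fpc = math.sqrt(var_fpc)
print(f"total = {total:.0f}, se = {se_no_fpc:.0f} (no fpc), {se_fpc:.0f} (fpc)")
```

Note that this treats the post-stratum sample sizes as if they had been fixed in advance, which is the usual simplification.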

Now the estimated total grubs is 490 with a se of 40 (36 if the fpc is used) - a substantial improvement over the non-stratified analysis. The difference in the estimates (i.e. 540 vs. 490) is well within the range of uncertainty summarized by the standard errors.

There are several potential problems when using post-stratification.

• The sample size in each post-stratum cannot be controlled. This implies it is not possible to use any of the allocation methods to improve precision that were discussed earlier. As well, the survey may end up with a very small sample size in some strata.

• The reported se must be increased to account for the fact that the sample size in each stratum is no longer fixed. This introduces an additional source of variation for the estimate, i.e. estimates will vary from sample to sample not only because a new sample is drawn each time, but also because the sample size within a stratum will change. However, in practice this is rarely a problem because the actual increase in the se is usually small, and this additional adjustment is rarely ever done.

• In the above example, the area of each stratum in the ENTIRE study area could be found after the fact. But in some cases, it is impossible to find the area of each stratum in the entire study area and so the rollup could not be done. In these cases, you could use the results from the post-stratification to also estimate the area of each stratum, but now the expansion factor for each stratum also has a se and this must also be taken into account. Please consult a standard book on sampling theory for details.

3.7.10 Allocation and precision - revisited

A student wrote:

I’m a little confused about sample allocation in stratified sampling. Earlier in the course, you stated that precision is independent of population size, i.e. a sample of 1000 gave estimates that were equally precise for Canada and the US (assuming a simple random sample). Yet in stratified sampling, you also said that precision is improved by proportional allocation where larger strata get larger sample sizes.

Both statements are correct. If you are interested in estimates for individual populations, then absolute sample size is important.

If you wanted equally precise estimates for BOTH Canada and the US then you would have equal sample sizes from both populations, say 1000 from each population, even though their overall population sizes differ by a factor of 10:1.

However, in stratified sampling designs, you may also be interested in the OVERALL estimate, over both populations. In this case, a proportional allocation, where sample size is allocated in proportion to population size, often performs better. In this case, the overall sample of 2000 people would be allocated proportional to the population sizes as follows:

    Stratum   Population    Fraction of total population   Sample size
    US        300,000,000   90.9%                          90.9% x 2000 = 1818
    Canada     30,000,000    9.1%                           9.1% x 2000 = 182
    Total     330,000,000   100%                           2000

Why does this happen? Well, if you are interested in the overall population, then the US results essentially drive everything and Canada has little effect on the overall estimate. Consequently, it doesn’t matter that the Canadian estimate is not as precise as the US estimate.
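As a small illustration, proportional allocation just splits the overall sample in proportion to the stratum population sizes. The Python sketch below uses the round population figures from the table. The rounding rule (nearest integer) is an assumption, and in general rounded allocations may need a small adjustment to sum exactly to the overall sample size.

```python
# Approximate population sizes used in the table above.
populations = {"US": 300_000_000, "Canada": 30_000_000}
total_sample = 2000

total_pop = sum(populations.values())

# Allocate the sample in proportion to population size, rounding to whole people.
allocation = {
    name: round(total_sample * size / total_pop)
    for name, size in populations.items()
}

for name, n_h in allocation.items():
    share = populations[name] / total_pop
    print(f"{name}: {share:.1%} of population -> sample size {n_h}")
```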


3.8 Ratio estimation in SRS - improving precision with auxiliary information

An association between the measured variable of interest and a second variable of interest can be exploited to obtain more precise estimates. For example, suppose that growth in a sample plot is related to soil nitrogen content. A simple random sample of plots is selected and the height of trees in the sample plot is measured along with the soil nitrogen content in the plot. A regression model is fit (Thompson, 1992, Chapters 7 and 8) between the two variables to account for some of the variation in tree height as a function of soil nitrogen content. This can be used to make precise predictions of the mean height in stands if the soil nitrogen content can be easily measured. This method will be successful if there is a direct relationship between the two variables, and, the stronger the relationship, the better it will perform. This technique is often called ratio-estimation or regression-estimation.

Notice that multi-phase designs often use an auxiliary variable but this second variable is only measured on a subset of the sample units and should not be confused with ratio estimators in this section.

Ratio estimation has two purposes. First, in some cases, you are interested in the ratio of two variables, e.g. what is the ratio of wolves to moose in a region of the province.

Second, a strong relationship between two variables can be used to improve precision without increasing sampling effort. This is an alternative to stratification when you can measure two variables on each sampling unit.

We define the population ratio as $R = \frac{\tau_Y}{\tau_X} = \frac{\mu_Y}{\mu_X}$. Here Y is the variable of interest; X is a secondary variable not really of interest. Note that notation differs among books - some books reverse the role of X and Y.

Why is the ratio defined in this way? There are two common ratio estimators, traditionally called the mean-of-ratio and the ratio-of-mean estimators. Suppose you had the following data for Y and X which represent the counts of animals of species 1 and 2 taken on 3 different days:

    Sample    1     2     3
    Y        10   100    20
    X         3    20     1

The mean-of-ratios estimator would compute the estimated ratio between Y and X as:

$$\widehat{R}_{\text{mean-of-ratios}} = \frac{\frac{10}{3} + \frac{100}{20} + \frac{20}{1}}{3} = 9.44$$

while the ratio-of-means would be computed as:

$$\widehat{R}_{\text{ratio-of-means}} = \frac{(10 + 100 + 20)/3}{(3 + 20 + 1)/3} = \frac{10 + 100 + 20}{3 + 20 + 1} = 5.42$$
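The two estimators are easy to compare numerically. The following short Python sketch (illustrative only) computes both from the three pairs of counts above.

```python
# Counts of the two species on the three days from the small table above.
y = [10, 100, 20]
x = [3, 20, 1]

# Mean-of-ratios: form a ratio for each pair, then average the ratios.
mean_of_ratios = sum(yi / xi for yi, xi in zip(y, x)) / len(y)

# Ratio-of-means: average (or total) each variable first, then take the ratio.
ratio_of_means = sum(y) / sum(x)

print(f"mean-of-ratios = {mean_of_ratios:.2f}")
print(f"ratio-of-means = {ratio_of_means:.2f}")
```

The third pair (20 vs. 1) has a ratio of 20 and dominates the mean-of-ratios, whereas in the ratio-of-means it contributes only 20 of the 130 animals in the numerator.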

Which is ”better”?

The mean-of-ratio estimator should be used when you wish to give equal weight to each pair of numbers regardless of the magnitude of the numbers. For example, you may have three plots of land, and you measure Y and X on each plot, but because of observer efficiencies that differ among plots, the raw numbers cannot be compared. For example, on a cloudy, rainy day it is hard to see animals (first case), but on a clear, sunny day, it is easy to see animals (second case). The actual numbers themselves cannot be combined directly.

The ratio-of-means estimator (considered in this chapter) gives every value of Y and X equal weight. Here the fact that unit 2 has 10 times the number of animals as unit 1 is important as we are interested in the ratio over the entire population of animals. Hence, by adding the values of Y and X first, each animal is given equal weight.

When is a ratio estimator better - what other information is needed? The higher the correlation between Xi and Yi, the better the ratio estimator is compared to a simple expansion estimator. It turns out that the ratio estimator is the ‘best’ linear estimator if

• the relation between Yi and Xi is linear through the origin

• the variation around the regression line is proportional to the X value, i.e. the spread around the regression line increases as X increases, unlike an ordinary regression line where the spread is assumed to be constant in all parts of the line.

In practice, plot yi vs. xi from the sample and see what type of relation exists.

When can a ratio estimator be used? A ratio estimator will require that another variable (the X variable) be measured on the selected sampling units. Furthermore, if you are estimating the overall mean or total, the total value of the X-variable over the entire population must also be known. For example, as seen in the examples to come, the total area must be known to estimate the total animals once the density (animals/ha) is known.

3.8.1 Summary of Main results

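    Quantity   Population value                                    Sample estimate                                              se
    Ratio      $R = \frac{\tau_Y}{\tau_X} = \frac{\mu_Y}{\mu_X}$   $r = \frac{\bar{y}}{\bar{x}} = \frac{\sum y_i}{\sum x_i}$    $\sqrt{\frac{1}{\mu_X^2} \frac{s^2_{diff}}{n} (1-f)}$
    Total      $\tau_Y = R \tau_X$                                 $\widehat{\tau}_{ratio} = r \tau_X$                          $\tau_X \times \sqrt{\frac{1}{\mu_X^2} \frac{s^2_{diff}}{n} (1-f)}$
    Mean       $\mu_Y = R \mu_X$                                   $\widehat{\mu}_{Y,ratio} = r \mu_X$                          $\mu_X \times \sqrt{\frac{1}{\mu_X^2} \frac{s^2_{diff}}{n} (1-f)}$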

Notes

Don’t be alarmed by the apparent complexity of the formulae above. They are relatively simple to implement in spreadsheets.

• The term $s^2_{diff} = \sum_{i=1}^{n} \frac{(y_i - r x_i)^2}{n-1}$ is computed by creating a new column $y_i - r x_i$ and finding the (sample standard deviation)² of this new derived variable. This will be illustrated in the examples.

• The $\mu_X^2$ in the denominator may or may not be known; either it or its estimate $\bar{x}^2$ can be used. There doesn’t seem to be any empirical evidence that either is better.

• The term $\tau_X^2 / \mu_X^2$ reduces to $N^2$.

• Confidence intervals: Confidence limits are found in the usual fashion. In general, the distribution of the estimated ratio r is positively skewed and so the upper bound is usually too small. This skewness is caused by the variation in the denominator of the ratio. For example, suppose that a random variable (Z) has a uniform distribution between 0.5 and 1.5, centered on 1. The inverse of the random variable (i.e. 1/Z) now ranges between 0.667 and 2 - no longer symmetrical around 1. So if a symmetric confidence interval is created, the width will tend not to match the true distribution.

This skewness is not generally a problem if the sample size is at least 30 and the relative standard errors of $\bar{y}$ and $\bar{x}$ are both less than 10%.

• Sample size determination: The appropriate sample size to obtain a specified size of confidence interval can be found by inverting the formulae for the se for the ratio. This can be done on a spreadsheet using trial and error or the goal seek feature of the spreadsheet, as illustrated in the examples that follow.

3.8.2 Example - wolf/moose ratio

[This example was borrowed from Krebs, 1989, p. 208. Note that Krebs interchanges the use of x and y in the ratio.]

Wildlife ecologists interested in measuring the impact of wolf predation on moose populations in BC obtained estimates by aerial counting of the population size of wolves and moose on 11 sub-areas (all roughly equal size) selected as SRSWOR from a total of 200 sub-areas in the game management zone.

In this example, the actual ratio of wolves to moose is of interest.

Here are the raw data:

    Sub-area   Wolves   Moose
     1            8      190
     2           15      370
     3            9      460
     4           27      725
     5           14      265
     6            3       87
     7           12      410
     8           19      675
     9            7      290
    10           10      370
    11           16      510

What is the population and parameter of interest?

As in previous situations, there is some ambiguity:

• The population of interest is the 200 sub-areas in the game-management zone. The sampling units are the 11 sub-areas. The response variables are the wolf and moose populations in the game management sub-area. We are interested in the wolf/moose ratio.

• The populations of interest are the moose and wolves. If individual measurements were taken of each animal, then this definition would be fine. However, only the total number of wolves and moose within each sub-area are counted - hence a more proper description of this design would be a cluster design. As you will see in a later section, the analysis of a cluster design starts by summing to the cluster level and then treating the clusters as the population and sampling unit as is done in this case.

Having said this, do the number of moose and wolves measured on each sub-area include young moose and young wolves or just adults? How will immigration and emigration be taken care of?

What was the frame? Was it complete?

The frame consists of the 200 sub-areas of the game management zone. Presumably these 200 sub-areas cover the entire zone, but what about emigration and immigration? Moose and wolves may move into and out of the zone.

What was the sampling design?

It appears to be an SRSWOR design - the sampling units are the sub-areas of the zone.

How did they determine the counts in the sub-areas? Perhaps they simply looked for tracks in the snow in winter - it seems difficult to get estimates from the air in summer when there is lots of vegetation blocking the view.

Excel analysis

A copy of the worksheet to perform the analysis of this data is called wolf and is available in the Allof-data workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

Here is a summary shot of the spreadsheet:


Assessing conditions for a ratio estimator

The ratio estimator works well if the relationship between Y and X is linear, through the origin, with increasing variance with X. Begin by plotting Y (wolves) vs. X (moose).

The data appears to satisfy the conditions for a ratio estimator.

Compute summary statistics for both Y and X

Refer to the screen shot of the spreadsheet. The Excel built-in functions are used to compute the sample size, sample mean, and sample standard deviation for each variable.

Compute the ratio

The ratio is computed using the formula for a ratio estimator in a simple random sample, i.e.

$$r = \frac{\bar{y}}{\bar{x}}$$

Compute the difference variable

Then for each observation, the difference between the observed Y (the actual number of wolves) and the predicted Y based on the number of moose ($\widehat{Y}_i = r X_i$) is found. Notice that the sum of the differences must equal zero.

The standard deviation of the differences will be needed to compute the standard error for the estimated ratio.

Estimate the standard error of the estimated ratio

Use the formula given at the start of the section.

Final estimate

Our final result is that the estimated ratio is 0.03217 wolf/moose with an estimated se of 0.00244 wolf/moose. An approximate 95% confidence interval would be computed in the usual fashion.
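The spreadsheet steps above (ratio, difference column, standard deviation of the differences, se) can be sketched in Python as follows. This is an illustration rather than the notes' Excel or SAS implementation. The function name ratio_estimate is hypothetical, and the sample mean of X is used in place of the unknown $\mu_X$.

```python
import math
import statistics

# Counts from the 11 sampled sub-areas in the table of raw data.
wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]
N = 200  # total number of sub-areas in the game management zone


def ratio_estimate(y, x, N):
    """Return (r, s_diff, se) for the ratio-of-means estimator under SRSWOR."""
    n = len(y)
    r = sum(y) / sum(x)                               # ratio of sample means
    diffs = [yi - r * xi for yi, xi in zip(y, x)]     # observed minus predicted
    s_diff = statistics.stdev(diffs)                  # sd of the difference column
    x_bar = sum(x) / n                                # stands in for mu_X
    se = math.sqrt((1 / x_bar ** 2) * (s_diff ** 2 / n) * (1 - n / N))
    return r, s_diff, se


r, s_diff, se = ratio_estimate(wolves, moose, N)
print(f"r = {r:.5f} wolves/moose, s_diff = {s_diff:.2f}, se = {se:.5f}")
```

With these data the sketch should reproduce the values quoted from the spreadsheet (a ratio of about 0.0322 and a se of about 0.0024).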

Planning for future surveys

Our final estimate has an approximate rse of 0.00244/0.03217 = 7.5% which is pretty good. You could try different n values to see what sample size would be needed to get a rse of better than 5%, or perhaps this is too precise and you only want a rse of about 10%.

As an approximate answer, recall that the se usually varies inversely with $\sqrt{n}$. A rse of 5% is smaller by a factor of .075/.05 = 1.5, which will require an increase of $1.5^2 = 2.25$ in the sample size, or about $n_{new} = 2.25 \times 11 = 25$ units (ignoring the fpc).

If the raw data are available, you can also do a “bootstrap” selection (with replacement) to investigate the effect of sample size upon the se. For each different bootstrap sample size, estimate the ratio and the se, and then increase the sample size until the required se is obtained. This is relatively easy to do in SAS using Proc SurveySelect, which can select samples of arbitrary size. In some packages, such as JMP, sampling is without replacement so a direct sampling of 3x the observed sample size is not possible. In this case, create a pseudo-data set by pasting 19 copies of the raw data after the original data. Then use the Table→Subset→Random Sample Size to get the approximate bootstrap sample. Again compute the ratio and its se, and increase the sample size until the required precision is obtained.
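A rough Python sketch of the bootstrap idea is shown below. It is not the Proc SurveySelect or JMP procedure described above. Instead of applying the se formula to each resample, it resamples sub-areas with replacement many times and uses the standard deviation of the resulting ratios as the bootstrap se. The function name, the number of replicates, and the seed are arbitrary choices for illustration, and the fpc is ignored.

```python
import random
import statistics

wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]
moose = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]
pairs = list(zip(wolves, moose))   # keep each sub-area's counts together


def bootstrap_se(pairs, size, reps=2000, seed=1):
    """Bootstrap se of the wolf/moose ratio for a sample of `size` sub-areas."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        sample = rng.choices(pairs, k=size)          # sampling WITH replacement
        total_w = sum(w for w, m in sample)
        total_m = sum(m for w, m in sample)
        ratios.append(total_w / total_m)
    return statistics.stdev(ratios)


r = sum(wolves) / sum(moose)
for size in (11, 22, 33):
    se = bootstrap_se(pairs, size)
    print(f"n = {size:2d}: bootstrap se = {se:.5f}, rse = {se / r:.1%}")
```

Increasing the bootstrap sample size until the rse drops below the target gives an approximate answer to the planning question.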

If you want to be more precise about this, notice that the formula for the se of a ratio is found as:

$$\sqrt{\frac{1}{\mu_X^2} \frac{s^2_{diff}}{n} (1 - f)}$$

From the spreadsheet we extract various values and find that the se of the ratio is

$$\sqrt{\frac{1}{395.64^2} \frac{3.29^2}{n} \left(1 - \frac{n}{200}\right)}$$

Different values of n can be tried until the rse is 5%. This gives a sample size of about 24 units.
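The trial-and-error search can be written as a short loop. The Python sketch below (standing in for the spreadsheet's trial and error or goal seek) plugs in the rounded values quoted from the spreadsheet and increases n until the rse first drops to 5% or less.

```python
import math

# Values quoted from the spreadsheet (rounded).
r = 0.03217       # estimated wolf/moose ratio
s_diff = 3.29     # standard deviation of the differences
mean_x = 395.64   # mean number of moose per sub-area
N = 200           # total number of sub-areas
target_rse = 0.05


def rse(n):
    """Relative standard error of the ratio for a sample of n sub-areas."""
    se = math.sqrt((1 / mean_x ** 2) * (s_diff ** 2 / n) * (1 - n / N))
    return se / r


n = 2
while rse(n) > target_rse:
    n += 1

print(f"current design: n = 11, rse = {rse(11):.1%}")
print(f"smallest n with rse <= {target_rse:.0%}: {n}")
```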

If the actual raw data are not available, all is not lost. You would require the approximate MEAN of X ($\mu_X$), the standard DEVIATION of Y, the standard DEVIATION of X, the CORRELATION between Y and X, the approximate ratio (R), and the approximate number of total sample units (N). The correlation determines how closely Y can be predicted from X and essentially determines how much better you will do using a ratio estimator. If the correlation is zero, there is NO gain in precision using a ratio estimator over a simple mean.

The se of r is then found as:

$$se(r) = \sqrt{\frac{1}{\mu_X^2} \frac{V(y) + R^2 V(x) - 2 R \, corr(y,x) \sqrt{V(x) V(y)}}{n} \left(1 - \frac{n}{N}\right)}$$

Different values of n can be tried to obtain the desired rse. This is again illustrated on the spreadsheet.
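To show that the summary-statistic form gives the same answer as the difference-column form, the following Python sketch computes the variances and correlation from the wolf/moose data and substitutes them into the formula above. In a real planning exercise these summaries would be approximate values from a pilot study or the literature. The function name se_from_summary is hypothetical.

```python
import math
import statistics

wolves = [8, 15, 9, 27, 14, 3, 12, 19, 7, 10, 16]                  # Y
moose = [190, 370, 460, 725, 265, 87, 410, 675, 290, 370, 510]     # X
N = 200
n = len(wolves)

# Summary statistics that would be needed if the raw data were unavailable.
mean_y = statistics.mean(wolves)
mean_x = statistics.mean(moose)
v_y = statistics.variance(wolves)
v_x = statistics.variance(moose)
cov = sum((y - mean_y) * (x - mean_x) for y, x in zip(wolves, moose)) / (n - 1)
corr = cov / math.sqrt(v_x * v_y)
R = mean_y / mean_x


def se_from_summary(n, N, mean_x, v_y, v_x, corr, R):
    """se of the ratio computed from summary statistics only."""
    s2_diff = v_y + R ** 2 * v_x - 2 * R * corr * math.sqrt(v_x * v_y)
    return math.sqrt((1 / mean_x ** 2) * (s2_diff / n) * (1 - n / N))


se_summary = se_from_summary(n, N, mean_x, v_y, v_x, corr, R)
print(f"corr = {corr:.3f}, se(r) = {se_summary:.5f}")
```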

SAS Analysis

The above computations can also be done in SAS with the program wolf.sas available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets. It uses Proc SurveyMeans which gives the output contained in wolf.lst.

The SAS program again starts with the DATA step to read in the data.

    data wolf;
       infile 'wolf.csv' dlm=',' dsd missover firstobs=2;
       input subregion wolf moose;
    run;

Because the sampling weights are equal for all observations, it is not necessary to include them when estimating a ratio (the weights cancel out in the formula used by SAS).

Proc SGplot procedure creates the plot similar to that in the Excel spreadsheet.

    proc sgplot data=wolf;
       title2 'plot to assess assumptions';
       scatter x=wolf y=moose;
    run;

giving:

There appears to be a linear relationship between the two variables that goes through the origin, which is the condition under which a ratio estimator is sensible.

Finally, the Proc SurveyMeans procedure does the actual computation:

    proc surveymeans data=wolf ratio clm N=200;
       title2 'Estimate of wolf to moose ratio';
       /* ratio clm - request a ratio estimator with confidence intervals */
       /* N=200 specifies total number of units in the population */
       var moose wolf;
       ratio wolf/moose;   /* this statement asks for the ratio estimator */
       ods output statistics=wolfresults;
       ods output Ratio =wolfratio;
    run;

Estimates are obtained for each variable:

    VariableName    Mean  SEMean  LCLMean  UCLMean
    moose          395.6    56.5    269.8    521.4
    wolf            12.7     1.9      8.4     17.0

The RATIO statement in the SurveyMeans procedure requests the computation of the ratio estimator.Here is the output:

    NumeratorVariable  DenominatorVariable  Ratio     LowerCL     StdErr    UpperCL
    wolf               moose                0.032169  0.02673676  0.002438  0.03760148

The results are identical to those from the spreadsheet.

Again, it is easier to do planning in the Excel spreadsheet rather than in the SAS program.

CAUTION. Ordinary regression estimation from standard statistical packages provides only an APPROXIMATION to the correct analysis of survey data. There are three problems in using standard statistical packages for regression and ratio estimation of survey data:

• Assumes a simple random sample. If your data is NOT collected using a simple random sample, then ordinary regression methods should NOT be used.

• Unable to use a finite population correction factor. This is usually not a problem unless the sample size is large relative to the population size.

• Wrong error structure. Standard regression analyses assume that the variance around the regression or ratio line is constant. In many survey problems this is not true. This can be partially alleviated through the use of weighted regression, but this still does not completely fix the problem. For further information about the problems of using standard statistical software packages in survey sampling please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

Using ordinary regression

Because the ratio estimator assumes that the variance of the response increases with the value of X, a new column representing the inverse of the X variable (i.e. 1/the number of moose) has been created.

We start by plotting the data to assess if the relationship is linear and through the origin. The Y variable is the number of wolves; the X variable is the number of moose. If the relationship is not through the origin, then a more complex analysis called Regression estimation is required.

The graph looks like it is linear through the origin which is one of the assumptions of the ratio estimator.

Now we wish to fit a straight line THROUGH THE ORIGIN. By default, most computer packages include the intercept which we want to force to zero. We must also specify that the inverse of the X variable (1/X) is the weighting variable.

We see that the estimated ratio (.032 wolves/moose) matches the Excel output, but the estimated standard error (.0026) does not quite match Excel. The difference is a bit larger than can be accounted for by not using the finite population correction factor.

As a matter of interest, if you repeat the analysis WITHOUT using the inverse of the X variable as the weighting variable, you obtain an estimated ratio of .0317 (se .0022). All of these estimates are similar and it likely makes very little difference which is used.

Finding the required sample size is trickier because of the weighted regression approach used by the packages, the slightly different way the se is computed, and the lack of a fpc. The latter two issues are usually not important in determining the approximate sample size, but the first issue is crucial.

Start by REFITTING Y vs. X WITHOUT using the weighting variable. This will give you roughly the same estimate and se, but now it is much easier to extract the necessary information for sample size determination.

When the UNWEIGHTED model is fit, you will see that the Root Mean Square Error has the value of 3.28. This is the value of $s_{diff}$ that is needed. The approximate se for r (ignoring the fpc) is

$$se(r) \approx \frac{s_{diff}}{\mu_x \sqrt{n}} \approx \frac{3.28}{395.64 \sqrt{n}}$$

Again, different values of n can be tried to get the appropriate rse. This gives an n of about 25 or 26, which is sufficient for planning purposes.

Post mortem

No population numbers can be estimated using the ratio estimator in this case because of a lack of suitable data.

In particular, if you had wanted to estimate the total wolf population, you would have to use the simple inflation estimator that we discussed earlier unless you had some way of obtaining the total number of moose that are present in the ENTIRE management zone. This seems unlikely.

However, refer to the next example, where the appropriate information is available.

3.8.3 Example - Grouse numbers - using a ratio estimator to estimate a popula-tion total

In some cases, a ratio estimator is used to estimate a population total. In these cases, the improvement inprecision is caused by the close relationship between two variables.


Note that the population total of the auxiliary variable will have to be known in order to use this method.

Grouse Numbers

A wildlife biologist has estimated the grouse population in a region containing isolated areas (called pockets) of bush as follows: She selected 12 pockets of bush at random, and attempted to count the numbers of grouse in each of these. (One can assume that the grouse are almost all found in the bush, and for the purpose of this question, that the counts were perfectly accurate.) The total number of pockets of bush in the region is 248, comprising a total area of 3015 hectares. Results are as follows:

Area (ha)   Number of grouse
   8.9            24
   2.7             3
   6.6            10
  20.6            36
   3.7             8
   4.1             8
  25.8            60
   1.8             5
  20.1            35
  14.0            34
  10.1            18
   8.0            22

The data is available in the grouse.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets. The SAS program starts in the usual fashion to read the data:

    data grouse;
       infile 'grouse.csv' dlm=',' dsd missover firstobs=2;
       input area grouse; /* sampling weights not needed */
    run;

The first few lines of the raw data are shown below:

Obs   area   grouse   _TYPE_   _FREQ_    n   sampweight
  1    8.9       24        0       12   12      20.6667
  2    2.7        3        0       12   12      20.6667
  3    6.6       10        0       12   12      20.6667
  4   20.6       36        0       12   12      20.6667
  5    3.7        8        0       12   12      20.6667
  6    4.1        8        0       12   12      20.6667
  7   25.8       60        0       12   12      20.6667
  8    1.8        5        0       12   12      20.6667
  9   20.1       35        0       12   12      20.6667
 10   14.0       34        0       12   12      20.6667
 11   10.1       18        0       12   12      20.6667
 12    8.0       22        0       12   12      20.6667

What is the population of interest and parameter to be estimated?

As before, there is some ambiguity:

• The population of interest is the pockets of bush in the region. The sampling unit is the pocket of bush. The number of grouse in each pocket is the response variable.

• The population of interest is the grouse. These happen to be clustered into pockets of bush. This leads back to the previous case.

What is the frame?

Here the frame is explicit - the set of all pockets of bush. It isn’t clear if all grouse will be found in these pockets - will some be itinerant and hence missed? What about movement between looking at the pockets of bush?

Summary statistics

Variable    n    mean    std dev
area       12   10.53       7.91
grouse     12   21.92      16.95

Simple inflation estimator ignoring the pocket areas

Proc SurveyMeans can be used to compute the simple inflation estimator based on the pockets surveyed, ignoring any relationship between the number of grouse and the area of the pockets. Don’t forget that in SAS we need to provide the survey weight, which in the case of a simple random sample is the inverse of the sampling fraction.

    proc means data=grouse noprint;
       var grouse;
       output out=sampsize n=n;
    run;

    data grouse; /* get the survey weights */
       set grouse;
       one=1;
       set sampsize point=one;
       sampweight = 248 / n;
    run;

    proc surveymeans data=grouse mean sum clm clsum N=248;
       title2 'Estimation using a simple expansion estimator';
       var grouse;
       weight sampweight;
       ods output statistics=grousesimpleresults;
    run;

giving:

Variable    Mean   SE Mean  LCL Mean  UCL Mean       Sum    SE Sum   LCL Sum   UCL Sum
grouse    21.917     4.772    11.413    32.420  5435.333  1183.488  2830.493  8040.173

If we wish to adjust for the sampling fraction, we can use our earlier results for the simple inflation estimator. Our estimate of the total number of grouse is

$$\hat{\tau} = N\bar{y} = 248 \times 21.92 = 5435.33$$

with an estimated se of

$$se(\hat{\tau}) = N \times \sqrt{\frac{s^2}{n}(1-f)} = 248 \times \sqrt{\frac{16.95^2}{12}\left(1-\frac{12}{248}\right)} = 1183.4.$$

The estimate isn’t very precise with an rse of 1183.4/5435.3 = 22%.
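The hand computation of the simple inflation estimator can be checked with a short Python sketch using the twelve grouse counts from the table above. The variable names are chosen for this illustration only; the formulas are the ones quoted in the text.

```python
import math
import statistics

# Grouse counts in the 12 sampled pockets (from the data table above).
grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
N = 248                      # number of pockets in the region
n = len(grouse)
f = n / N                    # sampling fraction

ybar = statistics.mean(grouse)
s = statistics.stdev(grouse)             # sample standard deviation (n-1 divisor)

total = N * ybar                                    # simple inflation estimate
se_total = N * math.sqrt(s**2 / n * (1 - f))        # includes the fpc
rse = se_total / total

print(f"total = {total:.1f}, se = {se_total:.1f}, rse = {rse:.1%}")
```

The numbers should agree with the Proc SurveyMeans output shown earlier, up to rounding.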

Ratio estimator - why?

Why did the inflation estimator do so poorly? Part of the reason is the relatively large standard deviation in the number of grouse in the pockets. Why does this number vary so much?

It seems reasonable that larger pockets of bush will tend to have more grouse. Perhaps we can do better by using the relationship between the area of the bush and the number of grouse through a ratio estimator.

Excel analysis

An Excel worksheet is available in the grouse tab in the AllofData workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

Preliminary plot to assess if ratio estimator will work

First plot numbers of grouse vs. area and see if this has a chance of succeeding.


The graph shows a linear relationship, through the origin. There is some evidence that the variance is increasing with X (area of the plot).

The spreadsheet is set up similarly to the previous example:


The total of the X variable (area) will need to be known.

As before, you find summary statistics for X and Y, compute the ratio estimate, find the difference variables, find the standard deviation of the difference variable, and find the se of the estimated ratio.

The estimated ratio is: $r = \bar{y}/\bar{x} = 21.92/10.53 = 2.081$ grouse/ha.

The se of r is found as

$$se(r) = \sqrt{\frac{1}{\bar{x}^2} \times \frac{s^2_{diff}}{n} \times (1-f)} = \sqrt{\frac{1}{10.533^2} \times \frac{4.7464^2}{12} \times \left(1-\frac{12}{248}\right)} = 0.1269$$

grouse/ha.

In order to estimate the population total of Y, you now multiply the estimated ratio by the population total of X. We know the pockets cover 3015 ha, and so the estimated total number of grouse is found by $\hat{\tau}_Y = \tau_X \times r = 3015 \times 2.081 = 6273.3$ grouse.

To estimate the se of the total, multiply the se of r by 3015 as well: $se(\hat{\tau}_Y) = \tau_X \times se(r) = 3015 \times 0.1269 = 382.6$ grouse.

The precision is much improved compared to the simple inflation estimator. This improvement is due to the very strong relationship between the number of grouse and the area of the pockets.
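The spreadsheet steps (ratio, difference variable, its standard deviation, se of the ratio, expansion to a total) can be mirrored in Python as a rough check. This is a sketch using the twelve sampled pockets listed earlier; the variable names are illustrative and not part of the course software.

```python
import math
import statistics

# Sampled pockets: area (ha) and grouse counts, from the data table above.
area = [8.9, 2.7, 6.6, 20.6, 3.7, 4.1, 25.8, 1.8, 20.1, 14.0, 10.1, 8.0]
grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]
N = 248          # pockets in the region
TAU_X = 3015.0   # total area (ha) of all pockets
n = len(area)
f = n / N

r = sum(grouse) / sum(area)                       # estimated grouse per ha
diff = [y - r * x for x, y in zip(area, grouse)]  # difference variable
s_diff = statistics.stdev(diff)
xbar = sum(area) / n

se_r = math.sqrt(s_diff**2 / n * (1 - f)) / xbar  # se of the ratio
total = TAU_X * r                                 # estimated total grouse
se_total = TAU_X * se_r

print(f"r = {r:.3f} (se {se_r:.4f}), total = {total:.1f} (se {se_total:.1f})")
```

Comparing se_total here with the se of the simple inflation estimator shows the gain in precision from using the pocket areas.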

CAUTION. Ordinary regression estimation from standard statistical packages provides only an APPROXIMATION to the correct analysis of survey data. It is tempting to use ordinary regression methods to compute the ratio and then expand this ratio to estimate the total. There are three problems in using standard statistical packages for regression and ratio estimation of survey data:

• Assumes that a simple random sample was taken. If the sampling design is not a simple random sample, then regular regression cannot be used.

• Unable to use a finite population correction factor. This is usually not a problem unless the sample size is large relative to the population size.

• Wrong error structure. Standard regression analyses assume that the variance around the regression or ratio line is constant. In many survey problems this is not true. This can be partially alleviated through the use of weighted regression, but this still does not completely fix the problem. For further information about the problems of using standard statistical software packages in survey sampling please refer to the article at http://www.fas.harvard.edu/~stats/survey-soft/donna_brogan.html.

However, in a simple random sample, the correct analysis can be created using ordinary regression if a weighted regression is used. Because the ratio estimator assumes that the variance of the response increases with the value of X, a new column representing the inverse of the X variable (i.e. 1/area of pocket) has been created. This is the method that is used when JMP is used to analyze the data.

Because SAS and R have procedures for the analysis of survey sampling, it is not necessary to do the weighted regression. Nor is it necessary to include a computation of the sampling weight if the data are collected in a simple random sample for a ratio estimator - the weights will cancel out when these packages are used.
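To see why weighting by 1/X reproduces the ratio estimator, a small Python sketch can compare the weighted least squares slope of a line through the origin with the ratio of the sample totals. The closed-form slope used here is the standard weighted least squares solution for a no-intercept model; it is written out by hand rather than taken from any particular package.

```python
# Sampled pockets: area (ha) and grouse counts, from the data table above.
area = [8.9, 2.7, 6.6, 20.6, 3.7, 4.1, 25.8, 1.8, 20.1, 14.0, 10.1, 8.0]
grouse = [24, 3, 10, 36, 8, 8, 60, 5, 35, 34, 18, 22]

# Weighted least squares through the origin with weights w = 1/x:
# slope = sum(w*x*y) / sum(w*x*x), which simplifies to sum(y) / sum(x).
weights = [1.0 / x for x in area]
slope_wls = (sum(w * x * y for w, x, y in zip(weights, area, grouse))
             / sum(w * x * x for w, x in zip(weights, area)))

# Unweighted least squares through the origin, for comparison.
slope_ols = (sum(x * y for x, y in zip(area, grouse))
             / sum(x * x for x in area))

ratio = sum(grouse) / sum(area)

print(f"weighted slope = {slope_wls:.4f}, ratio = {ratio:.4f}, "
      f"unweighted slope = {slope_ols:.4f}")
```

The weighted slope matches the ratio estimate, while the unweighted slope is close but not identical, consistent with the remark that the different fits give similar answers.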

SAS Analysis

Proc SGplot creates the standard plot of numbers of grouse vs. the area of each pocket.

    proc sgplot data=grouse;
       title2 'plot to assess assumptions';
       scatter y=grouse x=area;
    run;

giving:

The relationship between the number of grouse and the pocket area is linear through the origin - the conditions under which a ratio estimator will perform well.

The SurveyMeans procedure can estimate the ratio of grouse/ha but cannot directly estimate the population total.

    proc surveymeans data=grouse ratio clm N=248;
       /* the ratio clm keywords request a ratio estimator and a confidence interval. */
       title2 'Estimation using a ratio estimator';
       var grouse area;
       ratio grouse / area;
       ods output statistics=grouseresults;
       ods output ratio=grouseratio; /* extract information so that total can be estimated */
    run;

The ODS statement redirects the results from the RATIO statement to a new dataset that is processed further to multiply by the total area of the pockets.

    data outratio;
       /* compute estimates of the total */
       set grouseratio;
       Est_total = ratio * 3015;
       Se_total = stderr * 3015;
       UCL_total = uppercl * 3015;
       LCL_total = lowercl * 3015;
       format est_total se_total ucl_total lcl_total 7.1;
       format ratio stderr lowercl uppercl 7.3;
    run;

The output is as follows:

Numerator Variable   Denominator Variable      Ratio     LowerCL     StdErr     UpperCL
grouse               area                   2.080696  1.80140636   0.126893  2.35998605

Obs   Ratio   StdErr   LowerCL   UpperCL   Est_total   Se_total   LCL_total   UCL_total
  1   2.081    0.127     1.801     2.360      6273.3      382.6      5431.2      7115.4

The results are exactly the same as computed using Excel.

Again, it is easiest to do the sample size computations in Excel.

The ratio estimator is much more precise than the inflation estimator because of the strong relationship between the number of grouse and the area of the pocket.

Sample size for future surveys

If you wish to investigate different sample sizes, the simplest way would be to modify the cell corresponding to the count of the differences. This will be left as an exercise for the reader.

The final ratio estimate has an rse of about 6% - quite good. It is relatively straightforward to investigate the sample size needed for a 5% rse. We find this to be about 17 pockets.
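For readers without the spreadsheet at hand, the same search can be sketched in Python. The sketch assumes that the planning values from the current survey (r = 2.0807, s_diff = 4.7464, mean area = 10.533 ha) carry over to a future survey, and it keeps the fpc with N = 248; the function name rse_ratio is made up for this illustration.

```python
import math

def rse_ratio(n, r, s_diff, xbar, N):
    """Relative se of the ratio estimator for a sample of n units out of N."""
    se = math.sqrt(s_diff**2 / n * (1 - n / N)) / xbar
    return se / r

# Planning values taken from the grouse analysis above.
R, S_DIFF, XBAR, N = 2.0807, 4.7464, 10.533, 248

for n in range(12, 21):
    print(n, f"{rse_ratio(n, R, S_DIFF, XBAR, N):.2%}")

# Smallest n whose rse is at or below the 5% target.
n_needed = next(n for n in range(2, N + 1)
                if rse_ratio(n, R, S_DIFF, XBAR, N) <= 0.05)
print("smallest n with rse <= 5%:", n_needed)
```

The rse crosses 5% between 17 and 18 pockets, in line with the figure of about 17 quoted above; insisting on an rse no larger than 5% rounds up to the next whole pocket.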

Post mortem - a question to ponder

What if it were to turn out that grouse population size tended to be proportional to the perimeter of a pocket of bush rather than its area? Would using the above ratio estimator based on a relationship with area introduce serious bias into the ratio estimate, increase the standard error of the ratio estimate, or do both?

3.9 Additional ways to improve precision

This section will not be examined on the exams or term tests


3.9.1 Using both stratification and auxiliary variables

It is possible to use both methods to improve precision. However, this comes at a cost of increased computational complexity.

There are two ways of combining ratio estimators in stratified simple random sampling.

1. combined ratio estimate: Estimate the numerator and denominator using stratified random sampling and then form the ratio of these two estimates:

$$r_{stratified,combined} = \frac{\hat{\mu}_{Y,stratified}}{\hat{\mu}_{X,stratified}}$$

and

$$\hat{\tau}_{Y,stratified,combined} = \frac{\hat{\mu}_{Y,stratified}}{\hat{\mu}_{X,stratified}} \tau_X$$

We won’t consider the estimates of the se in this course, but they can be found in any textbook on sampling.

2. separate ratio estimator: make a ratio total for each stratum, and form a grand ratio by taking a weighted average of these estimates. Note that we weight by the covariate total rather than the stratum sizes. We get the following estimators for the grand ratio and grand total:

$$r_{stratified,separate} = \frac{1}{\tau_X} \sum_{h=1}^{H} \tau_{Xh} r_h$$

and

$$\hat{\tau}_{Y,stratified,separate} = \sum_{h=1}^{H} \tau_{Xh} r_h$$

Again, we won’t worry about the estimates of the se.
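The difference between the two estimators may be easier to see with numbers. The following Python sketch uses two entirely hypothetical strata (the stratum sizes, covariate totals and sample values are invented for illustration) and computes only the point estimates, not their standard errors.

```python
# Hypothetical strata: N = stratum size, tau_x = known covariate total,
# x and y = sampled covariate and response values (all values invented).
strata = [
    {"N": 100, "tau_x": 1100.0, "x": [8.0, 10.0, 12.0], "y": [16.0, 21.0, 23.0]},
    {"N": 50,  "tau_x": 1900.0, "x": [35.0, 40.0, 45.0], "y": [100.0, 125.0, 135.0]},
]

N_total = sum(s["N"] for s in strata)
tau_x = sum(s["tau_x"] for s in strata)

def mean(values):
    return sum(values) / len(values)

# Combined: stratified estimates of the two means, then a single ratio.
mu_y_strat = sum(s["N"] * mean(s["y"]) for s in strata) / N_total
mu_x_strat = sum(s["N"] * mean(s["x"]) for s in strata) / N_total
r_combined = mu_y_strat / mu_x_strat
total_combined = r_combined * tau_x

# Separate: one ratio per stratum, weighted by the stratum covariate totals.
total_separate = sum(s["tau_x"] * sum(s["y"]) / sum(s["x"]) for s in strata)
r_separate = total_separate / tau_x

print(f"combined: r = {r_combined:.3f}, total = {total_combined:.1f}")
print(f"separate: r = {r_separate:.3f}, total = {total_separate:.1f}")
```

Note that the separate estimator needs tau_x for every stratum, whereas the combined estimator only uses their sum.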

Why use one over the other?

• You need the stratum totals for the separate estimate, but only the population total for the combined estimate.

• The combined ratio is less subject to risk of bias (see Cochran, p. 165 and following). In general, the biases in the separate estimator are added together, and if they fall in the same direction, this can cause trouble. In the combined estimator these biases are reduced through stratification for numerator and denominator.

• When the ratio estimate is appropriate (regression through the origin and variance proportional to covariate), the last term vanishes. Consequently, the combined ratio estimator will have greater standard error than the separate ratio estimator unless R is relatively constant from stratum to stratum. However, see above, the bias may be more severe for the separate ratio estimator. You must consider the combined effects of bias and precision, i.e. MSE.

3.9.2 Regression Estimators

A ratio estimator works well when the relationship between Yi and Xi is linear, through the origin, with the variance of observations about the ratio line increasing with X. In some cases, the relationship may be linear, but not through the origin.

In these cases, the ratio estimator is generalized to a regression estimator where the linear relationship is no longer constrained to go through the origin.

We won’t be covering this in this course.

Regression estimators are also useful if there is more than one X variable.

Whenever you use a regression estimator, be sure to plot y vs. x to assess if the assumptions for a ratio estimator are reasonable.

CAUTION: If ordinary statistical packages are used to do regression analysis on survey data, you could obtain misleading results because the usual packages ignore the way in which the data were collected. Virtually all standard regression packages assume you’ve collected data under a simple random sample. If your sampling design is more complex, e.g. stratified design, cluster design, multi-stage design, etc., then you should use a package specifically designed for the analysis of survey data, e.g. SAS and the Proc SurveyReg procedure.

3.9.3 Sampling with unequal probability - pps sampling

All of the designs discussed in previous sections have assumed that each sample unit was selected with equal probability. In some cases, it is advantageous to select units with unequal probabilities, particularly if they differ in their contribution to the overall total. This technique can be used with any of the sampling designs discussed earlier. An unequal probability sampling design can lead to smaller standard errors (i.e. better precision) for the same total effort compared to an equal probability design. For example, forest stands may be selected with probability proportional to the area of the stand (i.e. a stand of 200 ha will be selected with twice the probability of a stand of 100 ha) because large stands contribute more to the overall population and it would be wasteful of sampling effort to spend much effort on smaller stands.
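As a rough illustration of selection with probability proportional to size, the following Python sketch draws stands with probability proportional to area. The stand labels and areas are hypothetical, and the draw is made WITH replacement purely to keep the sketch short; a real pps design (and its analysis) involves further choices not shown here.

```python
import random

# Hypothetical forest stands and their areas in hectares.
stand_area = {"A": 200.0, "B": 100.0, "C": 50.0, "D": 150.0}

total_area = sum(stand_area.values())
# Selection probability of each stand on a single draw.
prob = {stand: a / total_area for stand, a in stand_area.items()}

rng = random.Random(2019)   # fixed seed so the draw can be repeated
# Draw 2 stands with probability proportional to area (with replacement).
sample = rng.choices(list(stand_area), weights=list(stand_area.values()), k=2)

for stand, p in prob.items():
    print(stand, round(p, 3))
print("selected:", sample)
```

Stand A (200 ha) has exactly twice the selection probability of stand B (100 ha), as described above.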

The variable used to assign the probabilities of selection to individual study units does not need to have an exact relationship with an individual unit’s contribution to the total. For example, in probability proportional to prediction (3P sampling), all trees in a small area are visited. A simple, cheap characteristic is measured which is used to predict the value of the tree. A sub-sample of the trees is then selected with probability proportional to the predicted value, remeasured using a more expensive measuring device, and the relationship between the cheap and expensive measurement in the second phase is used with the simple measurement from the first phase to obtain a more precise estimate for the entire area. This is an example of two-phase sampling with unequal probability of selection.

Please consult with a sampling expert before implementing or analyzing an unequal probability sampling design.

3.10 Cluster sampling

In some cases, units in a population occur naturally in groups or clusters. For example, some animals congregate in herds or family units. It is often convenient to select a random sample of herds and then measure every animal in the herd. This is not the same as a simple random sample of animals because individual animals are not randomly selected; the herds are the sampling unit. The strip-transect example in the section on simple random sampling is also a cluster sample; all plots along a randomly selected transect are measured. The strips are the sampling units, while plots within each strip are sub-sampling units. Another example is circular plot sampling; all trees within a specified radius of a randomly selected point are measured. The sampling unit is the circular plot while trees within the plot are sub-samples.

Some examples of cluster samples are:

• urchin estimation - transects are taken perpendicular to the shore and a diver swims along the transect and counts the number of urchins in each m² along the line.

• aerial surveys - a plane flies along a line and observers count the number of animals they see in a strip on both sides of the aircraft.

• forestry surveys - often circular plots are located on the ground and ALL trees within that plot are measured.

Pitfall A cluster sample is often mistakenly analyzed using methods for simple random surveys. This is not valid because units within a cluster are typically positively correlated. The effect of this erroneous analysis is to come up with an estimate that appears to be more precise than it really is, i.e. the estimated standard error is too small and does not fully reflect the actual imprecision in the estimate.

Solution: You will be pleased to know that, in fact, you already know how to design and analyze cluster samples! The proper analysis treats the clusters as a random sample from the population of clusters, i.e. treat the cluster as a whole as the sampling unit, and deal only with the cluster total as the response measure.

3.10.1 Sampling plan

In simple random sampling, a frame of all elements was required in order to draw a random sample. Individual units are selected one at a time. In many cases, this is impractical because it may not be possible to list all of the individual units or may be logistically impossible to do this. In many cases, the individual units appear together in clusters. This is particularly true if the sampling unit is a transect - almost always you measure things on an individual quadrat level, but the actual sampling unit is the cluster.

This problem is analogous to pseudo-replication in experimental design - the breaking of the transect into individual quadrats is like having multiple fish within the tank.

A visual comparison of a simple random sample vs. a cluster sample

You may find it useful to compare a simple random sample of 24 vs. a cluster sample of 24 using the following visual plans:

Select a sample of 24 in each case.


Simple Random Sampling

Describe how the sample was taken.

Cluster Sampling

First, the clusters must be defined. In this case, the units are naturally clustered in blocks of size 8. The following units were selected.


Describe how the sample was taken. Note the differences between stratified simple random sampling and cluster sampling!

3.10.2 Advantages and disadvantages of cluster sampling compared to SRS

• Advantage It may not be feasible to construct a frame for every elemental unit, but possible to construct a frame for larger units, e.g. it is difficult to locate individual quadrats upon the sea floor, but easy to lay out transects from the shore.

• Advantage Cluster sampling is often more economical. Because all units within a cluster are close together, travel costs are much reduced.

• Disadvantage Cluster sampling has a higher standard error than an SRSWOR of the same total size because units are typically homogeneous within clusters. The cluster itself serves as the sampling unit. For the same number of units, cluster sampling almost always gives worse precision. This is the problem that we have seen earlier of pseudo-replication.

• Disadvantage A cluster sample is more difficult to analyze, but with modern computing equipment, this is less of a concern. The difficulties are not arithmetic but rather being forced to treat the clusters as the survey unit - there is a natural tendency to think that data are being thrown away.

The perils of ignoring a cluster design The cluster design is frequently used in practice, but often analyzed incorrectly. For example, whenever the quadrats have been gathered using a transect of some sort, you have a cluster sampling design. The key thing to note is that the sampling unit is a cluster, not the individual quadrats.

The biggest danger of ignoring the clustering aspects and treating the individual quadrats as if they came from an SRS is that, typically, your reported se will be too small. That is, the true standard error from your design may be substantially larger than your estimated standard error obtained from an SRS analysis. The precision is (erroneously) thought to be far better than is justified based on the survey results. This has been seen before - refer to the paper by Underwood where the dangers of estimation with positively correlated data were discussed.
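A small numerical illustration may help. The Python sketch below uses invented counts for four transects of four quadrats each, chosen so that quadrats within a transect are similar, and compares the se from a (wrong) SRS analysis of the 16 quadrats with the se from an analysis of the four cluster totals. No fpc is applied in either case.

```python
import math
import statistics

# Hypothetical counts: 4 transects (clusters) of 4 quadrats each.
clusters = [
    [0, 0, 1, 0],
    [5, 6, 5, 4],
    [2, 2, 3, 2],
    [9, 8, 9, 10],
]

# WRONG: treat all 16 quadrats as a simple random sample.
quadrats = [count for cluster in clusters for count in cluster]
se_naive = statistics.stdev(quadrats) / math.sqrt(len(quadrats))

# Cluster analysis: work with the cluster totals and cluster sizes.
totals = [sum(cluster) for cluster in clusters]
sizes = [len(cluster) for cluster in clusters]
n = len(clusters)
mean_per_quadrat = sum(totals) / sum(sizes)
diff = [y - mean_per_quadrat * m for y, m in zip(totals, sizes)]
mbar = sum(sizes) / n
se_cluster = math.sqrt(statistics.stdev(diff)**2 / n) / mbar

print(f"mean = {mean_per_quadrat:.3f}")
print(f"naive SRS se = {se_naive:.3f}, cluster se = {se_cluster:.3f}")
```

With these made-up data the naive se is less than half of the cluster-based se, which is the pattern of overstated precision described above.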

3.10.3 Notation

The key thing to remember is to work with the cluster TOTALS.

Traditionally, the cluster size is denoted by M rather than by X, but as you will see in a few moments, estimation in cluster sampling is nothing more than ratio estimation performed on the cluster totals.

Attribute            Population value   Sample value
Number of clusters   N                  n
Cluster totals       τi                 yi             NOTE: τi and yi are the cluster i TOTALS
Cluster sizes        Mi                 mi
Total area           M

3.10.4 Summary of main results

The key concept in cluster sampling is to treat the cluster TOTAL as the response variable and ignore all the individual values within the cluster. Because the clusters are a simple random sample from the population of clusters, simply apply all the results you had before for a SRS to the CLUSTER TOTALS.

The analysis of a cluster design will require the size of each cluster - this is simply the number of sub-units within each cluster.

If the clusters are roughly equal in size, a simple inflation estimator can be used.

But, in many cases, there is a strong relationship between the size of the cluster and the cluster total - in these cases a ratio estimator would likely be more suitable (i.e. will give you a smaller standard error), where the X variable is the cluster size. If there is no relationship between cluster size and the cluster total, a simple inflation estimator can be used as well even in the case of unequal cluster sizes.

You should do a preliminary plot of the cluster totals against the cluster sizes to see if this relationship holds.

Extensions of cluster analysis - unequal size sampling In some cases, the clusters are of quite unequal sizes. A better design choice may be to select clusters with an unequal probability design rather than using a simple random sample. In this case, clusters that are larger typically contribute more to the population total, and would be selected with a higher probability.

Computational formulae

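Parameter: Overall mean
  Population value: $\mu = \frac{\sum_{i=1}^{N} \tau_i}{\sum_{i=1}^{N} M_i}$
  Estimator: $\hat{\mu} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}$
  Estimated se: $\sqrt{\frac{1}{\bar{m}^2} \frac{s^2_{diff}}{n} (1-f)}$

Parameter: Overall total
  Population value: $\tau = M \times \mu$
  Estimator: $\hat{\tau} = M \times \hat{\mu}$
  Estimated se: $\sqrt{M^2 \times \frac{1}{\bar{m}^2} \frac{s^2_{diff}}{n} (1-f)}$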

• You never use the mean per unit within a cluster.

• The term $s^2_{diff} = \frac{\sum_{i=1}^{n} (y_i - \hat{\mu} m_i)^2}{n-1}$ is again found in the same fashion as in ratio estimation - create a new variable which is the difference $y_i - \hat{\mu} m_i$, find its sample standard deviation, and then square the standard deviation.

• Sometimes the ratio of two variables measured within each cluster is required, e.g. you conduct aerial surveys to estimate the ratio of wolves to moose - this has already been done in an earlier example! In these cases, the actual cluster length is not used.
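The computational formulae can be collected into one small Python function. This is a sketch only: the function name cluster_estimates and the example data (five clusters of unequal size out of N = 50, with M = 800 sub-units in the population) are invented for illustration.

```python
import math
import statistics

def cluster_estimates(totals, sizes, N, M):
    """Mean per sub-unit and population total from an SRS of clusters.

    totals, sizes: cluster totals y_i and cluster sizes m_i in the sample
    N: number of clusters in the population
    M: total number of sub-units in the population
    """
    n = len(totals)
    f = n / N
    mu_hat = sum(totals) / sum(sizes)
    diff = [y - mu_hat * m for y, m in zip(totals, sizes)]
    s2_diff = statistics.stdev(diff) ** 2
    mbar = sum(sizes) / n
    se_mu = math.sqrt(s2_diff / n * (1 - f)) / mbar
    return mu_hat, se_mu, M * mu_hat, M * se_mu

# Hypothetical sample of 5 clusters of unequal size.
y = [12, 30, 18, 45, 9]
m = [10, 20, 15, 25, 10]
mu_hat, se_mu, tau_hat, se_tau = cluster_estimates(y, m, N=50, M=800)
print(f"mean = {mu_hat:.3f} (se {se_mu:.3f}), total = {tau_hat:.0f} (se {se_tau:.1f})")
```

Only the cluster totals and sizes enter the computation; the individual values within a cluster are never used.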

Confidence intervals

As before, once you have an estimator for the mean and for the se, use the usual ±2se rule. If the number of clusters is small, then some text books advise using a t-distribution for the multiplier - this is not covered in this course.

Sample size determination

Again, this is no real problem - except that you will get a value for the number of CLUSTERS, not the individual quadrats within the clusters.

3.10.5 Example - estimating the density of urchins

Red sea urchins are considered a delicacy and the fishery is worth several millions of dollars to British Columbia.

In order to set harvest quotas and in order to monitor the stock, it is important that the density of sea urchins be determined each year.

To do this, the managers lay out a number of transects perpendicular to the shore in the urchin beds. Divers then swim along the transect, and roll a 1 m² quadrat along the transect line and count the number of legal sized and sub-legal sized urchins in the quadrat.

The number of possible transects is so large that the correction for finite population sampling can be ignored.

SAS v.8 has procedures for the analysis of survey data taken in a cluster design. A program to analyze the data is urchin.sas and is available from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

The SAS program starts by reading in the data at the individual quadrat level:

    data urchin;
       infile 'urchin.csv' dlm=',' dsd firstobs=2 missover; /* the first record has the variable names */
       input transect quadrat legal sublegal;
       /* no need to specify sampling weights because transects are an SRS */
    run;

The dataset contains variables for the transect, the quadrat within each transect, and the number of legal and sub-legal sized urchins counted in that quadrat:

Obs   transect   quadrat   legal   sublegal
  1          1         1       0          0
  2          1         2       0          0
  3          1         3       0          0
  4          1         4       0          1
  5          1         5       0          0
  6          1         6       0          0
  7          1         7       0          0
  8          1         8       0          0
  9          1         9       0          0
 10          1        10       0          0

What is the population of interest and the parameter?

The population of interest is the sea urchins in the harvest area. These happened to be (artificially) “clustered” into transects which are sampled. All sea urchins within the cluster are measured.

The parameter of interest is the density of legal sized urchins.

What is the frame?

The frame is conceptual - there is no predefined list of all the possible transects. Rather they pick random points along the shore and then lay the transects out from that point.

What is the sampling design?


The sampling design is a cluster sample - the clusters are the transect lines while the quadrats measured within each cluster are similar to pseudo-replicates. The measurements within a transect are not independent of each other and are likely positively correlated (why?).

As the points along the shore were chosen using a simple random sample, the analysis proceeds as an SRS design on the cluster totals.

Excel Analysis

An Excel worksheet with the data and analysis is called urchin and is available in the AllofData workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets. A reduced view appears below:


The key first step in any analysis of a cluster survey is to summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the length of the transect). The Pivot Table feature of Excel is quite useful for doing this automatically. Unfortunately, you still have to play around with the final table in order to get the data displayed in a nice format.
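Outside of Excel, the same roll-up to the cluster level takes only a few lines. The Python sketch below uses a handful of invented (transect, quadrat, legal) records rather than the real urchin data, and accumulates the cluster total and cluster size for each transect.

```python
from collections import defaultdict

# Hypothetical quadrat-level records: (transect, quadrat, legal count).
records = [
    (1, 1, 0), (1, 2, 0), (1, 3, 2),
    (2, 1, 1), (2, 2, 3),
    (3, 1, 0), (3, 2, 0), (3, 3, 4), (3, 4, 1),
]

cluster_total = defaultdict(int)   # sum of legal urchins per transect
cluster_size = defaultdict(int)    # number of quadrats per transect

for transect, quadrat, legal in records:
    cluster_total[transect] += legal
    cluster_size[transect] += 1

for transect in sorted(cluster_total):
    print(transect, cluster_total[transect], cluster_size[transect])
```

The resulting totals and sizes, one pair per transect, are the only inputs needed for the cluster analysis.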

Note that there was no transect numbered 5, 12, 17, 19, or 32. Why are these transects missing? According to the records of the survey, inclement weather caused cancellation of the missing transects. It seems reasonable to treat the missing transects as missing completely at random (MCAR). In this case, there is no problem in simply ignoring the missing data - all that happens is that the precision is reduced compared to the design with all data present.

We compare the maximum(quadrat) number to the number of quadrat values actually recorded and see that they all match, indicating that no empty quadrats appear to have gone unrecorded.

In many transect studies, there is a tendency to NOT record quadrats with 0 counts as they don’t affect the cluster sum. However, you still have to know the correct size of the cluster (i.e. how many quadrats), so you can’t simply ignore these ‘missing’ values. In this case, you could examine the maximum of the quadrat number and the number of listed quadrats to see if these agree (why?).
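The check described above (largest quadrat number vs. number of quadrats listed) can be sketched in Python as follows. The records are invented for illustration, and the function name is hypothetical.

```python
def transects_with_gaps(records):
    """Return transects whose largest quadrat number differs from the
    number of quadrats actually listed for that transect."""
    max_quadrat = {}
    n_listed = {}
    for transect, quadrat in records:
        max_quadrat[transect] = max(max_quadrat.get(transect, 0), quadrat)
        n_listed[transect] = n_listed.get(transect, 0) + 1
    return sorted(t for t in max_quadrat if max_quadrat[t] != n_listed[t])

# Hypothetical (transect, quadrat) records; quadrat 2 of transect 2
# had a zero count and was never written down.
records = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (2, 4)]
print(transects_with_gaps(records))
```

A limitation worth keeping in mind: this comparison cannot detect empty quadrats that were omitted at the far end of a transect, because the largest recorded quadrat number then understates the true cluster size as well.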

Plot the cluster totals vs. the cluster size to see if a ratio estimator is appropriate, i.e. a linear relationship through the origin with variance increasing with cluster size.

The plot (not shown) shows a weak relationship between the two variables.

Compute the summary statistics on the cluster TOTALS. You will need the totals over all sampled clusters of both variables.

sum(legal)   sum(quad)   n(transect)
1507         1120        28

The estimated density is then found as a ratio estimator using the cluster totals:

$$\widehat{\text{density}} = \frac{\text{sum(legal)}}{\text{sum(quad)}} = \frac{1507}{1120} = 1.345536 \text{ urchins/m}^2.$$

To compute the se, create the diff column as in the ratio estimation section and find its standard deviation.

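The estimated se is then found as:

$$se(\widehat{\text{density}}) = \sqrt{\frac{s^2_{\text{diff}}}{n_{\text{transects}}} \times \frac{1}{\overline{\text{quad}}^2}} = \sqrt{\frac{48.09933^2}{28} \times \frac{1}{40^2}} = 0.2272 \text{ urchins/m}^2,$$

where the mean cluster size is $\overline{\text{quad}} = 1120/28 = 40$ quadrats per transect.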

In order to estimate the total number of urchins in the harvesting area, you simply multiply the estimated ratio and its standard error by the area to be harvested.
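As a cross-check of the arithmetic above, here is a minimal Python sketch that starts from the reported summary statistics (the raw quadrat data are not reproduced here). The helper `scale_to_area` and the 1000 m2 area used to call it are hypothetical; the notes do not give the size of the urchin harvesting area.

```python
import math

# Summary statistics reported in the text
sum_legal = 1507      # total legal-sized urchins over all sampled transects
sum_quad = 1120       # total number of quadrats over all sampled transects
n_transects = 28      # number of transects (clusters) sampled
s_diff = 48.09933     # std dev of the diff column, as reported in the text

density = sum_legal / sum_quad            # ratio estimator of density
mean_quad = sum_quad / n_transects        # mean cluster size (40 quadrats)
se_density = math.sqrt(s_diff**2 / n_transects) / mean_quad  # fpc ignored

def scale_to_area(area_m2):
    """Scale density and its se to a total for a given area (hypothetical helper)."""
    return density * area_m2, se_density * area_m2

print(f"density = {density:.6f}, se = {se_density:.4f}")
print(scale_to_area(1000))   # 1000 m2 is an illustrative area only
```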

SAS Analysis

SAS can use the raw data directly; it is not necessary to compute the cluster totals. However, computing them is a good first step to check the assumptions of the (ratio) analysis, i.e. that the relationship between the cluster total and the cluster size is approximately linear through the origin.

The total number of urchins and the number of quadrats on each transect are computed using Proc Means:

proc sort data=urchin; by transect; run;
proc means data=urchin noprint;
   by transect;
   var quadrat legal;
   output out=check min=min max=max n=n sum(legal)=tlegal;
run;

and then plotted:


proc sgplot data=check;
   title2 'Plot the relationship between the cluster total and cluster size';
   scatter y=tlegal x=n / datalabel=transect; /* use the transect number as the plotting character */
run;

Because we are computing a ratio estimator from a simple random sample of transects, it is not necessary to specify the sampling weights for the individual quadrats or the transect.

The key feature of Proc SurveyMeans is the use of the CLUSTER statement to identify the clusters in the data.

proc surveymeans data=urchin; /* do not specify a pop size as fpc is negligible */
   cluster transect;
   var legal;
   ods output statistics=urchinresults;
run;

The population number of transects was not specified as the finite population correction is negligible. Here are the results:

Variable Name   Mean    SE Mean   LCL Mean   UCL Mean
legal           1.346   0.227     0.879      1.812


The results are identical to those obtained via Excel.

Planning for future experiments

The rse of the estimate is 0.2272/1.3455 = 17%, which is not terrific. The determination of sample size is done in the same manner as in the ratio estimator case dealt with in earlier sections except that the number of CLUSTERS is found. If we wanted to get a rse near 5%, we would need almost 320 transects (roughly 28 × (0.169/0.05)²); this is likely too costly.
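The sample size figure can be reproduced with a short Python sketch. It assumes that the se shrinks in proportion to one over the square root of the number of transects and that the finite population correction can be ignored; the function name is hypothetical.

```python
import math

def clusters_needed(n_current, rse_current, rse_target):
    """Number of clusters needed for a target rse, assuming se is proportional
    to 1/sqrt(n) and the fpc is negligible (hypothetical helper)."""
    return math.ceil(n_current * (rse_current / rse_target) ** 2)

rse_now = 0.2272 / 1.345536          # about 17%, from the urchin survey
n_needed = clusters_needed(28, rse_now, 0.05)
print(n_needed)
```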

3.10.6 Example - estimating the total number of sea cucumbers

Sea cucumbers are considered a delicacy among some, and the fishery is of growing importance.

In order to set harvest quotas and in order to monitor the stock, it is important that the number of sea cucumbers in a certain harvest area be estimated each year.

The following is an example taken from Griffith Passage in BC in 1994.

To do this, the managers lay out a number of transects across the cucumber harvest area. Divers then swim along the transect while carrying a 4 m wide pole, and count the number of cucumbers within the width of the pole during the swim.

The number of possible transects is so large that the correction for finite population sampling can be ignored.

Here is the information summarized up to the transect level (the preliminary raw data is unavailable):


Transect area (m2)   Sea cucumbers

260 124

220 67

200 6

180 62

120 35

200 3

200 1

120 49

140 28

400 1

120 89

120 116

140 76

800 10

1460 50

1000 122

140 34

180 109

80 48

The total harvest area is 3,769,280 m2 as estimated by a GIS system.

The transects were laid out from one edge of the bed and the length of the edge is 51,436 m. Note that because each transect was 4 m wide, the number of possible transects is 1/4 of this value.

The SAS program is available in cucumber.sas from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

The data, already summarized to the cluster level, are read in the usual way in the Data step:

data cucumber;
   infile 'cucumber.csv' dlm=',' dsd missover firstobs=2;
   input area cucumbers;
   transect = _n_; /* number the transects */
run;

There is no explicit transect number, so one was created based on the row number in the data file.

What is the population of interest and the parameter?

The population of interest is the sea cucumbers in the harvest area. These happen to be (artificially) “clustered” into transects which are the sampling unit. All sea cucumbers within the transect (cluster) are measured.


The parameter of interest is the total number of cucumbers in the harvest area.

What is the frame?

The frame is conceptual - there is no predefined list of all the possible transects. Rather they pick random points along the edge of the harvest area, and then lay out the transect from there.

What is the sampling design?

The sampling design is a cluster sample - the clusters are the transect lines while the quadrats measured within each cluster are similar to pseudo-replicates. The measurements within a transect are not independent of each other and are likely positively correlated (why?).

The worksheet cucumber in the AllofData.xls workbook from the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets illustrates the computations in Excel. There are three different surveys illustrated. It also computes the two estimators when two potential outliers are deleted and for a second harvest area.

The key first step in any analysis of a cluster survey is to summarize the data to the cluster level. You will need the cluster total and the cluster size (in this case the area of the transect). This has already been done in the above data and so you don’t need to use a pivot table.

Now this summary table is simply an SRSWOR from the set of all transects. We first estimate the density, and then multiply by the area to estimate the total.

Note that after summarizing up to the transect level, this example proceeds in an analogous fashion to the grouse in pockets of brush example that we looked at earlier.

A plot of the cucumber total vs. the transect size shows a very poor relationship between the two variables. It will be interesting to compare the results from the simple inflation estimator and the ratio estimator.

Simple Inflation Estimator

First, estimate the number ignoring the area of the transects by using a simple inflation estimator.

The summary statistics that we need are:

n         19      transects
Mean      54.21   cucumbers/transect
Std Dev   42.37   cucumbers/transect

We compute an estimate of the total as $\hat{\tau} = N\bar{y} = (51{,}436/4) \times 54.21 = 697{,}093$ sea cucumbers. [Why did we use 51,436/4 rather than 51,436?]

We compute an estimate of the se of the total as:

$$se(\hat{\tau}) = \sqrt{N^2 \frac{s^2}{n} (1-f)} = \sqrt{(51{,}436/4)^2 \times \frac{42.37^2}{19}} = 124{,}981 \text{ sea cucumbers.}$$

The finite population correction factor is so small we simply ignore it.

This gives a relative standard error (se/est) of 18%.
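The simple inflation estimator can be cross-checked with a minimal Python sketch using the 19 transect counts from the table above. The variable names are illustrative only, and the finite population correction is ignored as in the text.

```python
import math
import statistics

# Sea cucumber counts on the 19 sampled transects (from the table above)
cucumbers = [124, 67, 6, 62, 35, 3, 1, 49, 28, 1,
             89, 116, 76, 10, 50, 122, 34, 109, 48]

N = 51436 / 4                 # number of possible 4 m wide transects
n = len(cucumbers)            # number of transects sampled

ybar = statistics.mean(cucumbers)    # mean count per transect
s = statistics.stdev(cucumbers)      # sample std dev (divisor n - 1)

total_hat = N * ybar                       # inflation estimate of the total
se_total = math.sqrt(N**2 * s**2 / n)      # se of the total, fpc ignored
rse = se_total / total_hat                 # relative standard error

print(f"total = {total_hat:,.0f}  se = {se_total:,.0f}  rse = {rse:.0%}")
```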

Ratio Estimator


We use the methods outlined earlier for ratio estimators from SRSWOR to get the following summary table:

        area     cucumbers
Mean    320.00   54.21       per transect

The estimated density of sea cucumbers is then

$$\widehat{\text{density}} = \frac{\text{mean(cucumbers)}}{\text{mean(area)}} = \frac{54.21}{320.00} = 0.169 \text{ cucumbers/m}^2.$$

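To compute the se, create the diff column as in the ratio estimation section and find its standard deviation as $s_{\text{diff}} = 73.63$. The estimated se of the ratio is then found as:

$$se(\widehat{\text{density}}) = \sqrt{\frac{s^2_{\text{diff}}}{n_{\text{transects}}} \times \frac{1}{\overline{\text{area}}^2}} = \sqrt{\frac{73.63^2}{19} \times \frac{1}{320^2}} = 0.053 \text{ cucumbers/m}^2.$$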

We once again ignore the finite population correction factor.

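In order to estimate the total number of cucumbers in the harvesting area, you simply multiply the above by the area to be harvested:

$$\hat{\tau}_{\text{ratio}} = \text{area} \times \widehat{\text{density}} = 3{,}769{,}280 \times 0.169 = 638{,}546 \text{ sea cucumbers}$$

(computed with the unrounded density of 0.169408).

The se is found as $se(\hat{\tau}_{\text{ratio}}) = \text{area} \times se(\widehat{\text{density}}) = 3{,}769{,}280 \times 0.053 = 198{,}983$ sea cucumbers (computed with the unrounded se of 0.052791) for an overall rse of 31%.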
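The ratio estimator computations can be sketched in Python from the 19 pairs of transect area and count. The diff column is taken to be the count minus the estimated density times the area, which is an assumption about the earlier ratio estimation section, although it reproduces the reported standard deviation of 73.63. Variable names are illustrative.

```python
import math

# Transect areas (m2) and sea cucumber counts from the summary table
area = [260, 220, 200, 180, 120, 200, 200, 120, 140, 400,
        120, 120, 140, 800, 1460, 1000, 140, 180, 80]
cucumbers = [124, 67, 6, 62, 35, 3, 1, 49, 28, 1,
             89, 116, 76, 10, 50, 122, 34, 109, 48]
harvest_area = 3_769_280      # total harvest area in m2 (from the GIS)

n = len(area)
density = sum(cucumbers) / sum(area)        # ratio of totals = ratio of means

# diff = observed count minus the count predicted by the ratio line
diff = [y - density * x for x, y in zip(area, cucumbers)]
mean_diff = sum(diff) / n                   # essentially zero by construction
s_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diff) / (n - 1))

mean_area = sum(area) / n
se_density = math.sqrt(s_diff**2 / n) / mean_area   # fpc ignored

total_ratio = harvest_area * density
se_total_ratio = harvest_area * se_density

print(f"density = {density:.6f}  se = {se_density:.6f}")
print(f"total = {total_ratio:,.0f}  se = {se_total_ratio:,.0f}")
```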

SAS Analysis

Because only the summary data is available, you cannot use the CLUSTER statement of Proc SurveyMeans. Rather, as noted earlier in the notes, you form a ratio estimator based on the cluster totals.

We begin with a plot to see the relationship between transect area and numbers of cucumbers:

proc sgplot data=cucumber;
   title2 'plot the relationship between the cluster total and cluster size';
   scatter y=cucumbers x=area / datalabel=transect; /* use the transect number as the plotting character */
run;


Because the relationship between the number of cucumbers and transect area is not very strong, a simple inflation estimator will be tried first. The sampling weights must be computed. Each weight is equal to the total number of possible transects (the shore length divided by the 4 m transect width) divided by the number of transects taken:

/* First compute the sampling weight and add to the dataset */
/* The sampling weight is simply the total pop size / # sampling units in an SRS */
/* In this example, transects were an SRS from all possible transects */
proc means data=cucumber n mean std;
   var cucumbers;
   /* get the total number of transects */
   output out=weight n=samplesize;
run;

data cucumber;
   merge cucumber weight;
   retain samplingweight;
   /* we divide the shore length by 4 because each transect is 4 m wide */
   if samplesize > . then samplingweight = 51436/4 / samplesize;
run;

And then the simple inflation estimator is used via Proc SurveyMeans:

proc surveymeans data=cucumber mean clm sum clsum cv;
   /* N not specified as we ignore the fpc in this problem */
   /* mean clm - find estimate of mean and confidence intervals */
   /* sum clsum - find estimate of grand total and confidence intervals */
   title2 'Simple inflation estimator using cluster totals';
   var cucumbers;
   weight samplingweight;
   ods output statistics=cucumberresultssimple;
run;

Variable Name   Mean     SE Mean   LCL Mean   UCL Mean   Sum          SE Sum       LCL Sum      UCL Sum
cucumbers       54.211   9.719     33.791     74.630     697093.158   124980.863   434518.107   959668.208

Now for the ratio estimator. First use Proc SurveyMeans to compute the density, and then inflate the density by the total area of the cucumber bed:

proc surveymeans data=cucumber ratio clm;
   /* the ratio clm keywords request a ratio estimator and a confidence interval. */
   title2 'Estimation using a ratio estimator';
   var cucumbers area;
   ratio cucumbers / area;
   ods output ratio=cucumberratio; /* extract information so that total can be estimated */
run;

data cucumbertotal;
   /* compute estimates of the total */
   set cucumberratio;
   cv = stderr / ratio; /* the relative standard error of the estimate */
   Est_total = ratio * 3769280;
   Se_total = stderr * 3769280;
   UCL_total = uppercl * 3769280;
   LCL_total = lowercl * 3769280;
   format est_total se_total ucl_total lcl_total 7.1;
   format cv 7.2;
   format ratio stderr lowercl uppercl 7.3;
run;

This gives the final results:

Numerator Variable   Denominator Variable   Ratio      LowerCL      StdErr     UpperCL
cucumbers            area                   0.169408   0.05849883   0.052791   0.28031696

Obs   Ratio   StdErr   LowerCL   UpperCL   Est_total   Se_total   LCL_total   UCL_total
1     0.169   0.053    0.058     0.280     638546      198983     220498      1056593


Comparing the two approaches

Why did the ratio estimator do worse in this case than the simple inflation estimator in Griffith Passage? The plot of the number of sea cucumbers vs. the area of the transect (not shown) shows virtually no relationship between the two; hence there is no advantage to using a ratio estimator.

In more advanced courses, it can be shown that the ratio estimator will do better than the inflation estimator if the correlation between the two variables is greater than 1/2 of the ratio of their respective relative variations (std dev/mean), taking the relative variation of the auxiliary variable (area) over that of the response (cucumbers). Here half of the ratio of their relative variations is 0.732, while the correlation between the two variables is 0.041. Hence the ratio estimator will not do well.
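The comparison can be checked numerically with a small Python sketch. It takes the threshold to be half of the relative variation of the transect area divided by the relative variation of the cucumber count, which reproduces the 0.732 quoted above; variable names are illustrative.

```python
import math

area = [260, 220, 200, 180, 120, 200, 200, 120, 140, 400,
        120, 120, 140, 800, 1460, 1000, 140, 180, 80]
cucumbers = [124, 67, 6, 62, 35, 3, 1, 49, 28, 1,
             89, 116, 76, 10, 50, 122, 34, 109, 48]

n = len(area)
mean_x = sum(area) / n
mean_y = sum(cucumbers) / n

# Corrected sums of squares and cross products
sxx = sum((x - mean_x) ** 2 for x in area)
syy = sum((y - mean_y) ** 2 for y in cucumbers)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(area, cucumbers))

corr = sxy / math.sqrt(sxx * syy)            # Pearson correlation
cv_x = math.sqrt(sxx / (n - 1)) / mean_x     # relative variation of area
cv_y = math.sqrt(syy / (n - 1)) / mean_y     # relative variation of counts
threshold = 0.5 * cv_x / cv_y

ratio_preferred = corr > threshold
print(f"correlation = {corr:.3f}, threshold = {threshold:.3f}, "
      f"ratio estimator preferred: {ratio_preferred}")
```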

The Excel worksheet also repeats the analysis for Griffith Passage after dropping some obvious outliers. This only makes things worse! As well, at the bottom of the worksheet, a sample size computation shows that substantially more transects are needed using a ratio estimator than for an inflation estimator. It appears that in Griffith Passage there is a negative correlation between the length of the transect and the number of cucumbers found! No biological reason for this has been found. This is a cautionary example to illustrate that even the best laid plans can go astray - always plot the data.

A third worksheet in the workbook analyses the data for Sheep Passage. Here the ratio estimator outperforms the inflation estimator, but not by a wide factor.


3.11 Multi-stage sampling - a generalization of cluster sampling

3.11.1 Introduction

All of the designs considered above select a sampling unit from the population and then do a complete measurement upon that item. In the case of cluster sampling, this is facilitated by dividing the sampling unit into small observational units, but all of the observational units within the sampled cluster are measured.

If the units within a cluster are fairly homogeneous, then it seems wasteful to measure every unit. In the extreme case, if every observational unit within a cluster was identical, only a single observational unit from the cluster needs to be selected in order to estimate (without any error) the cluster total. Suppose then that the observational units within a cluster were not identical, but had some variation? Why not take a sub-sample from each cluster, e.g. in the urchin survey, count the urchins in every second or third quadrat rather than every quadrat on the transect?

This method is called two-stage sampling. In the first stage, larger sampling units are selected using some probability design. In the second stage, smaller units within the selected first-stage units are selected according to a probability design. The design used at each stage can be different, e.g. first stage units selected using a simple random sample, but second stage units selected using a systematic design as proposed for the urchin survey above.

This sampling design can be generalized to multi-stage sampling.

Some examples of multi-stage designs are:

• Vegetation Resource Inventory. The forest land mass of BC has been mapped using aerial methods and divided into a series of polygons representing homogeneous stands of trees (e.g. a stand dominated by Douglas-fir). In order to estimate timber volumes in an inventory unit, a sample of polygons is selected using a probability-proportional-to-size design. In the selected polygons, ground measurement stations are selected on a 100 m grid and crews measure standing timber at these selected ground stations.

• Urchin survey. Transects are selected using a simple random sample design. Every second or third quadrat is measured after a random starting point.

• Clam surveys. Beaches are divided into 1 ha sections. A random sample of sections is selected and a series of 1 m2 quadrats are measured within each section.

• Herring spawn biomass. Schweigert et al. (1985, CJFAS, 42, 1806-1814) used a two-stage design to estimate herring spawn in the Strait of Georgia.

• Georgia Strait Creel Survey. The Georgia Strait Creel Survey uses a multi-stage design to select landing sites within strata, times of days to interview at these selected sites, and which boats to interview in a survey of angling effort on the Georgia Strait.

Some consequences of simple two-stage designs are:

• If the selected first-stage units are completely enumerated then complete cluster sampling results.

• If every first-stage unit in the population is selected, then a stratified design results.

• A complete frame is required for all first-stage units. However, a frame of second-stage and lower-stage units need only be constructed for the selected upper-stage units.

• The design is very flexible allowing (in theory) different selection methods to be used at each stage, and even different selection methods within each first stage unit.

• A separate randomization is done within each first-stage unit when selecting the second-stage units.

• Multi-stage designs are less precise than a simple random sample of the same number of final sampling units, but more precise than a cluster sample of the same number of final sampling units. [Hint: think of what happens if the second-stage units are very similar.]

• Multi-stage designs are cheaper than a simple random sample of the same number of final sampling units, but more expensive than a cluster sample of the same number of final sampling units. [Hint: think of the travel costs in selecting more transects or measuring quadrats within a transect.]

• As in all sampling designs, stratification can be employed at any level and ratio and regression estimators are available. As expected, the theory becomes more and more complex, the more "variations" are added to the design.

The primary incentives for multi-stage designs are that

1. frames of the final sampling units are typically not available;

2. it often turns out that most of the variability in the population occurs among first-stage units. Why spend time and effort in measuring lower stage units that are relatively homogeneous within the first-stage unit?

3.11.2 Notation

A sample of $n$ first-stage units (FSU) is selected from a total of $N$ first-stage units. Within the $i$th first-stage unit, $m_i$ second-stage units (SSU) are selected from the $M_i$ units available.

Item                  Population value        Sample value
First stage units     $N$                     $n$
Second stage units    $M_i$                   $m_i$
SSUs in population    $M = \sum M_i$
Value of SSU          $Y_{ij}$                $y_{ij}$
Total of FSU          $\tau_i$                $\hat{\tau}_i = \frac{M_i}{m_i} \sum_{j=1}^{m_i} y_{ij}$
Total in pop          $\tau = \sum \tau_i$
Mean in pop           $\mu = \tau / M$

3.11.3 Summary of main results

We will only consider the case when simple random sampling occurs at both stages of the design.


The intuitive explanation for the results is that a total is estimated for each FSU selected (based on the SSU selected). These estimated totals are then used in a similar fashion to a cluster sample to estimate the grand total.

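Total:
   Population value: $\tau = \sum \tau_i$
   Estimate: $\hat{\tau} = \frac{N}{n} \sum_{i=1}^{n} \hat{\tau}_i$
   Estimated se:

$$se(\hat{\tau}) = \sqrt{N^2 (1 - f_1) \frac{s_1^2}{n} + \frac{N^2 f_1}{n^2} \sum_{i=1}^{n} M_i^2 (1 - f_{2i}) \frac{s_{2i}^2}{m_i}}$$

Mean:
   Population value: $\mu = \tau / M$
   Estimate: $\hat{\mu} = \hat{\tau} / M$
   Estimated se: $se(\hat{\mu}) = \sqrt{se^2(\hat{\tau}) / M^2}$

where

$$s_1^2 = \frac{\sum_{i=1}^{n} \left(\hat{\tau}_i - \bar{\hat{\tau}}\right)^2}{n - 1}, \qquad s_{2i}^2 = \frac{\sum_{j=1}^{m_i} \left(y_{ij} - \bar{y}_i\right)^2}{m_i - 1}, \qquad \bar{\hat{\tau}} = \frac{1}{n} \sum_{i=1}^{n} \hat{\tau}_i,$$

$f_1 = n/N$ and $f_{2i} = m_i/M_i$.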

Notes:

• There are two contributions to the estimated se - variation among first stage totals ($s_1^2$) and variation among second stage units ($s_{2i}^2$).

• If the FSU vary considerably in size, a ratio estimator (not discussed in these notes) may be more appropriate.

Confidence Intervals The usual large sample confidence intervals can be used.
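The formulae above can be sketched as a small Python function for the case of simple random sampling at both stages. The function name and the toy data are hypothetical and chosen only to exercise the code; each sampled FSU must contain at least two sampled SSUs so that its within-unit variance can be computed.

```python
import math

def two_stage_total(samples, sizes, N):
    """Estimate a population total and its se from a two-stage sample.

    samples: list of lists; samples[i] holds the y_ij measured in FSU i
    sizes:   list of M_i, the number of SSUs available in each sampled FSU
    N:       number of FSUs in the population
    (Hypothetical helper; SRS without replacement at both stages.)
    """
    n = len(samples)
    f1 = n / N
    tau_hat = []          # estimated total for each sampled FSU
    within = 0.0          # sum of M_i^2 (1 - f_2i) s_2i^2 / m_i
    for y, M in zip(samples, sizes):
        m = len(y)
        ybar = sum(y) / m
        s2 = sum((v - ybar) ** 2 for v in y) / (m - 1)
        tau_hat.append(M * ybar)
        within += M**2 * (1 - m / M) * s2 / m
    tau_bar = sum(tau_hat) / n
    s1_sq = sum((t - tau_bar) ** 2 for t in tau_hat) / (n - 1)
    total = N / n * sum(tau_hat)
    variance = N**2 * (1 - f1) * s1_sq / n + N**2 * f1 / n**2 * within
    return total, math.sqrt(variance)

# Toy data: 3 of 10 FSUs sampled, with 2 or 3 SSUs measured in each
total, se = two_stage_total([[2, 4], [1, 3, 5], [6, 6]], sizes=[4, 5, 6], N=10)
print(total, se)
```

If every SSU in every FSU of the population is measured, both terms of the variance vanish and the se is zero, as expected for a census.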

3.11.4 Example - estimating number of oysters

A First Nation wished to develop a wild oyster fishery. As a first stage in the development of the fishery, a survey was needed to establish the current stock in a number of oyster beds. This example looks at the estimate of oyster numbers from a survey conducted in 1994.

The survey was conducted by running a line through the oyster bed; the total length was 105 m. Several random locations were chosen along the line. At each randomly chosen location, the width of the bed was measured and about 3 random locations along the perpendicular transect at that point were taken. A 1 m2 quadrat was applied at each of these, and the number of oysters of various sizes was counted in the quadrat.


Location  transect (m)  width (m)  quadrat  seed  xsmall  small  med  large  total count  net weight (kg)

Lloyd 5 17 3 18 18 41 48 14 139 14.6

Lloyd 5 17 5 6 4 30 9 4 53 5.2

Lloyd 5 17 10 15 21 44 13 11 104 8.2

Lloyd 7 18 5 8 10 14 5 3 40 6.0

Lloyd 7 18 12 10 38 36 16 4 104 10.2

Lloyd 7 18 13 0 15 12 3 3 33 4.6

Lloyd 18 14 1 11 8 5 9 19 52 7.8

Lloyd 18 14 5 13 23 68 18 11 133 12.6

Lloyd 18 14 8 1 29 60 2 1 93 10.2

Lloyd 30 11 3 17 1 13 13 2 46 5.4

Lloyd 30 11 8 12 16 23 22 14 87 6.6

Lloyd 30 11 10 23 15 19 17 1 75 7.0

Lloyd 49 9 3 10 27 15 1 0 53 2.0

Lloyd 49 9 5 13 7 14 11 4 49 6.8

Lloyd 49 9 8 10 25 17 16 11 79 6.0

Lloyd 76 21 4 3 3 11 7 0 24 4.0

Lloyd 76 21 7 15 4 32 26 24 101 12.4

Lloyd 76 21 11 2 19 14 19 0 54 5.8

Lloyd 79 18 1 14 13 7 9 0 43 3.6

Lloyd 79 18 4 0 32 32 27 16 107 12.8

Lloyd 79 18 11 16 22 43 18 8 107 10.6

Lloyd 84 19 1 14 32 25 39 7 117 10.2

Lloyd 84 19 8 25 43 42 17 3 130 7.2

Lloyd 84 19 15 5 22 61 30 13 131 14.2

Lloyd 86 17 8 1 19 32 10 8 70 8.6

Lloyd 86 17 11 8 17 13 10 3 51 4.8

Lloyd 86 17 12 7 22 55 11 4 99 9.8

Lloyd 95 20 1 17 12 20 18 4 71 5.0

Lloyd 95 20 8 32 4 26 29 12 103 11.6

Lloyd 95 20 15 3 34 17 11 1 66 6.0

The data is available in the wildoyster.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

The data are read in the usual way:

data oyster;
   infile 'wildoyster.csv' dlm=',' dsd missover firstobs=2;
   input loc $ transect width quad seed xsmall small med large total weight;
   sampweight = 105/10 * width/3; /* sampling weight = product of the inverse sampling fractions */
run;

The sample weight is computed as the product of the inverses of the sampling fractions at the first stage (105/10) and the second stage (width/3).

These multi-stage designs are complex to analyze. Rather than trying to implement the various formulae by hand, I would suggest that a proper sampling package be used (such as SAS or R).

If using simple packages, the first step is to move everything up to the primary sampling unit level. We need to estimate the total at the primary sampling unit, and to compute some components of the variance from the second stage of sampling.

Excel Spreadsheet

The analysis was done in Excel as shown in the wildoyster worksheet in the ALLofData.xls workbook from the Sample Program Library.

Because Excel does not have any explicit functions for the analysis of survey data, we need to first estimate the cluster size and cluster totals at the first stage, and then use the standard ratio estimators on these estimated totals.

As in the case of a pure cluster sample, the PivotTable feature can be used to compute summary statistics needed to estimate the various components.

SAS Analysis

Proc SurveyMeans is used directly with the two-stage design. The cluster statement identifies the first stage of the sampling.

/* estimate the total biomass on the oyster bed */
/* Note that SurveyMeans only use a first stage variance in its
   computation of the standard error. As the first stage sampling
   fraction is usually quite small, this will tend to give only
   slight underestimates of the true standard error of the estimate */
proc surveymeans data=oyster
      total=105      /* length of first reference line */
      mean clm
      sum clsum ;    /* interested in total biomass estimate */
   cluster transect; /* identify the perpendicular transects */
   var weight;
   weight sampweight;
   ods output statistics=oysterresults;
run;

Note that Proc SurveyMeans computes the se using only the first stage standard errors. As the first stage sampling fraction is usually quite small, this will tend to give only slight underestimates of the true standard error of the estimate.

The final results are:

Variable Name   Mean   SE Mean   LCL Mean   UCL Mean   Sum         SE Sum     LCL Sum     UCL Sum
weight          8.2    0.5       7.1        9.2        14070.000   1444.920   10801.364   17338.636

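Our final estimate is a total biomass of 14,070 kg with an estimated se of 1484 kg once the second-stage variation is included (SAS reports 1445 kg because it uses only the first-stage variation).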
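Both standard errors can be reproduced from the net weights in the data table with a short Python sketch of the two-stage formulae. It assumes, as the sampling weight in the SAS code implies, that there are 105 possible transect positions along the reference line and that the number of possible quadrats on a transect equals the bed width in metres; the variable names are illustrative.

```python
import math

# transect position (m): (bed width in m, net weight in kg of each sampled quadrat)
oyster = {
    5:  (17, [14.6, 5.2, 8.2]),
    7:  (18, [6.0, 10.2, 4.6]),
    18: (14, [7.8, 12.6, 10.2]),
    30: (11, [5.4, 6.6, 7.0]),
    49: (9,  [2.0, 6.8, 6.0]),
    76: (21, [4.0, 12.4, 5.8]),
    79: (18, [3.6, 12.8, 10.6]),
    84: (19, [10.2, 7.2, 14.2]),
    86: (17, [8.6, 4.8, 9.8]),
    95: (20, [5.0, 11.6, 6.0]),
}

N = 105                # assumed number of possible transects (1 m positions)
n = len(oyster)        # transects sampled
f1 = n / N

tau_hat = []           # estimated total weight on each sampled transect
within = 0.0           # second-stage sum: M_i^2 (1 - f_2i) s_2i^2 / m_i
for width, y in oyster.values():
    m = len(y)
    ybar = sum(y) / m
    s2 = sum((v - ybar) ** 2 for v in y) / (m - 1)
    tau_hat.append(width * ybar)          # M_i is taken to be the width
    within += width**2 * (1 - m / width) * s2 / m

total = N / n * sum(tau_hat)
tau_bar = sum(tau_hat) / n
s1_sq = sum((t - tau_bar) ** 2 for t in tau_hat) / (n - 1)

var_first = N**2 * (1 - f1) * s1_sq / n       # first-stage term only
var_second = N**2 * f1 / n**2 * within        # second-stage term
se_first = math.sqrt(var_first)
se_full = math.sqrt(var_first + var_second)

print(f"total = {total:.0f} kg, se (first stage only) = {se_first:.1f}, "
      f"se (both stages) = {se_full:.1f}")
```

The first-stage-only value matches the SAS output, which illustrates the size of the underestimate mentioned in the code comments.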

A similar procedure can be used for the other variables.

3.11.5 Some closing comments on multi-stage designs

The above example barely scratches the surface of multi-stage designs. Multi-stage designs can be quite complex and the formulae for the estimates and estimated standard errors fearsome. If you have to analyze such a design, it is likely better to invest some time in learning one of the statistical packages designed for surveys (e.g. SAS v.8) rather than trying to program the tedious formulae by hand.

There are also several important design decisions for multi-stage designs.

• Two-stage designs have reduced costs of data collection because units within the FSU are easier to collect but also have a poorer precision compared to a simple random sample with the same number of final sampling units. However, because of the reduced cost, it often turns out that more units can be sampled under a multi-stage design leading to an improved precision for the same cost as a simple random sample design. There is a tradeoff between sampling more first stage units and taking a small sub-sample in the secondary stage. An optimal allocation strategy can be constructed to decide upon the best strategy; consult some of the reference books on sampling for details.

• As with ALL sampling designs, stratification can be used to improve precision. The stratification usually takes place at the first sampling unit stage, but can take place at all stages. The details of estimation under stratification can be found in many sampling texts.

• Similarly, ratio or regression estimators can also be used if auxiliary information is available that is correlated with the response variable. This leads to very complex formulae!

One very nice feature of multi-stage designs is that if the first stage is sampled with replacement, then the formulae for the estimated standard errors simplify considerably to a single term regardless of the design used in the lower stages! If there are many first stage units in the population and if the sampling fraction is small, the chances of selecting the same first stage unit twice are very small. Even if this occurs, a different set of second stage units will likely be selected so there is little danger of having to measure the same final sampling unit more than once. In such situations, the design at second and lower stages is very flexible as all that you need to ensure is that an unbiased estimate of the first-stage unit total is available.
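As a sketch of this simplification, assume the first-stage units are drawn with replacement and with equal probability (the notes do not give the formula, so this is the standard textbook form rather than the author's own statement). The point estimate is unchanged and the estimated se uses only the variation among the estimated first-stage totals:

```latex
\hat{\tau} = \frac{N}{n} \sum_{i=1}^{n} \hat{\tau}_i ,
\qquad
se(\hat{\tau}) = \sqrt{N^2 \, \frac{s_1^2}{n}} ,
\qquad
s_1^2 = \frac{\sum_{i=1}^{n} \left( \hat{\tau}_i - \bar{\hat{\tau}} \right)^2}{n - 1}
```

Compared with the without-replacement result, the factor $(1 - f_1)$ and the whole second-stage term disappear; the within-unit variation is already reflected in the spread of the $\hat{\tau}_i$.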


3.12 Analytical surveys - almost experimental design

In descriptive surveys, the objective was to simply obtain information about one large group. In observational studies, two sub-populations are deliberately selected and surveyed, but no attempt is made to generalize the results to the whole population. In analytical studies, sub-populations are selected and sampled in order to generalize the observed differences among the sub-populations to this and other similar populations.

As such, there are similarities between analytical and observational surveys and experimental design. The primary difference is that in experimental studies, the manager controls the assignment of the explanatory variables while measuring the response variables, while in analytical and observational surveys, neither set of variables is under the control of the manager. [Refer back to Examples B, C, and D in the earlier chapters.] The analysis of complex surveys for analytical purposes can be very difficult (Kish 1987; Kish, 1984; Rao, 1973; Sedransk, 1965a, 1965b, 1966).

As in experimental studies, the first step in analytical surveys is to identify potential explanatory variables (similar to factors in experimental studies). At this point, analytical surveys can usually be further subdivided into three categories depending on the type of stratification:

• the population is pre-stratified by the explanatory variables and surveys are conducted in each stratum to measure the outcome variables;

• the population is surveyed in its entirety, and post-stratified by the explanatory variables;

• the explanatory variables can be used as auxiliary variables in ratio or regression methods.

[It is possible that all three types of stratification take place - these are very complex surveys.]

The choice between the categories is usually made by the ease with which the population can be pre-stratified and the strength of the relationship between the response and explanatory variables. For example, sample plots can be easily pre-stratified by elevation or by exposure to the sun, but it would be difficult to pre-stratify by soil pH.

Pre-stratification has the advantage that the manager has control over the number of sample points collected in each stratum, whereas in post-stratification, the numbers are not controllable, and may lead to very small sample sizes in certain strata just because they form only a small fraction of the population.

For example, a manager may wish to investigate the difference in regeneration (as measured by the density of new growth) as a function of elevation. Several cut blocks will be surveyed. In each cut block, the sample plots will be pre-stratified into three elevation classes, and a simple random sample will be taken in each elevation class. The allocation of effort in each stratum (i.e. the number of sample plots) will be equal. The density of new growth will be measured on each selected sample plot. On the other hand, suppose that the regeneration is a function of soil pH. This cannot be determined in advance, and so the manager must take a simple random sample over the entire stand, measure the density of new growth and the soil pH at each sampling unit, and then post-stratify the data based on measured pH. The number of sampling units in each pH class is not controllable; indeed it may turn out that certain pH classes have no observations.
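The contrast between a fixed allocation and an uncontrolled one can be illustrated with a small simulation. This is only a sketch: the pH class boundaries, the distribution of pH over the stand, and the sample size of 30 plots are all hypothetical values chosen for illustration.

```python
import random
from collections import Counter

def ph_class(ph):
    """Assign a measured soil pH to one of three hypothetical classes."""
    if ph < 5.5:
        return "acidic"
    if ph < 7.0:
        return "neutral"
    return "alkaline"

def post_stratify(n_plots, seed=1):
    """Take a simple random sample of plots and count the plots per pH class."""
    rng = random.Random(seed)
    # Hypothetical stand: pH is roughly normal around 5.0, so alkaline plots are rare.
    measured_ph = [rng.gauss(5.0, 0.6) for _ in range(n_plots)]
    counts = Counter(ph_class(ph) for ph in measured_ph)
    return {c: counts.get(c, 0) for c in ("acidic", "neutral", "alkaline")}

# Pre-stratification by elevation: the manager fixes the allocation in advance.
pre_stratified = {"low": 10, "mid": 10, "high": 10}

# Post-stratification by pH: the class counts are whatever the sample delivers.
post_stratified = post_stratify(30)

print(pre_stratified)
print(post_stratified)
```

The pre-stratified counts are set by the manager, while the post-stratified counts depend on how common each pH class is in the stand; rare classes may receive few or no plots.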

If explanatory variables are treated as auxiliary variables, then there must be a strong relationship between the response and explanatory variables. Additionally, we must be able to measure the auxiliary variable precisely for each unit. Then, methods like multiple regression can also be used to investigate the relationship between the response and the explanatory variable. For example, rather than classifying elevation into three broad elevation classes or soil pH into broad pH classes, the actual elevation or soil pH must be measured precisely to serve as an auxiliary variable in a regression of regeneration density vs. elevation or soil pH.

If the units have been selected using a simple random sample, then the analysis of the analytical surveys proceeds along similar lines as the analysis of designed experiments (Kish, 1987; also refer to Chapter 2). In most analyses of analytical surveys, the observed results are postulated to have been taken from a hypothetical super-population of which the current conditions are just one realization. In the above example, cut blocks would be treated as a random blocking factor; elevation class as an explanatory factor; and sample plots as samples within each block and elevation class. Hypothesis testing about the effect of elevation on mean density of regeneration occurs as if this were a planned experiment.

Pitfall: Any one of the sampling methods described in Section 2 for descriptive surveys can be used for analytical surveys. Many managers incorrectly use the results from a complex survey as if the data were collected using a simple random sample. As Kish (1987) and others have shown, this can lead to substantial underestimates of the true standard error, i.e., the precision is thought to be far better than is justified based on the survey results. Consequently the manager may erroneously detect differences more often than expected (i.e., make a Type I error) and make decisions based on erroneous conclusions.
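A small numerical sketch shows how the standard error can be understated when clustered data are analyzed as a simple random sample. The data are hypothetical, the clusters are assumed to be of equal size, and the finite population correction is ignored to keep the arithmetic short.

```python
import math
from statistics import mean, stdev

# Hypothetical data: 4 sampled clusters of 3 units each.
# Units within a cluster are much more alike than units in different clusters.
clusters = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]

# Incorrect analysis: treat all 12 units as an independent simple random sample.
all_units = [y for cluster in clusters for y in cluster]
naive_se = stdev(all_units) / math.sqrt(len(all_units))

# Analysis that respects the design: the cluster is the sampling unit, so the
# effective sample size is the number of clusters (4), not the number of units (12).
cluster_means = [mean(cluster) for cluster in clusters]
cluster_se = stdev(cluster_means) / math.sqrt(len(clusters))

print(f"estimated mean           : {mean(all_units):.2f}")
print(f"naive SE (SRS assumed)   : {naive_se:.2f}")
print(f"SE based on cluster means: {cluster_se:.2f}")
```

Both analyses give the same point estimate, but the naive standard error is noticeably smaller because it treats correlated units within a cluster as if they were independent pieces of information.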

Solution: As in experimental design, it is important to match the analysis of the data with the survey design used to collect it. The major difficulties in the analysis of analytical surveys are:

1. Recognizing and incorporating the sampling method used to collect the data in the analysis. The survey design used to obtain the sampling units must be taken into account in much the same way as the analysis of the collected data is influenced by the actual experimental design. A table of ‘equivalences’ between terms in a sample survey and terms in experimental design is provided in Table 1.

Table 1. Equivalences between terms used in surveys and in experimental design.

| Survey term | Experimental design term |
| --- | --- |
| Simple random sample | Completely randomized design |
| Cluster sampling | (a) Clusters are random effects; units within a cluster are treated as sub-samples; or (b) clusters are treated as main plots; units within a cluster are treated as sub-plots in a split-plot analysis. |
| Multi-stage sampling | (a) Nested designs with units at each stage nested in units in higher stages. Effects of units at each stage are treated as random effects; or (b) split-plot designs with factors operating at higher stages treated as main plot factors and factors operating at lower stages treated as sub-plot factors. |
| Stratification | Fixed factor or random block depending on the reasons for stratification. |
| Sampling unit | Experimental unit or treatment unit |
| Sub-sample | Sub-sample |

There is no quick, easy method for the analysis of complex surveys (Kish, 1987). The super-population approach seems to work well if the selection probabilities of each unit are known (these are used to weight each observation appropriately) and if random effects corresponding to the various strata or stages are employed. The major difficulty caused by complex survey designs is that the observations are not independent of each other.

2. Unbalanced designs (e.g. unequal numbers of sample points in each combination of explanatory factors). This typically occurs if post-stratification is used to classify units by the explanatory variables but can also occur in pre-stratification if the manager decides not to allocate equal effort in each stratum. The analysis of unbalanced data is described by Milliken and Johnson (1984).

3. Missing cells, i.e., certain combinations of explanatory variables may not occur in the survey. The analysis of such surveys is complex, but refer to Milliken and Johnson (1984).

4. If the range of the explanatory variable is naturally limited in the population, then extrapolation outside of the observed range is not recommended.

More sophisticated techniques can also be used in analytical surveys. For example, correspondence analysis, ordination methods, factor analysis, multidimensional scaling, and cluster analysis all search for post-hoc associations among measured variables that may give rise to hypotheses for further investigation. Unfortunately, most of these methods assume that units have been selected independently of each other using a simple random sample; extensions where units have been selected via a complex sampling design have not yet been developed. Simpler designs are often highly preferred to avoid erroneous conclusions based on inappropriate analysis of data from complex designs.

Pitfall: While the analyses of analytical surveys and designed experiments are similar, the strength of the conclusions is not. In general, causation cannot be inferred without manipulation. An observed relationship in an analytical survey may be the result of a common response to a third, unobserved variable. For example, consider the following two studies. In the first study, the explanatory variable is elevation (high or low). Ten stands are randomly selected at each elevation. The amount of growth is measured and it appears that stands at higher elevations have less growth. In the second study, the explanatory variable is the amount of fertilizer applied. Ten stands are randomly assigned to each of two doses of fertilizer. The amount of growth is measured and it appears that stands that receive a higher dose of fertilizer have greater growth. In the first study, the manager is unable to say whether the differences in growth are a result of differences in elevation or amount of sun exposure or soil quality as all three may be highly related. In the second study, all uncontrolled factors are present in both groups and their effects will, on average, be equal. Consequently, the assignment of cause to the fertilizer dose is justified because it is the only factor that differs (on average) among the groups.
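The elevation and fertilizer comparison can be mimicked with a short simulation. This is a sketch under invented assumptions: growth is made to depend only on sun exposure and fertilizer dose (never on elevation), high-elevation stands are assumed to receive less sun, and all numerical values are hypothetical.

```python
import random
from statistics import mean

rng = random.Random(7)

def growth(sun, fertilizer_dose):
    """Hypothetical truth: growth depends on sun and fertilizer, NOT on elevation."""
    return 10 + 5 * sun + 2 * fertilizer_dose + rng.gauss(0, 1)

# Survey: elevation is observed, not assigned. High stands happen to get less sun,
# so elevation and sun exposure are confounded.
high = [growth(rng.uniform(0.0, 0.4), 0) for _ in range(10)]
low = [growth(rng.uniform(0.6, 1.0), 0) for _ in range(10)]
survey_diff = mean(high) - mean(low)

# Experiment: 20 stands with arbitrary sun exposure are randomly split between
# two fertilizer doses, so sun exposure is balanced between groups on average.
stands = [rng.uniform(0.0, 1.0) for _ in range(20)]
rng.shuffle(stands)
dose0 = [growth(s, 0) for s in stands[:10]]
dose1 = [growth(s, 1) for s in stands[10:]]
experiment_diff = mean(dose1) - mean(dose0)

print(f"survey: high minus low elevation   = {survey_diff:.2f}")
print(f"experiment: high minus low dose    = {experiment_diff:.2f}")
```

In the survey, growth differs between elevations even though elevation has no effect in the simulated truth; the difference is produced entirely by sun exposure. In the experiment, random assignment means the observed difference can reasonably be attributed to the dose.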

As noted by Eberhardt and Thomas (1991), there is a need for a rigorous application of the techniques for survey sampling when conducting analytical surveys. Otherwise they are likely to be subject to biases of one sort or another. Experience and judgment are very important in evaluating the prospects for bias, and attempting to find ways to control and account for these biases. The most common source of bias is the selection of survey units and the most common pitfall is to select units based on convenience rather than on a probabilistic sampling design. The potential problems that this can lead to are analogous to those that occur when it is assumed that callers to a radio phone-in show are representative of the entire population.

3.13 References

• Cochran, W.G. (1977). Sampling Techniques. New York: Wiley.
One of the standard references for survey sampling. Very technical.

• Gillespie, G.E. and Kronlund, A.R. (1999). A manual for intertidal clam surveys. Canadian Technical Report of Fisheries and Aquatic Sciences 2270.
A very nice summary of using sampling methods to estimate clam numbers.

• Keith, L.H. (1988), Editor. Principles of Environmental Sampling. New York: American Chemical Society.
A series of papers on sampling mainly for environmental contaminants in ground and surface water, soils, and air. A detailed discussion on sampling for pattern.

• Kish, L. (1965). Survey Sampling. New York: Wiley.
An extensive discussion of descriptive surveys mostly from a social science perspective.

• Kish, L. (1984). On analytical statistics from complex samples. Survey Methodology, 10, 1-7.
An overview of the problems in using complex surveys in analytical surveys.

• Kish, L. (1987). Statistical Designs for Research. New York: Wiley.
One of the more extensive discussions of the use of complex surveys in analytical surveys. Very technical.

• Krebs, C. (1989). Ecological Methodology.
A collection of methods commonly used in ecology including a section on sampling.

• Kronlund, A.R., Gillespie, G.E., and Heritage, G.D. (1999). Survey methodology for intertidal bivalves. Canadian Technical Report of Fisheries and Aquatic Sciences 2214.
An overview of how to use surveys for assessing intertidal bivalves - more technical than Gillespie and Kronlund (1999).

• Myers, W.L. and Shelton, R.L. (1980). Survey Methods for Ecosystem Management. New York: Wiley.
Good primer on how to measure common ecological data using direct survey methods, aerial photography, etc. Includes a discussion of common survey designs for vegetation, hydrology, soils, geology, and human influences.

• Sedransk, J. (1965b). Analytical surveys with cluster sampling. Journal of the Royal Statistical Society, Series B, 27, 264-278.

• Thompson, S.K. (1992). Sampling. New York: Wiley.
A good companion to Cochran (1977). Has many examples of using sampling for biological populations. Also has chapters on mark-recapture, line-transect methods, spatial methods, and adaptive sampling.

3.14 Frequently Asked Questions (FAQ)

3.14.1 Confusion about the definition of a population

What is the difference between the "population total" and the "population size"?

Population size normally refers to the number of “final sampling” units in the population. Population total refers to the total of some variable over these units.

For example, if you wish to estimate the total family income of families in Vancouver, the “final” sampling units are families, the population size is the number of families in Vancouver, the response variable is the income of each family, and the population total is the total family income over all families in Vancouver.

Things become a bit confusing when sampling units differ from “final” units that are clustered and you are interested in estimates of the number of “final” units. For example, in the grouse/pocket of brush example, the population consists of the grouse, which are clustered into 248 pockets of brush. The grouse is the final sampling unit, but the sampling unit is a pocket of brush. In cluster sampling, you must expand the estimator by the number of CLUSTERS, not by the number of final units. Hence the expansion factor is the number of pockets (248), the variable of interest for a cluster is the number of grouse in each pocket, and the population total is the number of grouse over all pockets.
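The expansion by the number of clusters can be sketched as follows. Only the number of pockets (248) comes from the example; the grouse counts and the sample size of 10 pockets are hypothetical, and the pockets are assumed to have been chosen by a simple random sample without replacement, so the usual estimator of a total is applied to the cluster totals.

```python
import math
from statistics import mean, variance

N_POCKETS = 248  # number of clusters (pockets of brush) in the frame

# Hypothetical grouse counts in a simple random sample of 10 pockets.
grouse_per_pocket = [3, 0, 5, 2, 4, 1, 0, 6, 2, 3]

def estimate_total(cluster_totals, n_clusters_in_frame):
    """Expand the mean cluster total by the number of CLUSTERS in the frame."""
    n = len(cluster_totals)
    total_hat = n_clusters_in_frame * mean(cluster_totals)
    # Finite population correction is based on clusters sampled out of clusters available.
    fpc = 1 - n / n_clusters_in_frame
    se = n_clusters_in_frame * math.sqrt(fpc * variance(cluster_totals) / n)
    return total_hat, se

total_hat, se = estimate_total(grouse_per_pocket, N_POCKETS)
print(f"estimated number of grouse: {total_hat:.1f} (SE {se:.1f})")
```

Note that the number of individual grouse never appears as an expansion factor; it is the quantity being estimated.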

Similarly, for the oysters on the lease. The population is the oysters on the lease. But you don’t randomly sample individual oysters – you randomly sample quadrats which are clusters of oysters. The expansion factor is now the number of quadrats.

In the salmon example, the boats are surveyed. The fact that the number of salmon was measured is incidental - you could have measured the amount of food consumed, etc.

In the angling survey problem, the boats are the sampling units. The fact that they contain anglers or that they caught fish is what is being measured, but the set of boats that were at the lake that day is of interest.

3.14.2 How is N defined?

How is N (the expansion factor) defined? What is the best way to find this value?

This can get confusing in the case of cluster or multi-phase designs as there are different N’s at each stage of the design. It might be easier to think of N as an expansion factor.

The expansion factor will be known once the frame is constructed. In some cases, this can only be done after the fact - for example, when surveying angling parties, the total number of parties returning in a day is unknown until the end of the day. For planning purposes, some reasonable guess may have to be made in order to estimate the sample size. If this is impossible, just choose some arbitrary large number - the estimated sample size will be an overestimate (by a small amount) but close enough. Of course, once the survey is finished, you would then use the actual value of N in all computations.
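The effect of guessing a large N at the planning stage can be checked numerically. The sketch below assumes a common textbook sample-size formula for estimating a mean under simple random sampling without replacement, n = n0 / (1 + n0/N) with n0 = (z S / E)^2; this is not necessarily the exact form used earlier in the chapter, and the standard deviation, margin of error, and population sizes are hypothetical.

```python
import math

def required_sample_size(sd, margin, pop_size, z=1.96):
    """Sample size for estimating a mean to within +/- margin (assumed formula)."""
    n0 = (z * sd / margin) ** 2          # sample size ignoring the finite population
    return math.ceil(n0 / (1 + n0 / pop_size))

# Hypothetical planning values: guessed SD of 10, desired margin of error of 2.
n_true_N = required_sample_size(sd=10, margin=2, pop_size=5000)
n_huge_N = required_sample_size(sd=10, margin=2, pop_size=10**9)

print(n_true_N, n_huge_N)
```

Using an arbitrarily large N drops the finite population correction, so the planned sample size is never too small; when the real N is itself large relative to the sample, the two answers differ by only a few units.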

3.14.3 Multi-stage vs. Multi-phase sampling

What is the difference between Multi-stage sampling and multi-phase sampling?

In multi-stage sampling, the selection of the final sampling units takes place in stages. For example, suppose you are interested in sampling angling parties as they return from fishing. The region is first divided into different landing sites. A random sample of landing sites is selected. At each selected landing site, a random sample of angling parties is selected.

In multi-phase sampling, the units are NOT divided into larger groups. Rather, a first phase selects some units and they are measured quickly. A second phase takes a sub-sample of the first phase and measures these units more intensively. Returning to the angling survey, a multi-phase design would select angling parties. All of the selected parties could fill out a brief questionnaire. A week later, a sample of the questionnaires is selected, and the angling parties RECONTACTED for more details.

The key difference is that in multi-phase sampling, some units are measured TWICE; in multi-stage sampling, there are different sizes of sampling units (landing sites vs. angling parties), but each sampling unit is only selected once.
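The two selection schemes can be sketched in code. The frame of 6 landing sites with 8 angling parties each, the sample sizes, and the function names are hypothetical and serve only to show the mechanics of selection.

```python
import random

rng = random.Random(42)

# Hypothetical frame: 6 landing sites, each with 8 angling parties.
sites = {
    f"site{s}": [f"site{s}-party{p}" for p in range(1, 9)]
    for s in range(1, 7)
}

def two_stage_sample(sites, n_sites, n_parties, rng):
    """Stage 1: sample landing sites. Stage 2: sample parties within each chosen site."""
    chosen_sites = rng.sample(sorted(sites), n_sites)
    return {site: rng.sample(sites[site], n_parties) for site in chosen_sites}

def two_phase_sample(parties, n_first, n_second, rng):
    """Phase 1: sample parties for a brief questionnaire.
    Phase 2: sub-sample the phase-1 parties and recontact them for more detail."""
    first_phase = rng.sample(parties, n_first)
    second_phase = rng.sample(first_phase, n_second)
    return first_phase, second_phase

multi_stage = two_stage_sample(sites, n_sites=2, n_parties=3, rng=rng)

all_parties = [party for site in sorted(sites) for party in sites[site]]
first_phase, second_phase = two_phase_sample(all_parties, n_first=12, n_second=4, rng=rng)

print(multi_stage)
print(second_phase)
```

In the two-stage sample, units of different sizes (sites, then parties) are each selected once. In the two-phase sample, every second-phase party also appears in the first phase, so those parties are measured twice.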

3.14.4 What is the difference between a Population and a frame?

Frame = list of sampling units from which a sample will be taken. The sampling units may not be the same as the “final” units that are measured. For example, in cluster sampling, the frame is the list of clusters, but the final units are the objects within the cluster.

Population = list of all “final” units of interest. Usually the “final units” are the actual things measured in the field, i.e. what is the final object upon which a measurement is taken.

In some cases, the frame doesn’t match the population, which may cause biases, but in ideal cases, the frame covers the population.

3.14.5 How to account for missing transects.

What do you do if an entire cluster is “missing”?

Missing data can occur at various parts in a survey and for various reasons. The easiest data to handle is data ‘missing completely at random’ (MCAR). In this situation, the missing data provides no information about the problem that is not already captured by other data points and the ‘missingness’ is also non-informative. In this case, and if the design was a simple random sample, the data point is just ignored. So if you wanted to sample 80 transects, but were only able to get 75, only the 75 transects are used. If some of the data are missing within a transect, the problem changes from a cluster sample to a two-stage sample, so the estimation formulae change slightly.

If data is not MCAR, this is a real problem - welcome to a Ph.D. in statistics in how to deal with it!
