Advanced Sampling Theory with Applications
How Michael 'Selected' Amy
Volume I

by Sarjinder Singh
St. Cloud State University, Department of Statistics, St. Cloud, MN, U.S.A.

SPRINGER-SCIENCE+BUSINESS MEDIA, B.V.

A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 978-94-010-3728-0
ISBN 978-94-007-0789-4 (eBook)
DOI 10.1007/978-94-007-0789-4
Printed on acid-free paper

All Rights Reserved. © 2003 Springer Science+Business Media Dordrecht. Originally published by Kluwer Academic Publishers in 2003. No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

God
With whose grace...

TABLE OF CONTENTS

PREFACE

1 BASIC CONCEPTS AND MATHEMATICAL NOTATION
1.0 Introduction
1.1 Population
1.1.1 Finite population
1.1.2 Infinite population
1.1.3 Target population
1.1.4 Study population
1.2 Sample
1.3 Examples of populations and samples
1.4 Census
1.5 Relative aspects of sampling versus census
1.6 Study variable
1.7 Auxiliary variable
1.8 Difference between study variable and auxiliary variable
1.9 Parameter
1.10 Statistic
1.11 Statistics
1.12 Sample selection
1.12.1 Chit method or lottery method
1.12.1.1 With replacement sampling
1.12.1.2 Without replacement sampling
1.12.2 Random number table method
1.12.2.1 Remainder method
1.13 Probability sampling
1.14 Probability of selecting a sample
1.15 Population mean/total
1.16 Population moments
1.17 Population standard deviation
1.18 Population coefficient of variation
1.19 Relative mean square error
1.20 Sample mean
1.21 Sample variance
1.22 Estimator
1.23 Estimate
1.24 Sample space
1.25 Univariate random variable
1.25.1 Qualitative random variables
1.25.2 Quantitative random variables
1.25.2.1 Discrete random variable
1.25.2.2 Continuous random variable
1.26 Probability mass function (p.m.f.) of a univariate discrete random variable
1.27 Probability density function (p.d.f.) of a univariate continuous random variable
1.28 Expected value and variance of a univariate random variable
1.29 Distribution function of a univariate random variable
1.29.1 Discrete distribution function
1.29.2 Continuous distribution function
1.30 Selection of a sample using known univariate distribution function
1.30.1 Discrete random variable
1.30.2 Continuous random variable
1.31 Discrete bivariate random variable
1.32 Joint probability distribution function of bivariate discrete random variables
1.33 Joint cumulative distribution function of bivariate discrete random variables
1.34 Marginal distributions of a bivariate discrete random variable
1.35 Selection of a sample using known discrete bivariate distribution function
1.36 Continuous bivariate random variable
1.37 Joint probability distribution function of bivariate continuous random variable
1.38 Joint cumulative distribution function of a bivariate continuous random variable
1.39 Marginal cumulative distributions of bivariate continuous random variable
1.40 Selection of a sample using known bivariate continuous distribution function
1.41 Properties of a best estimator
1.41.1 Unbiasedness
1.41.1.1 Bias
1.41.2 Consistency
1.41.3 Sufficiency
1.41.4 Efficiency
1.41.4.1 Variance
1.41.4.2 Mean square error
1.42 Relative efficiency
1.43 Relative bias
1.44 Variance estimation through splitting
1.45 Loss function
1.46 Admissible estimator
1.47 Sample survey
1.48 Sampling distribution
1.49 Sampling frame
1.50 Sample survey design
1.51 Errors in the estimators
1.51.1 Sampling errors
1.51.2 Non-sampling errors
1.51.2.1 Non-response errors
1.51.2.2 Measurement errors
1.51.2.3 Tabulation errors
1.51.2.4 Computational errors
1.52 Point estimator
1.53 Interval estimator
1.54 Confidence interval
1.55 Population proportion
1.56 Sample proportion
1.57 Variance of sample proportion and confidence interval estimates
1.58 Relative standard error
1.59 Auxiliary information
1.60 Some useful mathematical formulae
1.61 Ordered statistics
1.61.1 Population median
1.61.2 Population quartiles
1.61.3 Population percentiles
1.61.4 Population mode
1.62 Definition(s) of statistics
1.63 Limitations of statistics
1.64 Lack of confidence in statistics
1.65 Scope of statistics
Exercises
Practical problems

2 SIMPLE RANDOM SAMPLING
2.0 Introduction
2.1 Simple random sampling with replacement
2.2 Simple random sampling without replacement
2.3 Estimation of population proportion
2.4 Searls' estimator of population mean
2.5 Use of distinct units in the WR sample at the estimation stage
2.5.1 Estimation of mean
2.5.2 Estimation of finite population variance
2.6 Estimation of total or mean of a subgroup (domain) of a population
2.7 Dealing with a rare attribute using inverse sampling
2.8 Controlled sampling
2.9 Determinant sampling
Exercises
Practical problems

3 USE OF AUXILIARY INFORMATION: SIMPLE RANDOM SAMPLING
3.0 Introduction
3.1 Notation and expected values
3.2 Estimation of population mean
3.2.1 Ratio estimator
3.2.2 Product estimator
3.2.3 Regression estimator
3.2.4 Power transformation estimator
3.2.5 A dual of ratio estimator
3.2.6 General class of estimators
3.2.7 Wider class of estimators
3.2.8 Use of known variance of auxiliary variable at estimation stage of population mean
3.2.8.1 A class of estimators
3.2.8.2 A wider class of estimators
3.2.9 Methods to remove bias from ratio and product type estimators
3.2.9.1 Quenouille's method
3.2.9.2 Interpenetrating sampling method
3.2.9.3 Exactly unbiased ratio type estimator
3.2.9.4 Unbiased product type estimator
3.2.9.5 Class of almost unbiased estimators of population ratio and product
3.2.9.6 Filtration of bias
3.3 Estimation of finite population variance
3.3.1 Ratio type estimator
3.3.2 Difference type estimator
3.3.3 Power transformation type estimator
3.3.4 General class of estimators
3.4 Estimation of regression coefficient
3.4.1 Usual estimator
3.4.2 Unbiased estimator
3.4.3 Improved estimators of regression coefficient
3.5 Estimation of finite population correlation coefficient
3.6 Superpopulation model approach
3.6.1 Relationship between linear model and regression estimator
3.6.2 Improved estimator of variance of linear regression estimator
3.6.3 Relationship between linear model and ratio estimator
3.7 Jackknife variance estimator
3.7.1 Ratio estimator
3.7.2 Regression estimator
3.8 Estimation of population mean using more than one auxiliary variable
3.8.1 Multivariate ratio estimator
3.8.2 Multivariate regression type estimators
3.8.3 General class of estimators
3.9 General class of estimators to estimate any population parameter
3.10 Estimation of ratio or product of two population means
3.11 Median estimation in survey sampling
Exercises
Practical problems

4 USE OF AUXILIARY INFORMATION: PROBABILITY PROPORTIONAL TO SIZE AND WITH REPLACEMENT (PPSWR) SAMPLING
4.0 Introduction
4.1 What is PPSWR sampling?
4.1.1 Cumulative total method
4.1.2 Lahiri's method
4.2 Estimation of population total
4.3 Relative efficiency of PPSWR sampling with respect to SRSWR sampling
4.3.1 Superpopulation model approach
4.3.2 Cost aspect
4.4 PPSWR sampling: More than one auxiliary variable is available
4.4.1 Notation and expectations
4.4.2 Class of estimators
4.4.3 Wider class of estimators
4.4.4 PPSWR sampling with negatively correlated variables
4.5 Multi-character survey
4.5.1 Study variables have poor positive correlation with the selection probabilities
4.5.1.1 General class of estimators
4.5.2 Study variables have poor positive as well as poor negative correlation with the selection probabilities
4.6 Concept of revised selection probabilities
4.7 Estimation of correlation coefficient using PPSWR sampling
Exercises
Practical problems

5 USE OF AUXILIARY INFORMATION: PROBABILITY PROPORTIONAL TO SIZE AND WITHOUT REPLACEMENT (PPSWOR) SAMPLING
5.0 Introduction
5.0.1 Useful symbols
5.0.2 Some mathematical relations
5.1 Horvitz and Thompson estimator and related topics
5.2 General class of estimators
5.3 Model based estimation strategies
5.3.1 A brief history of the superpopulation model
5.3.2 Scott, Brewer and Ho's robust estimation strategy
5.3.3 Design variance and anticipated variance of linear regression type estimator
5.4 Construction and optimal choice of inclusion probabilities
5.4.1 Pareto πps sampling estimation scheme
5.4.2 Hanurav's method
5.4.3 Brewer's method
5.4.4 Sampford's method
5.4.5 Narain's method
5.4.6 Midzuno–Sen method
5.4.7 Kumar–Gupta–Nigam scheme
5.4.8 Dey and Srivastava scheme for even sample size
5.4.9 SSS sampling scheme
5.4.10 Optimal choice of first order inclusion probabilities
5.5 Calibration approach
5.6 Calibrated estimator of the variance of the estimator of population total
5.7 Estimation of variance of GREG
5.8 Improved estimator of variance of the GREG: The higher level calibration approach
5.8.1 Recalibrated estimator of the variance of GREG
5.8.2 Recalibration using optimal designs for the GREG
5.9 Calibrated estimators of variance of estimator of total and distribution function
5.9.1 Unified setup
5.10 Calibration of estimator of variance of regression predictor
5.10.1 Chaudhuri and Roy's results
5.10.2 Calibrated estimators of variance of regression predictor
5.10.2.1 Model assisted calibration
5.10.2.2 Calibration estimators when variance of auxiliary variable is known
5.10.2.2.1 Each component of Vx is known
5.10.2.2.2 Compromised calibration
5.10.2.3 Prediction variance
5.11 Ordered and unordered estimators
5.11.1 Ordered estimators
5.11.2 Unordered estimators
5.12 Rao–Hartley–Cochran (RHC) sampling strategy
5.13 Unbiased strategies using IPPS sampling schemes
5.13.1 Estimation of population mean using a ratio estimator
5.13.2 Estimation of finite population variance
5.14 Godambe's strategy: Estimation of parameters in survey sampling
5.14.1 Optimal estimating function
5.14.2 Regression type estimators
5.14.3 Singh's strategy in two-dimensional space
5.14.4 Godambe's strategy for linear Bayes and optimal estimation
5.15 Unified theory of survey sampling
5.15.1 Class of admissible estimators
5.15.2 Estimator
5.15.3 Admissible estimator
5.15.4 Strictly admissible estimator
5.15.5 Linear estimators of population total
5.15.6 Admissible estimators of variances of estimators of total
5.15.6.1 Condition for the unbiased estimator of variance
5.15.6.2 Admissible and unbiased estimator of variance
5.15.6.3 Fixed size sampling design
5.15.6.4 Horvitz and Thompson estimator and its variance in two forms
5.15.7 Polynomial type estimators
5.15.8 Alternative optimality criterion
5.15.9 Sufficient statistic in survey sampling
5.16 Estimators based on conditional inclusion probabilities
5.17 Current topics in survey sampling
5.17.1 Survey design
5.17.2 Data collection and processing
5.17.3 Estimation and analysis of data
5.18 Miscellaneous discussions/topics
5.18.1 Generalized IPPS designs
5.18.2 Tam's optimal strategies
5.18.3 Use of ranks in sample selection
5.18.4 Prediction approach
5.18.5 Total of bottom (or top) percentiles of a finite population
5.18.6 General form of estimator of variance
5.18.7 Poisson sampling
5.18.8 Cosmetic calibration
5.18.9 Mixing of non-parametric models in survey sampling
5.19 Golden Jubilee Year 2003 of the linear regression estimator
Exercises
Practical problems

6 USE OF AUXILIARY INFORMATION: MULTI-PHASE SAMPLING
6.0 Introduction
6.1 SRSWOR scheme at the first as well as at the second phases of the sample selection
6.1.0 Notation and expected values
6.1.1 Ratio estimator
6.1.1.1 Cost function
6.1.2 Difference estimator
6.1.3 Regression estimator
6.1.4 General class of estimators of population mean
6.1.5 Estimation of finite population variance
6.1.6 Calibration approach in two-phase sampling
6.2 Two-phase sampling using two auxiliary variables
6.3 Chain ratio type estimators
6.4 Calibration using two auxiliary variables
6.5 Estimation of variance of calibrated estimator in two-phase sampling: low and higher level calibration
6.6 Two-phase sampling using multi-auxiliary variables
6.7 Unified approach in two-phase sampling
6.8 Concept of three-phase sampling
6.9 Estimation of variance of regression estimator under two-phase sampling
6.10 Two-phase sampling using PPSWR sampling
6.11 Concept of dual frame surveys
6.11.1 Common variables used for further calibration of weights
6.11.2 Estimation of variance using dual frame surveys
6.12 Estimation of median using two-phase sampling
6.12.1 General class of estimators
6.12.2 Regression type estimator
6.12.3 Position estimator
6.12.4 Stratification estimator
6.12.5 Optimum first and second phase samples for median estimation
6.12.5.1 Cost is fixed
6.12.5.2 Variance is fixed
6.12.6 Kuk and Mak's technique in two-phase sampling
6.12.7 Chen and Qin technique in two-phase sampling
6.13 Distribution function with two-phase sampling
6.14 Improved version of two-phase calibration approach
6.14.1 Improved first phase calibration
6.14.2 Improved second phase calibration
Exercises
Practical problems

VOLUME II

7 SYSTEMATIC SAMPLING
7.0 Introduction
7.1 Systematic sampling
7.2 Modified systematic sampling
7.3 Circular systematic sampling
7.4 PPS circular systematic sampling
7.5 Estimation of variance under systematic sampling
7.5.1 Sub-sampling or replicated sub-sampling scheme
7.5.2 Successive differences
7.5.3 Variance of circular systematic sampling
7.6 Systematic sampling in population with linear trend
7.6.1 Estimators with linear trend
7.6.2 Modification of estimates
7.6.3 Estimators based on centrally located samples
7.6.4 Estimators based on balanced systematic sampling
7.7 Singh and Singh's systematic sampling scheme
7.8 Zinger strategy in systematic sampling
7.9 Populations with cyclic or periodic trends
7.10 Multi-dimensional systematic sampling
Exercises
Practical problems

8 STRATIFIED AND POST-STRATIFIED SAMPLING
8.0 Introduction
8.1 Stratified sampling
8.2 Different methods of sample allocation
8.2.1 Equal allocation
8.2.2 Proportional allocation
8.2.3 Optimum allocation method
8.3 Use of auxiliary information at estimation stage
8.3.1 Separate ratio estimator
8.3.2 Separate regression estimator
8.3.3 Combined ratio estimator
8.3.4 Combined regression estimator
8.3.5 On degree of freedom in stratified random sampling
8.4 Calibration approach for stratified sampling design
8.4.1 Exact combined linear regression using calibration
8.5 Construction of strata boundaries
8.5.1 Strata boundaries for proportional allocation
8.5.2 Strata boundaries for Neyman allocation
8.5.3 Stratification using auxiliary information
8.6 Superpopulation model approach
8.7 Multi-way stratification
8.8 Stratum boundaries for multi-variate populations
8.9 Optimum allocation in multi-variate stratified sampling
8.10 Stratification using two-phase sampling
8.11 Post-stratified sampling
8.11.1 Conditional post-stratification
8.11.2 Unconditional post-stratification
8.12 Estimation of proportion using stratified random sampling
Exercises
Practical problems

9 NON-OVERLAPPING, OVERLAPPING, POST, AND ADAPTIVE CLUSTER SAMPLING
9.0 Introduction
9.1 Non-overlapping clusters of equal size
9.2 Optimum value of non-overlapping cluster size
9.3 Estimation of proportion using non-overlapping cluster sampling
9.4 Non-overlapping clusters of different sizes
9.5 Selection of non-overlapping clusters with unequal probability sampling
9.6 Optimal and robust strategies for non-overlapping cluster sampling
9.7 Overlapping cluster sampling
9.7.1 Population size is known
9.7.2 Population size is unknown
9.8 Post-cluster sampling
9.9 Adaptive cluster sampling
Exercises
Practical problems

10 MULTI-STAGE, SUCCESSIVE, AND RE-SAMPLING STRATEGIES
10.0 Introduction
10.1 Notation
10.2 Procedure for construction of estimators of the total
10.3 Method of calculating the variance of the estimators
10.3.1 Selection of first and second stage units using SRSWOR sampling
10.3.2 Optimum allocation in two-stage sampling
10.4 Optimum allocation of sample in three-stage sampling
10.5 Modified three-stage sampling
10.6 General class of estimators in two-stage sampling
10.7 Prediction estimator under two-stage sampling
10.8 Prediction approach to robust variance estimation in two-stage cluster sampling
10.8.1 Royall's technique of variance estimation
10.9 Two-stage sampling with successive occasions
10.9.1 Arnab's successive sampling scheme
10.10 Estimation strategies in supplemented panels
10.11 Re-sampling methods
10.11.1 Jackknife variance estimator
10.11.2 Balanced half sample (BHS) method
10.11.3 Bootstrap variance estimator
Exercises
Practical problems

11 RANDOMIZED RESPONSE SAMPLING: TOOLS FOR SOCIAL SURVEYS
11.0 Introduction
11.1 Pioneer model
11.2 Franklin's model
11.3 Unrelated question model and related issues
11.3.1 When proportion of unrelated character is known
11.3.2 When proportion of unrelated character is unknown
11.4 Regression analysis
11.4.1 Ridge regression estimator
11.5 Hidden gangs in finite populations
11.5.1 Two sample method
11.5.2 One sample method
11.5.3 Estimation of correlation coefficient between two characters of a hidden gang
11.6 Unified approach for hidden gangs
11.7 Randomized response technique for a quantitative variable
11.8 GREG using scrambled responses
11.8.1 Calibration of scrambled responses
11.8.2 Higher order calibration of the estimators of variance under scrambled responses
11.8.3 General class of estimators
11.9 On respondent's protection: Qualitative characters
11.9.1 Leysieffer and Warner's measure
11.9.2 Lanke's measure
11.9.3 Mangat and Singh's two-stage model
11.9.4 Mangat and Singh's two-stage and Warner's model at equal level of protection
11.9.5 Mangat's model
11.9.6 Mangat's and Warner's model at equal level of protection
11.10 On respondent's protection: Quantitative characters
11.10.1 Unrelated question model for quantitative data
11.10.2 The additive model
11.10.3 The multiplicative model
11.10.4 Measure of privacy protection
11.10.5 Comparison between additive and multiplicative models
11.11 Test for detecting untruthful answering
11.12 Stochastic randomized response technique
Exercises
Practical problems

12 NON-RESPONSE AND ITS TREATMENTS
12.0 Introduction
12.1 Hansen and Hurwitz pioneer model
12.2 Politz and Simmons model
12.3 Horvitz and Thompson estimator under non-response
12.4 Ratio and regression type estimators
12.4.1 Distribution and some expected values
12.4.2 Estimation of population mean
12.4.3 Estimation of finite population variance
12.5 Calibrated estimators of total and variance in the presence of non-response
12.5.1 Estimation of population total and variance
12.5.2 Calibration estimator for the total
12.5.3 Calibration of the estimators of variance
12.5.3.1 PPSWOR sampling
12.5.3.2 SRSWOR sampling
12.6 Different treatments of non-response
12.6.1 Ratio method of imputation
12.6.2 Mean method of imputation
12.6.3 Hot deck (HD) method of imputation
12.6.4 Nearest neighbor (NN) method of imputation
12.7 Superpopulation model approach
12.7.1 Different components of variance
12.8 Jackknife technique
12.9 Hot deck imputation for multi-stage designs
12.10 Multiple imputation
12.10.1 Degree of freedom with multiple imputation for small samples
12.11 Compromised imputation
12.11.1 Practicability of compromised imputation
12.11.2 Recommendations of compromised imputation
12.11.3 Warm deck imputation
12.11.4 Mean cum NN imputation
12.12 Estimation of response probabilities
12.13 Estimators based on estimated response probabilities
12.13.1 Estimators based on response probabilities
12.13.2 Calibration of response probabilities
12.13.2.1 Calibrated estimator and its variance
12.13.2.2 Estimation of variance of the calibrated estimator
Exercises
Practical problems

13 MISCELLANEOUS TOPICS
13.0 Introduction
13.1 Estimation of measurement errors
13.1.1 Estimation of measurement error using a single measurement per element
13.1.1.1 Model and notation
13.1.1.2 Grubbs' estimators
13.1.2 Bhatia, Mangat, and Morrison's (BMM) repeated measurement estimators
13.1.2.1 Model and notation
13.2 Raking ratio using contingency tables
13.3 Continuous populations
13.4 Small area estimation
13.4.1 Symptomatic accounting techniques
13.4.2 Vital rates method (VRM)
13.4.3 Census component method (CCM)
13.4.4 Housing unit method (HUM)
13.4.5 Synthetic estimator
13.4.6 Composite estimator
13.4.7 Model based techniques
13.4.7.1 Henderson's model
13.4.7.2 Nested error regression model
13.4.7.3 Random regression coefficient model
13.4.7.4 Fay and Herriot model
13.4.8 Further generalizations
13.4.9 Estimation of proportion of a characteristic in small areas of a population
Exercises
Practical problems

APPENDIX TABLES
1 Pseudo-Random Numbers (PRN)
2 Critical values based on t distribution
3 Area under the standard normal curve

POPULATIONS
1 All operating banks: Amount (in $000) of agricultural loans outstanding in different states in 1997
2 Hypothetical situation of a small village having only 30 older persons (age more than 50 years): Approximate duration of sleep (in minutes) and age (in years) of the persons
3 Apples, commercial crop: Season average price (in $) per pound, by States, 1994–1996
4 Fish caught: Estimated number of fish caught by marine recreational fishermen by species group and year, Atlantic and Gulf coasts, 1992–1995
5 Tobacco: Area (hectares), yield and production (metric tons) in specified countries during 1998
6 Age specific death rates from 1990 to 2065 (Number per 100,000 births)
7 State population projections, 1995 and 2000 (Number in thousands)
8 Projected vital statistics by country or area during 2000
9 Number of immigrants admitted to the USA

BIBLIOGRAPHY
AUTHOR INDEX
HANDY SUBJECT INDEX
ADDITIONAL INFORMATION

PREFACE

Advanced Sampling Theory with Applications: How Michael 'Selected' Amy is a comprehensive exposition of basic and advanced sampling techniques along with their applications in the diverse fields of science and technology. This book is a multi-purpose document. It can be used as a text by teachers, as a reference manual by researchers, and as a practical guide by statisticians. It covers 1179 references from different research journals through almost 2158 citations across 1248 pages, a large number of complete proofs of theorems, important results such as corollaries, and 335 unsolved exercises from several research papers. It includes 162 solved, data based, real life numerical examples in disciplines such as Agriculture, Demography, Social Science, Applied Economics, Engineering, Medicine, and Survey Sampling.
These solved examples are very useful for an understanding of the applications of advanced sampling theory in our daily life and in diverse fields of science. An additional 177 unsolved practical problems are given at the ends of the chapters. University and college professors may find these useful when assigning exercises to students. Each exercise gives exposure to several complete research papers for researchers and students. For example, Exercise 3.1 at the back of Chapter 3 examines different types of estimators of a population mean studied by Chakrabarty (1968), Vos (1980), Adhvaryu and Gupta (1983), Walsh (1970), Sahai and Sahai (1985), and Sisodia and Dwivedi (1981); thus this single exercise discusses six research papers. Similarly, Exercise 5.7 explains the other possibilities in the calibration approach considered by Deville and Sarndal (1992) and their followers.

The data based problems show statisticians how to select a sample and obtain estimates of parameters from a given population by using different sampling strategies like SRSWR, SRSWOR, PPSWR, PPSWOR, RHC, systematic sampling, stratified sampling, cluster sampling, and multi-stage sampling. Derivations of calibration weights from the design weights under single phase and two-phase sampling have been provided for simple numerical examples. These examples will be useful for understanding how benchmark constraints are used to improve the design weights. They also explain the background of well known scientific computer packages such as CALMAR, GES, SAS, STATA, and SUDAAN (some of which are very expensive), which are used to generate calibration weights by most organizations in the public and private sectors. The ideas of hot deck, cold deck, mean method of imputation, ratio method of imputation, compromised imputation, and multiple imputation have been explained with very simple numerical examples.
Simple examples are also provided to understand jackknife variance estimation under single phase, two-phase [or random non-response, following Sitter (1997)], and multi-stage stratified designs.

I have provided a summary of my book from which a statistician can reach a fruitful decision by making a comparison in his/her mind with the existing books in the international market.

[A chapter-by-chapter summary table appears here in the original, covering the title pages, dedication, table of contents, preface, Chapters 1 through 13, appendix, bibliography, author index, subject index, and related books; its column values are not recoverable from this scan.]

This book also covers, in a very simple and compact way, many new topics not yet available in any book on the international market. A few of these interesting topics are: median estimation under single phase and two-phase sampling, the difference between the low level and higher level calibration approaches, calibration weights and design weights, estimation of parametric functions, hidden gangs in finite populations, compromised imputation, variance estimation using distinct units, general class of estimators of population mean and variance, wider class of estimators of population mean and variance, power transformation estimators, estimators based on the mean of non-sampled units of the auxiliary character, ratio and regression type estimators for estimating finite population variance similar to those proposed by Isaki (1982), unbiased estimators of mean and variance under Midzuno's scheme of sampling, usual and modified jackknife variance estimators,
estimation of regression coefficient, the concept of revised selection probabilities, multi-character survey sampling, overlapping, adaptive, and post cluster sampling, new techniques in systematic sampling, successive sampling, small area estimation, continuous populations, and estimation of measurement errors.

This book has 459 tables, figures, maps, and graphs to explain the exercises and theory in a simple way. The collection of 1179 references (assembled over more than ten years from journals available in India, Australia, Canada, and the USA) is a vital resource for researchers. The most interesting part is the method of notation along with complete proofs of the basic theorems. From my experience and discussions with several research workers in survey sampling, I found that most people dislike the form or method of notation used by different writers in the past. In this book I have tried to keep the notation simple, neat, and understandable. I used data relating to the United States of America and other countries of the world, so that international students should find it interesting and easy to understand. I am confident that the book will find a good place and reputation in the international market, as there is currently no book which is so thorough and simple in its presentation of the subject of survey sampling. The objective, style, and pattern of this book are quite different from other books available in the market.

This book will be helpful to:
(a) Graduates and undergraduates majoring in statistics and programs where sampling techniques are frequently used;
(b) Graduates currently involved in M.Sc. or Ph.D.
programs in sampling theory or using sampling techniques in their research;
(c) Government organizations such as the US Bureau of Statistics, Statistics Canada, the Australian Bureau of Statistics, the New Zealand Bureau of Statistics, and the Indian Statistical Institute, in addition to private organizations such as RAND and WESTSTAT, etc.

In this book I have begun each chapter with basic concepts and complete derivations of the theorems or results, and I have ended each chapter by filling the gap between the origin of each topic and the recent references. In each chapter I have provided exercises which summarize the research papers. Thus this book not only gives the basic techniques of sampling theory but also reviews most of the research papers available in the literature related to sampling theory. It will also serve as an umbrella of references under different topics in sampling theory, in addition to clarifying the basic mathematical derivations. In short, it is an advanced book, but it provides an exposure to elementary ideas too. It is a much better restatement of the existing knowledge available in journals and books. I have used data, graphs, tables, and pictures to make sampling techniques clear to the learners.

EXERCISES

At the end of each chapter I have provided exercises, and their solutions are given through references to the related research papers. Exercises can be used to clarify or relate the classroom work to the other possibilities in the literature. At the end of each chapter I have also provided practical problems which enable students and teachers to do additional exercises with real data. I have taken real data related to the United States of America and many other countries around the world. This data is freely available in libraries for public use, and it has been provided in the Appendix of this book for the convenience of the readers. This will be interesting to the international students.
NEWTECHNOLOGIES ', ;> ;',.f "S tudy/Variable Auxiliary Variable Cost More Less Effort More Less Sources of availability Current Surveys or Current or Past Survey, Experiments Books or Journals etc. Interest of an investigator More Less Error in measurement More Less Sources of error More Fewer Notation Y X,Z 1.9 PARAMETER An unknown quantity, which may vary over different sets of values forming population is called a parameter. Any function of population values of a variable is called a parameter. It is generally denoted by O. Mathematically, suppose a population n consists of N units and the value of its /" unit is Yi . Then any function of Y; values is a parameter, i.e., Parameter = f(Y1'Y2 ' ....' YN ) . (1.9 .1) For example, if Y; denotes the total life time of the /" bulb, then the average life time of the bulbs produced by the company is a parameter and is given by I Parameter = -(l+Y2+ .... +YN ) . (1.9 .2) N 1.10 STATISTIC A summary value calculated from a sample of observations, usually but not necessarily as an estimator of some population parameter is called a statistic and is generally denoted bye . Mathematically, suppose a sample s consists of n units and the value of the /" unit of the sample is denoted by vt- Any function of vt values will be a statistic, i.e., 29. 4 Advanced sampling theory with applications Statistic = /(YI 'YZ''Yn) (1.10.1) For example, if Yi denotes the total life time of the {1' bulb, then the average life time of the bulbs produced by the company is estimated by the statistic, defined as Statistic = ..!..(YI + yz + .....+Yn) ' (1.10.2) n I:11 STATISTICS Statistics is a science of collecting, analysing and interpreting numerical data relating to an aggregate of individuals or units. 1.12 SAMPLE SELECTION A sample can be selected from a population in many ways. In this chapter, we will discuss only two simple methods of sample selection. 
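The distinction between a parameter (a function of all N population values) and a statistic (the same kind of function computed from the n sampled values) can be illustrated with a short Python sketch. Python is not used in the text; the bulb lifetimes below are hypothetical numbers invented for illustration only.

```python
# Hypothetical lifetimes (in hours) of all N = 6 bulbs made by a company
lifetimes = [1200, 950, 1100, 1300, 1000, 1250]

# Parameter: f(Y1, ..., YN) -- computable only from a census of the population
parameter = sum(lifetimes) / len(lifetimes)

# Statistic: f(y1, ..., yn) -- computable from the n sampled bulbs alone
sample = [950, 1300, 1000]
statistic = sum(sample) / len(sample)

print(round(parameter, 2), round(statistic, 2))   # 1133.33 1083.33
```

The parameter stays fixed for the population, while the statistic varies from sample to sample; later sections quantify exactly how.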
As the readers get familiar with sample selection, more complicated schemes will be discussed in following chapters. 1.12:1 CHITMETHODOR UOTTERYMETHOD Suppose we have N = 10,000 blocks in New York City. We wish to draw a sample of n = 100blocks to draw an inference about a character unde r study, e.g., average amount of alcohol used or number of bulbs used in each block produced by a certain company. Assign numbers to the 10,000 blocks and write these numbers on chits and fold them in such way that all chits look identical. Put all the chits in a box. Then there are two possibilities: 1.12.1.1 WITH REPLACEMENT SAMPLING Select one chit out of 10,000 chits in the box and note the number of the block written on it. This is the first unit selected in the sample. Before selecting the second chit, we replace the first chit in the box and mix with the other chits thoroughly. Then select the second chit and note the name of the block written on it. This is called the second unit selected in the sample. Go on repeating the process, until 100 chits have been selected. Note that the chits are selected after replacing the previous chit in the box some chits may be selected more than once. Such a sampling procedure is called Simple Random Sampling With Replacement or simply SRSWR sampling. Let us explain with a few numbers of block s in a population as follows : Suppose a population consists of N = 3blocks , say A, B and C . We wish to draw all possible samples of size n = 2 using SRSWR sampling. The possible ordered samples are: AA, AB, AC, BA, BB, BC, CA, CB, cc. Thus a total 9 samples of size 2 can be drawn from the population of size 3, which in fact is given by 3z = 9. 30. Chapter I: Basic concepts and mathematical notation 5 In general, the total number of samples of size n drawn from a population of size N in with replacement sampling is Nil and is denoted by s(n). 
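The enumeration of the 3² = 9 ordered SRSWR samples above can be reproduced with a short script (illustrative Python, not part of the text):

```python
from itertools import product

blocks = ["A", "B", "C"]     # population of N = 3 blocks
n = 2                        # sample size

# All ordered with-replacement (SRSWR) samples of size n
wr_samples = list(product(blocks, repeat=n))
print(["".join(s) for s in wr_samples])   # AA, AB, AC, BA, BB, BC, CA, CB, CC
print(len(wr_samples), 3 ** n)            # both equal N^n = 9
```

Replacing the toy population with N = 10,000 and n = 100 shows why the question posed in the text is rhetorical: the count N^n is astronomically large.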
Thus s(n) = s", (1.12 .1) Now imagine the situation, 'How many WR samples, each of n = 100blocks, are possible out of N = 10,000blocks?' 1.12.1.2 WITHOUT REPLACEMENT SAMPLING In case of without replacement sampling, we do not replace the chit while selecting the next chit; i.e., the number of chits in the box goes on decreasing as we go on selecting chits. Hence, there is no chance for a chit to be selected more than once. Such a sampling procedure is called Simple Random Sampling and Without Replacement or simply SRSWOR sampling. Let us explain it as follows: Suppose a population consists of N = 3 blocks A, Band C. We wish to draw all possible unordered samples of size n = 2. Evidently, the possible samples are: AB, AC, BC. Thus a total of 3 samples of size 2 can be drawn from the population of size 3, which in fact is given by 3C 2 = 3 . In general, the total number of samples of size n drawn without replacement from a population of size N is given by NCII or Thus () N N! s n = CII = ( ) n! N-n. where n! = n(n-IXn - 2).......2.1, and O! = I. (1.12.2) Now think again, 'How many WOR samples, each of n = 100blocks, are possible out of N = 10,000blocks?' Note that it is a very cumbersome job to make identical chits if the size of the population is very large. In such situations, another method of sample selection is based on the use of a random number table . A random number table is a set of numbers used for drawing random samples. The numbers are usually compiled by a process involving a chance element, and in their simplest form, consist of a series of digits 0 to 9 occurring at random with equal probability. 1.12.2 RANDOM NUMBERTABLE METHOD As mentioned above, in this table the numbers from 0 to 9 are written both in columns and rows. For the purpose of illustrations, we used Pseudo-Random Numbers (PRN), generated by using the UNIF subroutine following Bratley, Fox, 31. 
6 Advanced sampling theory with applications and Schrage (1983), as given in Table 1 of the Appendix. We generally apply the following rules to select a sample: Rule 1. First we write all random numbers into groups of columns as already done in Table I of the Appendix. We take as many columns in each group as the number of digits in the population size. Rule 2. List all the individuals or units in the population and assign them numbers 1,2,3,...,N. Rule 3. Randomly select any starting point in the table of random numbers. Write all the numbers less than or equal to N that follow the starting point until we obtain n numbers. If we are using SRSWOR sampling discard any number that is repeated in the random number table. If we are using SRSWR sampling retain the repeated numbers. Rule 4. Select those units that are assigned the numbers listed in Rule 3. This will constitute a required random sample . Let us explain these rules as follows : Suppose we are given a population of N = 225 units and we want to select a sample of say n = 36 units from it. To pickup a random sample of 36 units out of a population of 225 units , use any three columns from the random number table. For example, use column I to 3, 4 to 6, etc., rejecting any number greater than 225 (and also the number 000) . As an example, the following table lists the 36 units selected using SRSWR sampling procedure with the use of Pseudo-Random Numbers (PRN) given in Table 1 of the Appendix. Uiiits selected in the sample 014 049 053 039 196 183 171 225 179 153 142 138 070 083 001 209 222 075 219 092 155 012 099 211 027 039 048 048 080 161 006 059 199 150 025 173 In the case of SRSWOR sampling, the figures 039, 048 would not get repeated; i.e., we would take every unit only once, so we will continue to select two more distinct random numbers as 078 and 163. 
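Rule 3 above — keep only numbers between 1 and N, and under SRSWOR discard any repeats until n distinct units are chosen — can be sketched as a small function. The random-number stream below is hypothetical, not taken from Table 1 of the Appendix; Python is used purely for illustration.

```python
def srswor_from_table(stream, N, n):
    """Scan a stream of random numbers, keep values 1..N,
    skip repeats (SRSWOR), and stop once n units are chosen."""
    chosen = []
    for r in stream:
        if 1 <= r <= N and r not in chosen:
            chosen.append(r)
        if len(chosen) == n:
            break
    return chosen

# 300 and 999 exceed N = 225 and are rejected; the second 14 is a repeat
print(srswor_from_table([14, 300, 49, 999, 14, 53, 39], N=225, n=4))  # [14, 49, 53, 39]
```

For SRSWR the `r not in chosen` check would simply be dropped, so repeated numbers are retained.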
Although the above method of selecting a sample by using a random number table is straightforward, it may lead to a lot of rejections of random numbers; therefore we discuss a shortcut method called the remainder method.

1.12.2.1 REMAINDER METHOD

Using the above example, if any selected three digit random number is greater than 225, divide it by 225. We choose the serial number from 1 through 224 corresponding to the remainder when it is not zero, and the serial number 225 when the remainder is zero. However, it is necessary to reject the numbers from 901 to 999 (besides 000) in adopting this procedure, as otherwise units with serial numbers 1 to 99 would have a larger probability (5/999) of selection, while those with serial numbers 100 to 225 would have probability equal to only 4/999. If we use this procedure and the same three figure random numbers as given in columns 1 to 3, 4 to 6, etc., we obtain the sample of units which are assigned the numbers given below. Again, in SRSWR sampling the numbers that give rise to the same remainder are not discarded, while in SRSWOR sampling such numbers are discarded. Thus an SRSWR sample is as given below:

Units selected in the sample
138 151 099 025 014 022 197 176 111
209 042 194 015 049 095 040 027 124
116 097 126 142 073 158 108 053 046
001 207 156 201 027 111 209 065 184

Note that in the SRSWR sample, only one unit 209 is repeated; thus for SRSWOR sampling we continue to apply the remainder approach until another distinct unit is selected, which is 089 in this case. Further note that the first random number 992 was discarded due to the requirement of this rule.

1.13 PROBABILITY SAMPLING

Probability sampling is any method of selection of a sample based on the theory of probability. At any stage of the selection process, the probability of a given set of units being selected must be known.
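The remainder method described above can be sketched as a function (illustrative Python, not part of the text; the input numbers other than the rejected 992 are hypothetical):

```python
def remainder_select(random_numbers, N=225):
    """Map 3-digit random numbers to serial numbers 1..N by the remainder
    method: reject 000 and 901-999 (so every serial number keeps the same
    number of chances), then take r mod N, using N when the remainder is 0."""
    serials = []
    for r in random_numbers:
        if r == 0 or r > 900:      # rejected numbers
            continue
        rem = r % N
        serials.append(rem if rem != 0 else N)
    return serials

# 992 is rejected; 363 -> 138, 225 -> 225, 226 -> 1
print(remainder_select([992, 363, 225, 226]))   # [138, 225, 1]
```

Keeping exactly four three-digit numbers (k, k+225, k+450, k+675) for each serial number is what makes every unit equally likely under this shortcut.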
1.14 PROBABILITY OF SELECTING A SAMPLE

Every sample selected from the population has some known probability of being selected on any occasion. It is generally denoted by the symbol Pt or p(t). For example, the probability of selecting a sample using with replacement sampling is

Pt = 1/N^n,  t = 1, 2, ..., N^n,    (1.14.1)

and using without replacement sampling,

Pt = 1/NCn,  t = 1, 2, ..., NCn.    (1.14.2)

The following table describes the differences between the with replacement and without replacement sampling procedures.

With replacement sampling:
-- Cheaper.
-- A few units may be selected more than once.
-- Less efficient.
-- Number of possible samples s(n) = N^n.
-- Probability of selecting a particular sample Pt = 1/N^n, t = 1, 2, ..., N^n.
-- Probability of selecting the i-th unit in a sample Pi = 1/N, i = 1, 2, ..., N.

Without replacement sampling:
-- Costly.
-- A unit can get selected only once.
-- More efficient.
-- Number of possible samples s(n) = NCn.
-- Probability of selecting a particular sample Pt = 1/NCn, t = 1, 2, ..., NCn.
-- Probability of selecting the i-th unit in a sample Pi = 1/N, i = 1, 2, ..., N.

1.15 POPULATION MEAN/TOTAL

Let Yi, i = 1, 2, ..., N, denote the value of the i-th unit in a population. Then the population mean is defined as

Ybar = (1/N)(Y1 + Y2 + ... + YN) = (1/N) Σ Yi,    (1.15.1)

and the population total is defined as

Y = (Y1 + Y2 + ... + YN) = Σ Yi = N Ybar.    (1.15.2)

The units of measurement of the population mean are the same as those of the actual data. For example, if the i-th unit Yi, for all i, is measured in dollars, then the population mean Ybar has the same units, dollars.

1.16 POPULATION MOMENTS

The r-th order central population moments are defined as

μr = (1/(N-1)) Σ (Yi - Ybar)^r,  r = 2, 3, ....    (1.16.1)

If r = 2 then μ2 represents the second order population moment, given by

μ2 = Sy² = (1/(N-1)) Σ (Yi - Ybar)²,    (1.16.2)

and is named the population mean square.
Note that the population variance is defined as

σy² = (1/N) Σ (Yi - Ybar)² = ((N-1)/N) Sy².    (1.16.3)

If the data are in dollars, then the units of measurement of σy² will be dollars².

1.17 POPULATION STANDARD DEVIATION

The positive square root of the population variance is called the population standard deviation and is denoted by σy. The units of measurement of σy will again be the same as those of the actual data. For instance, in the above example, the units of measurement of σy will be dollars.

1.18 POPULATION COEFFICIENT OF VARIATION

The ratio of the standard deviation to the population mean is called the coefficient of variation. It is denoted by Cy, that is,

Cy = σy / Ybar.    (1.18.1)

Evidently Cy is a unit free number. It is useful to compare the variability in two different populations having different units of measurement, e.g., $ and kg. It is also called the relative standard error (RSE). Sometimes we also consider Cy = Sy / Ybar.

1.19 RELATIVE MEAN SQUARE ERROR

The relative mean square error is defined as the square of the coefficient of variation Cy and is generally written RMSE. Mathematically,

RMSE = Cy² = σy² / Ybar².    (1.19.1)

Sometimes it is also denoted by φ².

1.20 SAMPLE MEAN

Let yi, i = 1, 2, ..., n, denote the value of the i-th unit selected in the sample; then the sample mean is defined as

ybar = (1/n) Σ yi.    (1.20.1)

1.21 SAMPLE VARIANCE

The sample variance sy² is defined as

sy² = (1/(n-1)) Σ (yi - ybar)².    (1.21.1)

Remark 1.1. The population mean Ybar and population variance σy², etc., are unknown quantities (parameters) and can be denoted by the symbol θ. The sample mean ybar and sample variance sy², etc., are known after sampling, are called statistics, and can be denoted by θ̂. Also note that the sample standard deviation (or standard error) and the sample coefficient of variation can be defined as sy = √(sy²) and cy = sy / ybar, respectively.
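The identity σy² = ((N-1)/N) Sy² in (1.16.3), together with the coefficient of variation, can be checked numerically on a small hypothetical population (illustrative Python, not part of the text):

```python
def pop_moments(Y):
    """Population mean, variance (sigma^2), mean square (S^2),
    standard deviation, and coefficient of variation."""
    N = len(Y)
    mean = sum(Y) / N
    var = sum((y - mean) ** 2 for y in Y) / N          # sigma_y^2, Eq. (1.16.3)
    S2 = sum((y - mean) ** 2 for y in Y) / (N - 1)     # S_y^2, Eq. (1.16.2)
    return mean, var, S2, var ** 0.5, var ** 0.5 / mean  # last entry is C_y

Y = [9, 11, 13, 16, 21]                  # hypothetical population values
mean, var, S2, sd, cv = pop_moments(Y)
print(mean, var, S2)                     # 14.0 17.6 22.0
```

Here 17.6 = (4/5) × 22, confirming that the population variance shrinks the mean square by the factor (N-1)/N.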
Note that the standard error is a statistic, whereas the standard deviation is a parameter.

1.22 ESTIMATOR

A statistic θ̂t obtained from the values in the sample s is also called an estimator of the population parameter θ. Note that the notations θ̂t, θ̂, and θ̂n have the same meaning. For example, the notations ybar_t, ybar, and ybar_n have the same meaning, and sy² and sy(t)² have the same meaning. We choose according to our requirements for a given topic or exercise.

1.23 ESTIMATE

Any numeric value obtained from the sample information is called an estimate of the population parameter. It is also called a statistic.

1.24 SAMPLE SPACE

A sample space is the set of all possible values of a variable of interest. It is denoted by Ψ or S. For example, if we toss a pair of fair coins, each having two faces, then the sample space will consist of all 4 possible outcomes: Ψ = {HH, HT, TH, TT}. A pictorial representation of this sample space (a tree diagram for the experiment 'Toss two coins', giving 2 × 2 = 4 outcomes) is given in Figure 1.24.1.

1.41.4 EFFICIENCY

Before defining the term efficiency, we shall discuss two more terms, viz., the variance and the mean square error of an estimator.

1.41.4.1 VARIANCE

The variance of an estimator θ̂ of a population parameter θ is defined as

V(θ̂) = E[θ̂ - E(θ̂)]².    (1.41.7)

It is generally denoted by the symbol σ_θ̂².

1.41.4.2 MEAN SQUARE ERROR

The mean square error (MSE) of an estimator θ̂ of a parameter θ is defined as

MSE(θ̂) = E[θ̂ - θ]² = V(θ̂) + {B(θ̂)}²,    (1.41.8)

where B(θ̂) denotes the bias in the estimator θ̂ of θ. Evidently, if B(θ̂) = 0 then MSE(θ̂) = V(θ̂). Thus if θ̂1 and θ̂2 are two different estimators of the parameter θ, then θ̂1 is said to be more efficient than θ̂2 if and only if MSE(θ̂1) < MSE(θ̂2).
Sr. No.  Name   Marks  Major
 1       Ruth    92    Math
 2       Ryan    97    Math
 3       Tim     68    English
 4       Raul    62    Math
 5       Marla   97    English
 6       Erin    68    Math
 7       Judy    76    English
 8       Troy    75    English
 9       Tara    51    Math
10       Lisa    94    Math
11       John    70    Math
12       Cher    89    English
13       Lona    62    Math
14       Gina    63    Math
15       Jeff    48    Math
16       Sara    97    Math

1. Compute the following parameters: ( a ) Population mean; ( b ) Population variance; ( c ) Population standard deviation; ( d ) Population mean square error; ( e ) Population coefficient of variation.
2. ( a ) Select an SRSWOR sample of 4 units using the Random Number method; ( b ) Estimate the population mean and population total; ( c ) Compute the variance of the estimator of the population mean; ( d ) Estimate Sy²; ( e ) Estimate the variance of the estimator of the population mean; ( f ) Construct a 95% confidence interval for the population mean assuming that the population mean square is known and the sample size is large. Does the population mean fall in it? Interpret it; ( g ) Construct a 95% confidence interval assuming that the population mean square is unknown and the sample size is small. Does the population mean fall in it? Interpret it.
3. ( a ) Compute the population proportion of students majoring in English; ( b ) Estimate the proportion majoring in English on the basis of the above sample; ( c ) Compute the variance of the estimator of the population proportion; ( d ) Estimate the variance of the estimator of the population proportion; ( e ) Construct a 95% confidence interval for the proportion.

Sy² = [95559 - (1209)²/16]/(16 - 1) = 280.26 (parameter).

Solution. We have

Name   Yi    Yi²
Ruth   92    8464
Ryan   97    9409
Tim    68    4624
Raul   62    3844
Marla  97    9409
Erin   68    4624
Judy   76    5776
Troy   75    5625
Tara   51    2601
Lisa   94    8836
John   70    4900
Cher   89    7921
Lona   62    3844
Gina   63    3969
Jeff   48    2304
Sara   97    9409
Sum   1209  95559

From the population information we have N = 16, Σ Yi = 1209 and Σ Yi² = 95559.

1.
( a ) Population mean: N If: _ ._ I 1209 Y =.!.=L =-- =75.56 (parameter) . N 16 ( b ) Population variance: ( N )2If: IJj2 _ ~ 95559- (1209f (j2 = H N = 16 =262.75 (parameter). y N 16 ( c ) Population standard deviation: (jy =g =.J262.75 =16.20 (parameter) . ( d ) Population mean square error: ( N )2If:N . I IJj2 _ ~ 52 = ;=1 N y N-l 69. 44 Advanced sampling theory with applications (e) Population coefficient of variation: We are using SRSWOR sampling, so S fS2 ~ C - ---.X.. - _v_0yy_ _ ,, 280.26 _ 16.74 - 0 2215 ( t )v - - - - - - -. parame er . . Y Y 75.56 75.56 2. ( a ) Selection of 4 units using SRSWOR sampling ( 1/ = 4 ): Let us start with 1st row and 6th column of the Pseudo-Random Number Table I given in the Appendix. Random Decision: ... Name of the Nu mber R - Rejection, S -- Selection selected student 62 R 77 R 92 R 67 R 53 R 51 R 33 R 07 S Judy 62 R 69 R 76 R 48 R 50 R 88 R 37 R 72 R 63 R 21 R 33 R 25 R 76 R 09 S Tara 43 R 80 R 94 R 62 R 68 R IS S Jeff 42 R 93 R 29 R 01 S Ruth 70. Chapter 1: Basic concepts and mathematical notation 45 So our SRSWOR sample consists of four students ={Judy, Tara, Jeff, Ruth} . Now from the sample we have the following information: I CY c~ame I ~f+"c Yi cCCy y; '~::c Judy 76 5776 Tara 51 2601 Jeff 48 2304 Ruth 92 8464 Sum 267 19145 Thus 11 II= 4 , IYi = 267 and i=1 ( b ) Sample mean: 11 2 IYi = 19145. i=1 11 I y - i-I 267 (stati . )Yt = --- = - = 66.75 statistic II 4 which is an estimate of population mean. N Note that an estimator ofpopulation total Y = If; will be given by i= Yt = N Yt = 16x66.75 = 1068 (statistic). 19145- (267f ___----""4_ = 440.91 (statistic). 4-1II-I ( c ) The variance of the sample mean estimator is given by V(Yt) = (N-IIJS; = (16-4JX280.26 = 52.548 (parameter). Nil 16x4 ( d ) An estimator of Sf, is given by ( I.Yi J2 n 2 i=1 I Yi - -"'----~- 2 i=1 II Sy = (e) An estimator of the variance of the estimator of the population mean is ;;(Y/)= (N -IIJs2 = (16-4JX440.91 = 82.67 (statistic). 
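The part 1 computations above can be cross-checked with a short script (illustrative Python, not part of the text), starting from the marks column of the population table:

```python
marks = [92, 97, 68, 62, 97, 68, 76, 75, 51, 94, 70, 89, 62, 63, 48, 97]
N = len(marks)
total = sum(marks)                          # 1209
sum_sq = sum(y * y for y in marks)          # 95559

Ybar = total / N                            # 75.56
sigma2 = (sum_sq - total ** 2 / N) / N      # 262.75   (population variance)
S2 = (sum_sq - total ** 2 / N) / (N - 1)    # 280.26   (population mean square)
sigma = sigma2 ** 0.5                       # approx 16.2
Cy = S2 ** 0.5 / Ybar                       # approx 0.2215
print(round(Ybar, 2), round(sigma2, 2), round(S2, 2))
```

The computational forms used here, Σ Yi² - (Σ Yi)²/N divided by N or by N - 1, agree term by term with the defining formulas (1.16.2) and (1.16.3).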
Nil y 16x4 ( f) Here 95% confidence interval is given by Yt I.96~V(yt), or 66.75 I.96~52.548, or 66.7514.20, or [52.55, 80.20] . Yes, the true population mean Y= 75.26 lies in the 95% confidence interval estimate. The interpretation of95% confidence interval is that we are 95% sure that the true mean lies in these two limits of this interval estimate. Note that interval estimate is a statistic. 71. 46 Advanced sampling theory with applications ( g ) Here 95% confidence interval estimate is given by Yt la/2 (df = n -1).jv(Yt), or 66.75 10.025(df = 3}.j82.67 , or 66.753.182~82.67, or 66.7528.93, or [37.82,95.68] where la/2(df = n -1) = 10.025(df = 3) = 3.182 is taken from Table 2 of the Appendix. Yes, again the true population mean lies in this 95% confidence interval and its interpretation is same as above. Again note that interval estimate is a statistic. 3. ( a) Let us give upper case 'FLAG' of 1 to English majors and 0 to Math major students in the whole population, then we have o 2 3 4 5 6 7 8 9 10 11 12 13 14 15 R an Tim Raul Marla Erin Jud Tro Tara Lisa John Cher Lona Gina Jeff Math En lish Math En lish Math En ish En lish Math Math Math En lish Math Math Math o o o o o o o o o 16 Population Proportion: N L:FLAGi . . . p= i=l = No.ofstudents wIthenghsh maJor =~=O.3125 (parameter). N Total No.of Students 16 72. Chapter I: Basic concepts and mathematical notation 47 ( b ) Let us now give the same lower case 'flag' to students in the sample. r-: ":', Judy English I Tara Math 0 Jeff Math 0 Ruth Math 0 :< J/>",Attt:,:;Y i:":,~ UIIl' I :~ ' " , ' ' >' "/ . Sample proportion: The sample proportion is given by n ~:tlagi A i=1 1 025p=--=-= . . n 4 ( c ) Variance of the estimator ofproportion under SRSWOR sampling is given by (A) (N -n) ( ) (16-4) ( ) Vwor P =-(--)PI-P = ( )xO.3125x 1-0.3125 =0.0429. 
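The part 2 computations above (ybar_t = 66.75, the estimate of Sy², v̂(ybar_t) ≈ 82.67, and the t-based interval of part (g)) can be reproduced as follows (illustrative Python, not part of the text):

```python
sample = [76, 51, 48, 92]       # marks of Judy, Tara, Jeff, Ruth
N, n = 16, len(sample)

ybar = sum(sample) / n                                  # 66.75
s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)     # approx 440.92
v_hat = (N - n) / (N * n) * s2                          # approx 82.67
t3 = 3.182                                              # t(0.025, df = 3), from Table 2
lower, upper = ybar - t3 * v_hat ** 0.5, ybar + t3 * v_hat ** 0.5
print(round(ybar, 2), round(s2, 2), round(v_hat, 2))
print(round(lower, 2), round(upper, 2))                 # approx [37.82, 95.68]
```

Replacing t3 by 1.96 and v_hat by (N-n)/(Nn) × Sy² with the known Sy² = 280.26 reproduces the part (f) interval in the same way.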
nN-l 4x16-1 (d) An estimator of variance of the estimator ofproportion under SRSWOR sampling is given by A ( A) (N - n) A( A) (16- 4) 0 2 ( ) 0 0 68vwor p = ( ) p 1- p = ( ) x . 5x 1- 0.25 = . 4 7 . N n-l 164-1 ( e ) A 95% confidence interval estimate for the true population proportion is p 1.96~vwor(P) or 0.25 1.96'/0.04687 , or 0.250.424 , or [- 0.174, 0.674] , or [0.0, 0.674] . Note that a proportion can never be negative, so lower limit has been changed to O. Caution! It must be noted that we have here a very small sample, but in practice when we deal with the problem of estimation of proportion, the minimum sample size of 30 units is recommended from large populations. Note that instead of using 'FLAG' or 'flag', sometimes we assign codes 0 or I directly to the variable Yor X. Example 1.57.3. For the population considered in the previous example: ( a ) John considers a sampling scheme consisting of only 4 samples as follows. f ~l11ple '. ~Nu~ber ' Cher, John, Marla, Sara 0.25 2 3 4 Erin, Jud ,Raul, Tara 0.25 Gina, Lisa, Ruth, Tim 0.25 Jeff, Lona, Ran, Tro 0.25 73. 48 Advanced sampling theory with applications ( b ) Mike considers another sampling plan consisting of 13 samples each of 4 students as given below: ,, )" ."","1, _." .; ) 11':;,'0: "'1,' " ,~f') " " ,.,',.".. .. '" '.' "!'..,,; ,., ' . r:. ,;' ;.,., .'"lII,' ,"'!'c, 1 Cher, Erin, Gina, Jeff 1/13 2 Cher, Erin, Gina, John 1/13 3 Cher, Erin, Gina, Judy 1/13 4 Cher, Erin, Gina, Lisa 1/13 5 Cher, Erin, Gina, Lona 1/13 6 Cher, Erin, Gina, Marla 1/13 7 Cher, Erin, Gina, Raul 1/13 8 Cher, Erin, Gina, Ruth 1/13 9 Cher, Erin, Gina, Ryan 1/13 10 Cher, Erin, Gina, Sara 1/13 II Cher, Erin, Gina, Tara 1/13 12 Cher, Erin, Gina, Tim 1/13 13 Cher, Erin, Gina, Troy 1/13 Let Yt be an estimator of the population mean. Find the following for each one of the above sampling schemes: E(Yt); V(Yt) ; B(Yt); and MSE(Yt) Comment on the statement, 'Mike's sampling scheme is better than John's sampling scheme'. 
Justify your logic and discuss the relative efficiency. Solution. ( a ) John's sampling plan: 161.03 127.91 13.61 25.60 .,328.18 74. Chapter I: Basic concepts and mathematical notation 49 (_) ~ {_ -}2 1 MSE Yt = L.Pt Yt - Y =- x328 .18=82.045 . t=4 ( b ) Mike's sampling plan: Samole I! ,;i; ~,'sl~~t?t~~~~~,; ,;0:,:",{!'!';~! ";:i{I~~(J?:;){~J 1j't;:iWt ~~F:.1;";;xr,,)[:' 1 0:';~Di~;::i; s:s Zj.'-!;I,Z! ';; 1 89 68 63 48 67.0 1/13 49.269600 73.3164 10 2 89 68 63 70 72.5 1/13 2.308062 9.378906 3 89 68 63 76 74.0 1/13 0.000370 2.441406 4 89 68 63 94 78.5 1/13 20.077290 8.628906 5 89 68 63 62 70.5 1/13 12.384990 25.628910 6 89 68 63 97 79.3 1/13 27.360950 13.597660 7 89 68 63 62 70.5 1/13 12.384990 25.628910 8 89 68 63 92 78.0 1/13 15.846520 5.941406 9 89 68 63 97 79.3 1/13 27.360950 13.597660 10 89 68 63 97 79.3 1/13 27.360950 13.597660 11 89 68 63 51 67.8 1/13 39.303250 61.035160 12 89 68 63 68 72.0 1/13 4.077293 12.691410 13 89 68 63 75 73.8 1/13 0.072485 3.285 156 ii.;;j' !iir;;:i;.~0:ii:i .',! !Surn 962:3 .".lift,!': ; 3;237.80770.0.' 268:769560 where Y = 75.56 . Thus we have _ 13 _ 1 E(Yt )= 'LPtYt = - x962 .3 = 74.02 , t= 1 13 and B(Yt) = E(yt )- Y= 74.02 -75.56 = - 1.54 , V(Yt) = I Pt{Yt - E(Yt)}2 = ~x 237.8077 = 18.2929, t=13 and MSE(Yt) = Ipt ~t - Y}2 = ~x 268.76956 = 20.675. t= 13 Although John 's sampling scheme is less biased, it has too much mean square error compared to Mike's sampling scheme. Thus we shall prefer Mike' s sampling scheme over John's sampling scheme. Also note that the relative efficiency of Mike's sampling scheme over John 's sampling scheme is given by MSE(- ) RE = Yt John X 100 = 82.045 X 100 = 396.83% . MSE(Yt)Mike 20.675 Thus one can say that Mike's sampling plan is almost four times more efficient than John 's sampling scheme. 75. 
50 Advanced sampling theory with applications 1.58 RELATIVESTANDARDERROR The relative standard error of an estimator eof population parameter e is defined as the positive square root of the relative variance of the estimator e. Mathematically RSE(e)=~Rv(e) (1.58 .1) where Rv(e)=v(e)/[E(e)f denotes the relative variance of the estim ator e.The another famous name for relative standard error is coefficient of variation. 1.59.AUXILIARYINFORMATION. In many sample surveys, it is possible to collect information abou t some variable(s) in addition to the variable of interest or study variable. The auxiliary information is accurately known from many sources like reference books, journals, administrative records etc. and is cheaper to obtain than the study variable. For example, while estimating the average income of people living in a particular city, the plot area owned by individual may be known from some published sources. Later on we will observe that the known auxiliary information is also helpful in increasing the efficiency of the estimators. Before dealing with two variables, we should be familiar with the following terms. If lj and X i denote the values of /" unit for the study variable Y and auxiliary variable X , then we have : ( a ) The covariance between X and Y is Cov (X, Y) = E[X - E(X )][Y - E(Y)]=E(xr)- E(X )E(Y ) . (1.59 .1) The covariance between X and Y is same as that between Y and X, i.e., Cov(X ,Y)= Cov(Y,X). For example, for SRSWOR sampling, the covariance between X and Y is given by I N( -X -)S xy=-- IXi-X Yi-Y , N -I i=1 - I N - IN where X = - I X i and Y = - I Yi . Note that an unbiased estimator Sxy is given N i= N i=1 by I n ( -X -)sXY = - -1 I Xi - X Yi - Y n - i= I n I n where x =- IXi and y = - I Yi 11 i=l 11 i=1 ( b ) The population correlation coefficient between X and Y is defined as Cov(X,Y) Pxy = ~v(x)~v(y) ' (1.59 .2) For simple random sampling it is given by 76. 
ρxy = Sxy / √(Sx² Sy²).    (1.59.3)

A biased estimator of the correlation coefficient ρxy is defined as

rxy = sxy / √(sx² sy²).    (1.59.4)

The value of ρxy (or rxy) is a unit free number and it lies in the interval [-1, +1]. It is also independent of change of origin and scale of the variables X and Y. The linear relationship can also be seen with the help of scatter diagrams as follows:

Fig. 1.59.1 Scatter plots. [Six panels: ρxy > 0 -- as X increases Y also increases (positive relationship); ρxy < 0 -- as X increases Y decreases (negative relationship); ρxy = +1 -- as X increases Y also increases and all points lie on a straight line (perfect positive relationship); ρxy = -1 -- as X increases Y decreases and all points lie on a straight line (perfect negative relationship); ρxy = 0 -- as X increases Y may increase or decrease (no relationship); ρxy may be positive, negative, or zero -- as X increases Y first increases and then decreases, so the sign of the relationship is not sure.]

Note that a similar scatter plot can be made from sample values to find the sign of the sample correlation coefficient rxy.

( c ) The population regression coefficient of X on Y is defined as

β = Cov(X, Y) / V(X).    (1.59.5)

For simple random sampling, it is given by

β = Sxy / Sx².    (1.59.6)

A biased estimator of β is given by

b = sxy / sx²,    (1.59.7)

which in fact represents the change in the study variable Y with a unit change in the auxiliary variable X. Note that the sign of β (or b) is the same as that of ρxy (or rxy).

Example 1.59.1. Consider the following population consisting of five (N = 5) units A, B, C, D, and E, where for each unit in the population two variables Y and X are measured.

Units  A   B   C   D   E
Yi     9  11  13  16  21
Xi    14  18  19  20  24

Find the following parameters:
( a ) Ybar, Xbar, Sx², Sy², Sxy, ρxy and β.
Chapter 1: Basic concepts and mathematical notation 53 ( b ) Select all possible SRSWOR samples of n = 3 units. Show that y , x, s.;, s;,S xy are unbiased for Y , X, S; , S;,Sxy, but rxy and b remain biased estimators of P xy and fJ respective ly. ( c ) Compute Cov(y, r) by using definition . ( d ) Compute (1-f) Sxy and comment on it. n Solution. From the complete population information, we have Units -1'j Xj (~. ~ rr (Xi -x:)1"",( -)2 (Xi -x)2 (Y; -rXXi-X)I '~. Yi ;-;X: A 9 14 -5 -5 25 25 25 B 11 18 -3 -1 9 1 3 C 13 19 -1 0 1 0 0 D 16 20 2 1 4 1 2 E 21 24 7 5 49 25 35 Sum "'-70 95 0 O ~ I ~ c 88 , 52 ,Ye Co 65 "-"""~ N N ( a ) From the above table we have LYi =70, LXi =95, so that ;=1 ;= - I N I - I N I Y =-LY;=- x70 =14 , and X=-LX;=- x95 =19 . N ;=1 5 N i=1 5 From the above table, I(Y; - rl = 88, I{x; - xl = 52 and ~(y; -rXx; - x)= 65, ;;) ;; ;= so that 2 2 2 I N ( -) 88 2 I N ( - ) 52 Sy = - - L Y; - Y = - - = 22 , Sx =- - LX; - X = - = 13 , N - I ;= 5-1 N-I;=I 5-1 I N( -X -) 65 Sxy 16.25 Sxy =--LY;-Y X i - X = -=16.25 , Pxy=g= r.;:;-;:;:=0.960 , N - I;=I 5-1 S2S2 ,,13 x22 x y and fJ = Sxy = 16.25 = 1.25 . S2 13 x (b ) Here we have N = 5 and n = 3, so the total number of possib le SRSWOR samples will be 5C3 = 10. Units A B c Sum Continued . 79. 54 Advanced sampling theory with applications Units Sample'z " .,' >"'~ A 9 14 -3 -3.34 9 1I.I6 10.02 B II 18 -I 0.67 I 0.45 -0.67 D 16 20 4 2.67 16 7.13 10.68 Sum 36 '52 0 0.00 26 ,,-1 8.73 20.03 I ~ Units , Sample 3 , ';&"r. A 9 14 -4.67 -4.67 21.81 21.81 21.81 B I I 18 -2.67 -0.67 7.13 0.45 1.79 E 21 24 7.33 5.34 53.73 28.52 39.14 Sum-ll 41' 56. . ~0.01 0.00; ' 82.67 '50.77 ,,,, ;,, 62.74,'; Units .z : Sample 4- ' ii" .' "';~;;,I,; ;,j ' j A 9 14 -3.67 -3.67 13.44 13.44 13.44 C 13 19 0.33 1.33 0.11 1.78 0.44 D 16 20 3.33 2.34 1I.I 1 5.48 7.80 t~ Sum 1':515' ,; w 53 ~ 1.,0:00 24;67J' ;>20!70.a ~i i rI 21~69, Units I f ' ",;' ';} 'fi. Sample '5~j~ i" ,II' ,;: Q / (rp2P) = 0.9036 = 3749.4 '" 3750. 
Y Y 0.052 x 0.0964 Thus a minimum sample of size II = 3750 fish is required to attain 5% relative standard error of the estimator ofpopulation proportion under SRSWR sampling. Example 2.3.2. A fisherm an visited the Atlantic and Gulf coast and caught 4000 fish one by one . He noted the species group of each fish caught by him and put back that fish in the sea before making the next catch. He observed that 400 fish belong to the group Herrings. ( a ) Estimate the proportion of fish in the group Herrings livin g in the Atlantic and Gulf coast. ( b ) Construct the 95% confidence interval. Solution. We are given 11 =4000 and r = 400 . ( a) An estimate of the proportion of the fish in the Herrings group is given by P =!- = 400 = 0.1. y 1/ 4000 (b ) Under SRSWR sampling an estim ate of the v(p) is given by v(p ,)= P/ly = 0.1x 0.9 = 2.2505x 10-5. ) 11 -1 4000 -1 A (1- a)100% confidence interval for the true proportion Py is given by Py +Za/2~V( Py ) . Thus the 95% confidence interval for the proportion of fish belonging to the Herrings group is given by Py +1.96~v( Py ) , or 0.1+1.96b.2505 x 10-5 , or [0.0907, 0.1092]. Example 2.3.3. The height y of plants in a field is uniformly distributed between 5 cm to 20 em with the probability density function I j(y) = - V 5 < y < 20 . 15 We wish to estimate the proportion of plants with height more than 15 cm, what is the minimum required sample size II to have an accuracy of relative standard error of45%? 123. 98 Advanced sampling theory with applications ( a ) Select a sample of the required size, and estimate the proportion of plants with height more than 15 cm. ( b ) Construct a 95% confidence interval estimate, assuming that your sample size is large, and interpret your results. Solution. 
We know that if $y$ has the uniform distribution, its density function is
$$f(y) = \frac{1}{b-a}, \quad a < y < b .$$
Thus the proportion of plants with height more than 15 cm is given by
$$P_y = \int_{15}^{20} f(y)\,dy = \int_{15}^{20}\frac{1}{15}\,dy = \frac{1}{15}(20-15) = \frac{5}{15} = 0.3333 ,$$
and the variance $\sigma_y^2 = P_y(1-P_y) = 0.3333(1-0.3333) = 0.2222$.
( a ) We need $\psi = 0.45$; thus the required minimum sample size is given by
$$n \ge \frac{\sigma_y^2}{\psi^2 P_y^2} = \frac{0.2222}{0.45^2 \times 0.3333^2} = 9.8 \approx 10 .$$
( b ) We select a with-replacement sample of $n = 10$ units as follows. The cumulative distribution function (c.d.f.) is given by
$$F(y) = P[Y \le y] = \int_5^y \frac{1}{15}\,dt = \frac{y-5}{15},$$
which implies that $y = 15F(y) + 5$. We used the 4th to 6th columns (multiplied by $10^{-3}$) of the Pseudo-Random Number (PRN) Table 1 given in the Appendix to select ten values of $F(y)$, and the sampled values were computed from the relationship $y = 15F(y) + 5$ as follows:

$F(y)$:     0.954  0.183  0.448  0.171  0.567  0.737  0.856  0.233  0.895  0.263
$y$:        19.31  7.75   11.72  7.57   13.51  16.06  17.84  8.50   18.43  8.95
$y > 15$?:  Yes    No     No     No     No     Yes    Yes    No     Yes    No

Thus an estimate of the proportion $P_y$ is given by
$$\hat{p}_y = \frac{\text{number of 'Yes' answers}}{\text{sample size}} = \frac{4}{10} = 0.4 ,$$
and an estimate of its variance is given by
$$\hat{v}(\hat{p}_y) = \frac{\hat{p}_y(1-\hat{p}_y)}{n-1} = \frac{0.4(1-0.4)}{10-1} = 0.0267 .$$
Thus a 95% confidence interval estimate of the required proportion $P_y$ is given by $\hat{p}_y \pm 1.96\sqrt{\hat{v}(\hat{p}_y)}$, or $0.4 \pm 1.96\sqrt{0.0267}$, or $[0.0797,\ 0.7202]$.

Case II. When a sample is drawn using SRSWOR sampling, we have the following theorems.

Theorem 2.3.6. An unbiased estimator of the population proportion $P_y$ is given by
$$\hat{p}_y = \frac{r}{n},$$
where $r$ is the number of sampled units possessing the attribute $A$.
Proof. Obvious.

Theorem 2.3.7. The variance of the estimator $\hat{p}_y$ is given by
$$V(\hat{p}_y) = \frac{(N-n)}{n(N-1)}P_yQ_y .$$
Proof. We know that
$$V(\bar{y}_n) = \frac{(N-n)}{Nn}S_y^2, \quad \text{where} \quad S_y^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N}Y_i^2 - N\bar{Y}^2\right).$$
Again we define $Y_i = 1$ if the $i$th unit possesses the attribute $A$ and $Y_i = 0$ otherwise, so that $Y_i^2 = Y_i$. Then
$$S_y^2 = \frac{1}{N-1}\left(NP_y - NP_y^2\right) = \frac{N}{N-1}\left(P_y - P_y^2\right) = \frac{NP_yQ_y}{N-1} = S_p^2 .$$
Hence we have
$$V(\hat{p}_y) = \frac{N-n}{Nn}S_p^2 = \frac{N-n}{Nn}\times\frac{NP_yQ_y}{N-1} = \frac{(N-n)P_yQ_y}{n(N-1)},$$
which proves the theorem.

Theorem 2.3.8. An unbiased estimator of the variance $V(\hat{p}_y)$ is given by
$$\hat{v}(\hat{p}_y) = \frac{(N-n)}{N(n-1)}\hat{p}_y\hat{q}_y .$$
Proof. We shall prove that $E[\hat{v}(\hat{p}_y)] = V(\hat{p}_y)$, that is,
$$E\left[\frac{(N-n)}{N(n-1)}\hat{p}_y\hat{q}_y\right] = \frac{(N-n)}{n(N-1)}P_yQ_y .$$
Now we know that $\frac{(N-n)}{Nn}s_y^2$ is an unbiased estimator of $\frac{(N-n)}{Nn}S_y^2$. Defining $Y_i = 1$ if the $i$th population unit $\in A$ and $Y_i = 0$ otherwise (so that $Y_i^2 = Y_i$) makes
$$\frac{(N-n)}{Nn}S_y^2 = \frac{(N-n)}{n(N-1)}P_yQ_y .$$
Similarly, defining $y_i = 1$ if the $i$th sampled unit $\in A$ and $y_i = 0$ otherwise (so that $y_i^2 = y_i$), the sample variance
$$s_y^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}y_i^2 - n\bar{y}_n^2\right]$$
reduces to
$$s_p^2 = \frac{1}{n-1}\left[n\hat{p}_y - n\hat{p}_y^2\right] = \frac{n}{n-1}\left(\hat{p}_y - \hat{p}_y^2\right) = \frac{n\hat{p}_y(1-\hat{p}_y)}{n-1} = \frac{n\hat{p}_y\hat{q}_y}{n-1} .$$
Therefore
$$\frac{(N-n)}{Nn}s_p^2 = \frac{(N-n)}{Nn}\cdot\frac{n\hat{p}_y\hat{q}_y}{n-1} = \frac{(N-n)}{N(n-1)}\hat{p}_y\hat{q}_y .$$
Hence the theorem.

Theorem 2.3.9. Under SRSWOR sampling, while estimating the population proportion, the minimum sample size required to attain a relative standard error (RSE) of at most $\psi$ is
$$n \ge \frac{NQ_y}{\psi^2(N-1)P_y + Q_y} .  \tag{2.3.6}$$
Proof. The relative standard error of the estimator $\hat{p}_y$ is given by
$$\mathrm{RSE}(\hat{p}_y) = \frac{\sqrt{V(\hat{p}_y)}}{E(\hat{p}_y)} = \sqrt{\frac{(N-n)P_yQ_y}{n(N-1)P_y^2}} = \sqrt{\frac{(N-n)Q_y}{n(N-1)P_y}} .  \tag{2.3.7}$$
Note that we need an estimator $\hat{p}_y$ such that $\mathrm{RSE}(\hat{p}_y) \le \psi$, which implies
$$\frac{(N-n)Q_y}{n(N-1)P_y} \le \psi^2, \quad \text{i.e.,} \quad NQ_y \le n\left\{\psi^2(N-1)P_y + Q_y\right\},$$
so that $n \ge NQ_y/\{\psi^2(N-1)P_y + Q_y\}$. Hence the theorem.

Example 2.3.4. We wish to estimate the proportion of fish in the group Herrings among the fish caught by marine recreational fishermen on the Atlantic and Gulf coasts. There are 30,027 such fish out of a total of 311,528 fish caught during 1995, as shown in population 4 in the Appendix.
What is the minimum number of fish to be selected by SRSWOR sampling to attain a relative standard error of 5%?
Solution. We have $P_y = \frac{30027}{311528} = 0.0964$ and $Q_y = 1 - P_y = 1 - 0.0964 = 0.9036$. Thus for $\psi = 0.05$ we have
$$n \ge \frac{NQ_y}{\psi^2(N-1)P_y + Q_y} = \frac{311528 \times 0.9036}{0.05^2(311528-1)\times 0.0964 + 0.9036} = 3704.8 \approx 3705 .$$
Thus a minimum sample of size $n = 3705$ fish is required to attain a 5% relative standard error of the estimator of the population proportion under SRSWOR sampling.

Example 2.3.5. A fisherman visited the Atlantic and Gulf coast and caught 4000 fish. He noted the species group of each fish caught. He observed that 400 fish belonged to the group Herrings.
( a ) Estimate the proportion of fish in the group Herrings living on the Atlantic and Gulf coast.
( b ) Construct the 95% confidence interval.
Given: total number of fish living on the coast = 311,528.
Solution. We are given $N = 311{,}528$, $n = 4{,}000$ and $r = 400$.
( a ) An estimate of the proportion of fish in the Herrings group is
$$\hat{p}_y = \frac{r}{n} = \frac{400}{4000} = 0.1 .$$
( b ) Under SRSWOR sampling, an estimate of $V(\hat{p}_y)$ is given by
$$\hat{v}(\hat{p}_y) = \frac{(N-n)}{N}\cdot\frac{\hat{p}_y\hat{q}_y}{n-1} = \frac{(311528-4000)\times 0.1 \times 0.9}{311528 \times (4000-1)} = 2.2216\times 10^{-5} .$$
A $(1-\alpha)100\%$ confidence interval for the true proportion $P_y$ is given by $\hat{p}_y \pm z_{\alpha/2}\sqrt{\hat{v}(\hat{p}_y)}$. Thus the 95% confidence interval for the proportion of fish belonging to the Herrings group is given by $0.1 \pm 1.96\sqrt{2.2216\times10^{-5}}$, or $[0.0908,\ 0.1092]$.

Example 2.3.6. In a field there are 1,000 plants, and the distribution of their height is given by the probability mass function

y (cm):   50    100   150   200   225   275
p(y):     0.1   0.2   0.3   0.1   0.2   0.1

( a ) Select a random sample of $n = 10$ units and estimate the proportion of plants with height greater than or equal to 225 cm.
( b ) Construct a 95% confidence interval, assuming that it is a large sample.
Solution. The cumulative distribution function $F(y)$ is given by

y (cm):   50    100   150   200   225   275
F(y):     0.1   0.3   0.6   0.7   0.9   1.0
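The tabulated c.d.f. is all that is needed to draw the with-replacement sample: take a random number $u$ in $(0, 1)$ and select the smallest $y$ whose cumulative probability reaches $u$ (the cumulative-total method). A sketch in Python, using Python's built-in generator in place of the book's pseudo-random number table (the seed is arbitrary):

```python
import random

# p.m.f. of plant height (cm) from Example 2.3.6.
pmf = {50: 0.1, 100: 0.2, 150: 0.3, 200: 0.1, 225: 0.2, 275: 0.1}

def draw(u, pmf):
    """Cumulative-total method: smallest y whose c.d.f. value reaches u."""
    cum = 0.0
    for y, p in sorted(pmf.items()):
        cum += p
        if u <= cum:
            return y
    return max(pmf)  # guard against floating-point shortfall for u near 1

random.seed(1)  # arbitrary seed; the book drew from its PRN table instead
sample = [draw(random.random(), pmf) for _ in range(10)]
p_hat = sum(y >= 225 for y in sample) / len(sample)
v_hat = p_hat * (1 - p_hat) / (len(sample) - 1)  # SRSWR estimate of V(p_hat)
ci = (p_hat - 1.96 * v_hat ** 0.5, p_hat + 1.96 * v_hat ** 0.5)
print(p_hat, ci)
```

Since $P(y \ge 225) = 0.2 + 0.1 = 0.3$, the estimate $\hat{p}_y$ should fall near 0.3 for most seeds, and the SRSWR variance estimate $\hat{p}_y\hat{q}_y/(n-1)$ feeds directly into the 95% interval $\hat{p}_y \pm 1.96\sqrt{\hat{v}(\hat{p}_y)}$.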
$$E\left(\frac{1}{v}\right) = \sum_{j=1}^{N}\frac{j^{n-1}}{N^n} .  \tag{2.5.1.4}$$
Thus we have the following theorem.

Theorem 2.5.1.3. The variance of the estimator $\bar{y}_v$ is given by
$$V(\bar{y}_v) = \left[\sum_{j=1}^{N-1}\frac{j^{n-1}}{N^n}\right]S_y^2 .  \tag{2.5.1.5}$$
Proof. We have
$$V(\bar{y}_v) = \left[E\left(\frac{1}{v}\right) - \frac{1}{N}\right]S_y^2 = \left[\sum_{j=1}^{N}\frac{j^{n-1}}{N^n} - \frac{1}{N}\right]S_y^2 = \left[\frac{\sum_{j=1}^{N}j^{n-1} - N^{n-1}}{N^n}\right]S_y^2 = \left[\sum_{j=1}^{N-1}\frac{j^{n-1}}{N^n}\right]S_y^2 .  \tag{2.5.1.6}$$
Hence the theorem.

It is interesting to note that as the sample size $n$ drawn with SRSWR sampling approaches the population size $N$, the magnitude of the relative efficiency also increases. The reason for the increase in the relative efficiency may be that the increase in sample size also increases the probability of repetition of units in SRSWR sampling. The relative efficiency under the Feller (1957) distribution is given by
$$\mathrm{RE} = \frac{V(\bar{y}_n)}{V(\bar{y}_v)} = \frac{(N-1)N^{n-1}}{n\sum_{j=1}^{N-1}j^{n-1}},$$
which is free from any population parameter but depends upon the population size and the sample size. The following table shows the percent relative efficiency of the distinct-units estimator with respect to the estimator based on SRSWR sampling for different sample sizes $n$ and population size $N = 10$.

Benefit of the use of distinct units: values of $j^{n-1}$ for $N = 10$
Sample size (n):   2      3      4       5       6        7        8         9
j = 1              1      1      1       1       1        1        1         1
j = 2              2      4      8       16      32       64       128       256
j = 3              3      9      27      81      243      729      2187      6561
j = 4              4      16     64      256     1024     4096     16384     65536
j = 5              5      25     125     625     3125     15625    78125     390625
j = 6              6      36     216     1296    7776     46656    279936    1679616
j = 7              7      49     343     2401    16807    117649   823543    5764801
j = 8              8      64     512     4096    32768    262144   2097152   16777216
j = 9              9      81     729     6561    59049    531441   4782969   43046721
Sum                45     285    2025    15333   120825   978405   8080425   67731333
RE (%) = $(N-1)N^{n-1}\big/\big[n\sum_{j=1}^{N-1}j^{n-1}\big]\times 100$:
                   100.00 105.26 111.11  117.39  124.15   131.41   139.23    147.64

Corollary 2.5.1.2. An approximate expression for $V(\bar{y}_v)$, valid up to order $N^{-2}$, is given in Pathak (1962) as
$$V(\bar{y}_v) = \left[\frac{1}{n} - \frac{1}{2N} + \frac{(n-1)}{12N^2}\right]S_y^2 .  \tag{2.5.1.7}$$

Theorem 2.5.1.4.
(a) Show that an alternative estimator of the population mean based on distinct units is YI= E(v)Yv+ Xl1- E(V)) (2.5.1.8) where X is a good guess or a priori estimator of the population mean Y. ( b ) If there is no priori information or good guess about the population mean Y, then an alternative estimator of population mean is given by _ v_ Y2= E(v)Yv 1 v where Yv = - LYi is a mean of distinct units in the sample. v i=1 Proof. Let us consider that an estimator of the population mean is given by Ys = Ji(v)Yv+h (v) (2.5.1.10) 135. 110 Advanced sampling theory with applications where I] (v) and h (v) are suitably chosen constants such that ys is an unbiased estimator of Y and its variance is minimum. Now from the property of unbiasedness we have E(ys)= EUi(v)yv+h(v)]= fi(v)Y +h(v)= Y . (2.5.l.l1) This implies that h(v)= [1 - fi(v)]Y. (2.5.l.l2) Evidently the value of h (v) contains the unknown value Y, the exact value of h (v) is not known unless fi(v)=1, which implies I:(v) =O. Thus we chose fi(v) = 1, then h (v) = 0 , which means a better estimator of population mean Y v would be yy = v-I 2.:Yi . In practical situations, sometimes a priori information or i=1 knowledge ofX (say) is available about population mean Y from past surveys or pilot surveys . In such situations, the value of h (v) is given by h(v)= [1- fi(v)]X . (2.5.1.13) Thus if we will chose h (v) as given in (2.5.l.l3), then the bias in the estimator ys will be minimum. Unfortunately, I: (v) depends upon the value of II (v) too. The best method to chose I I(v) is such that the variance of ysis minimum. Now the variance of the estimator y is given bys V(Ys)= E]V2(Y.J+ V]E2(ys) = E]V2Ui(v)yv+ h(v)]+V2ElUi (v)Yv + h(v)] =E{fi2(v{~- ~)s;]+V2[fi(v)y +h(v)] = E{fi2(v{~ - ~)S;]+ V2[Y] =E{fi2(v{~- ~)s;l (2.5.l.l4) The variance of Ys will be minimum if E{fi2(v{~- ~)s; ] is minimum subject to the condition E]Ui(v)] = 1. Then by Schwartz inequality, we have E{fi2(V{~- ~)] ~E{~- ~) . 
In the above inequality, the equality sign holds if and only if fi(v )=U- ~)/EU- ~)=~(;~;~-:~)J' (2.5.l.l5) (2.5.l.l6) (2.5.l.l7) Thus if we have a priori information X about Y, then an optimum estimator of Y is given by - (Nv)/(N-v) - X[1 (Nv)/(N- v) ] Y] = E[(Nv)/(N- v)JYv + - E[(Nv)/(N - v)} . 136. Chapter 2: Simple Random Sampling III If no such information abou t Y is available, then we have X = 0 and the above estimator reduces to _ (Nv)/(N-v) _ Y2= E[(Nv)/(N _v)Vv . (2.5.1.18) Pathak (1961) has shown that E[( Nv )]=N2 I (PI2...III) (2.5.1 .19) N-v 111=1 N- 1Il where = 11-lIllJ(I-~J"+...+(-lr(IIlJ(l-~J"PI2....m 1 N III N o for III ~ II, otherwise. (2.5.1.20) The relation (2.5.1.20) shows that the estimators YI and Y2 given at (2.5.1.17) and (2.5.1.18) are very difficult to compute for large sample sizes. Now if we ignore the sampling fraction II/ N and hence v/N , then the above estimators, respectively, reduce to YI= E(v)Yv+Xl1- E(V)J (2.5.1.21) and _ v_ Y2= E(v)Yv. Hence the theorem. (2.5.1.22) (2.5.1.24) Theorem 2.5.1.5. Show that if the square of the population coefficient of variation Cf, = sf,/f2exceeds (II-I) , then the estimator Y2= (v/E(v))Yv is more effic ient than Yv' Proof. We know that V(Yv)=[E{~J- ~]sf, . (2.5.1.23) By the definition of variance we have V(Y2) = E,V2(Y2 Iv)+V,E2(Y2 Iv) = EIV2[ E(V)Yv IV]+ VIE2[ E(V)Yv IV] = E{{E(:)}2(~-~)s;]+ v{iJ>1 ~{E~:)J'[E(+ e(;')]+ l:.;j'VI(,) It is very easy to deri ve that 137. 112 Advanced sampling theory with applications E(V) =N[I- (I- ~rJ and E(V2 )=N[I- (I- ~r]+N(N-I{I-2(1- ~r+(1-~Jl therefore ( ) ( 1)11 ( 1)211 { 2)11V(v)=E v2 -{E(v}f =N I-fj _N 2 1- N +N(N- I- N From (2.5.1.24) we have _ S;(N-I) [f ( 1 )" ) f ( 1)" ( 2 )/1)]V(Y2) = 2 1 /I 2 f-1- N -f-2 1- -;; + I- fj N!I-(I- N)) y2 [ ( I )" ( I )211 { 2 )" ]+ II 2 N I- fj -N 21 -fj +N(N -I- N . N 2[1_(I_ ~) ] (2.5.1.25) Now from (2.5.1.23) and (2.5 .1.25), we have V()lv)- V(Y2 ) N - I ( ) l: J II -I _ S2 ) - 1 - Y Nil 2 - 2 = CISy -C2Y (say) . 
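The moments of the number of distinct units $v$ on which this section relies, $E(v) = N[1-(1-1/N)^n]$ and $V(v) = N(1-1/N)^n - N^2(1-1/N)^{2n} + N(N-1)(1-2/N)^n$, are easy to sanity-check for a toy case by enumerating every one of the $N^n$ equally likely with-replacement samples. A sketch ($N = 5$ and $n = 3$ are arbitrary small choices so full enumeration is instant):

```python
from itertools import product

def distinct_moments(N, n):
    """Exact E(v), E(v^2) by enumerating all N**n equally likely WR samples."""
    tot = tot2 = 0
    for s in product(range(N), repeat=n):
        v = len(set(s))  # number of distinct units in this sample
        tot += v
        tot2 += v * v
    return tot / N**n, tot2 / N**n

N, n = 5, 3
Ev, Ev2 = distinct_moments(N, n)
q, r = (1 - 1/N)**n, (1 - 2/N)**n
Ev_formula = N * (1 - q)
Vv_formula = N*q - N**2*q**2 + N*(N - 1)*r
print(round(Ev, 4), round(Ev_formula, 4))           # → 2.44 2.44
print(round(Ev2 - Ev**2, 4), round(Vv_formula, 4))  # → 0.3264 0.3264
```

The enumeration and the closed forms agree exactly, which makes the derivation of $V(v)$ above easy to trust before it is used in $V(\bar{y}_2)$.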
Now the estimator Y2 (2.5.1.26) IS better than Yv if v(Yv)- V()l2 ) (c2/cd. The approximate values of C) and C2 for large populations, correct up to terms of order N - 2 , arc given by C) =_1_+ 5(11- 1) and C, =(n- I) _ (n- IXn-2) 2nN 12nN2 2nN 3nN2 and thus, (C2 /C));:::(n-I) . Hence the theorem. Theorem 2.5.1.6. If squared error be the loss functio n then show that )Iv is adm issib le amongst all functio ns of )Iv and v . Proof. Let I = )I,. + / ()lv, v) be the function of )l and v . Suppose that the estimatorv I is uniformly better than )Iv . Suppose R(l)be the quadratic loss function for the 138. Chapter 2: Simple Random Sampling 113 estimator t. Then the estimator t will be uniformly better than the estimator Yv if R(r)~ vCYv), where V(y,.)= E(y,.- Y) . Also we have R(I) = E~ - rf= E~v - r+ jLY",v)f = E(Yv- yf + E{t(Yv,v)}2 + 2Ef (Yv'v)(yv- y). Thus R(r)~ vLYv) if E~v - rf+ E{jLYv,v)}2+ 2EjLYv,vxYv - y)~ E~v - yf (2.5.1.27) holds for all 1'[,Y2,..., YN . In particular, if 1'[ = Y2 = ... = YN = C (say), where C is an arbitrary cho sen constant, then the relation (2.5.1.27) implies that f(C, v) IS zero, which proves the theorem. 2.5.2 ESTIMATION OF FINITEPOPULATION VARIANCE Consider the problem of estim ation of 2 - I N ( - 2 " = N I Y; - YJ i=1 using distinct units in a sample of II units drawn by using SRSWR sampling. The usual estimator of 0 for i =1,2,..., v and 'I a()F (n - 2), a(jr::0, at))> 0 and i=1 )=! a(k)> 0, for k *' j *' j' = 1,2,...,v . Now if v> I then we have (2.5.2.7) (2.5.2.6) (2.5.2.5) (2.5.2.8) 1 II (n -2) ( 1j a(l) ( I ja(v) [ _. =. ]=Jl2 'I a(I)!a(Z)!....a(v)! IV ..... IV P xI - X(l)' Xz X(J)IT I 1 a(l) 1 a(v) 'II a(I)!a(z~;....a(viN j ....{ N j Pathak (1961) has shown some mathematical relations as: II n! C (n) a(I)!a(2)!.....a(v)! v and I II (n-2) = Cv(n)-Cv(n - I) a(1 )!a(2)I...a(v)! v(v- I) There fore we have [ I ] Cv(n)- Cv(n- I) . . P XI =X(i)' Xz =x(j) T = ( ) ( ) , l*' ) =1,2,...,v . 
Now if $v > 1$ then we have
$$E\left[\frac{(y_1 - y_2)^2}{2}\,\middle|\, T\right] = \sum_{i \ne j}\frac{\{y_{(i)} - y_{(j)}\}^2}{2}\,P\!\left[x_1 = x_{(i)},\, x_2 = x_{(j)} \mid T\right].$$
On substituting the value of $P[x_1 = x_{(i)}, x_2 = x_{(j)} \mid T]$ from (2.5.2.8), we obtain
$$E\left[\frac{(y_1 - y_2)^2}{2}\,\middle|\, T\right] = \frac{C_v(n) - C_v(n-1)}{C_v(n)}\cdot\frac{1}{2v(v-1)}\sum_{i \ne j}\{y_{(i)} - y_{(j)}\}^2 = \frac{C_v(n) - C_v(n-1)}{C_v(n)}\,s_d^2 .$$
Hence the theorem.

Example 2.5.2.1. We have selected an SRSWR sample of 20 units from population 1 by using the 3rd and 4th columns of the Pseudo-Random Numbers (PRN) given in Table 1 in the Appendix. The 20 states corresponding to the serial numbers 29, 14, 47, 22, 42, 23, 48, 06, 07, 42, 21, 31, 31, 36, 16, 27, 10, 18, 26 and 48 were selected in the sample. Later on we observed that the states at serial numbers 42, 31 and 48 had been selected more than once. We reduced our sample to the 17 distinct states and collected information about the real estate farm loans in these states. The data so collected are given below:
Here n = 20 and v= 17, and on the basis of distinct units information, we have 29 NH 6.044 -560.7820 314476.8000 14 IN 1213.024 646.1977 417571.5000 47 WA 1100.745 533.9187 285069 .2000 22 MI 323.028 -243.7980 59437.6100 42 TN 553.266 -13.5603 183.8816 23 MN 1354.768 787.9417 620852.1000 48 WV 99.277 -467.5490 218602.3000 06 CO 315.809 -251.0170 63009.6800 07 CT 7.130 -559.6960 313259.9000 21 MA 7.590 -559.2360 312745 .2000 31 NM 140.582 -426.2440 181684.2000 Continued...... 141. 116 Advanced sampling theory with applications 36 OK 612.108 45.2817 2050.4330 16 KS 1049.834 483.0077 233296.4000 27 NE 1337.852 771.0257 594480.6000 10 GA 939.460 372.6337 138855.9000 18 LA 282.565 -284 .2610 80804.4800 26 MT 292.965 -273 .8610 75000.0100 I*,],i" }'i'/"'" l'Sum';//':, " ro. '''ro.~ro. eQlilqS()I()OO()'t, ,/ ,,' /~ /d ( a ) An unbiased estimate of the average real estate farm loans in the United States is given by _ I v I 17 9636.047 y = - L y. = - L y . = = 566.826. v vi=1 1 17i=1 1 17 ( b ) An estimator of the finite population variance CT; based on distinct units information is given by 2 [ C)n-I)] 2 Sv = 1- C)n) sd' Now 2 1 V( _)2 3911380 sd = -(-) L y . - y = = 244461.25, v-I i=1 I v 17-1 and C (n) = vn_(v)(V_I)n+.oo+(_I)(V-I)(V )In. v I v-I C)n) = C I7(20) = 1720 - cl7(17_1)20 + C~7(17_2)20 - cj7(17- 3fO + CJ7(17_4)20 - cF(17_5)20 + C~7(17- 6)20 - cj7(17_7)20 + cJ7(17_8)20 - cr(17- 9)20 + c1J(17 _10)20 - cli(17_11)20 + cli(17_12)20 -clI(17_13)20 + cll(17_14)20 - clJ(17_ 15)20+ clJ(17_16)20 = 2.6366 x 1020 , and C)n-I)= CI7(19)= 1719 -CF(17-1)19 +Cf(17-2)19 -Cj7(17-3)19 +CJ7(17 -4)19 -CF(17-5)19 +C~7(17-6)19-Cj7(17-7)19 +CJ7(17- 8)19 -Cr(17-9)19 +cIJ(17-10)19 -cli(17 -11)19 +cl~(17-12)19 -cll(17 _13)19 +ClI(17 -14)19-Cll(17 -15)19 +ClJ(17 _16)19 = 4.4805x 1018. Hence an estimate of the finite population variance is given by 142. Chapter 2: Simple Random Sampling 117 S; = ( I _ 4.4805 x 10 18 ) x 244461.25 = 240307.004 . 
2.6366 x 1020 ( c ) From the sample information including repeated units, we have I ~ Random , State Real estateli' . c !){y( ~ ) ( -t ,;~, , t "" , (; Yi - Yn ~', ' Yi - Yn " 1i,ymEer farm loans, fYl"'-I '~t;;; ~ '~ ' 29 NH 6.044 -515.4150 265652.200 14 IN 1213.024 691.5654 478262.700 47 WA 1100.745 579.2864 335572.700 22 MI 323.028 -198.4310 39374.700 42 TN 553.266 31.8074 1011.711 23 MN 1354.768 833.3094 694404.600 48 WV 99.277 -422.1820 178237.300 06 CO 315.809 -205.6500 42291 .760 07 CT 7.130 -514.3290 264533 .900 21 MA 7.590 -513.8690 264060.900 31 NM 140.582 -380.8770 145067.000 36 OK 612.108 90.6494 8217.314 16 KS 1049.834 528.3754 279180.600 27 NE 1337.852 816.3934 666498.200 10 GA 939.460 418.0014 174725.200 18 LA 282.565 -238.8940 57070.150 26 MT 292.965 -228.4940 52209.330 42 TN 553.266 31.8074 1011.711 31 NM 140.582 -380.8770 145067.000 48 WV 99.277 -422.1820 178237.300 ,'" ',L,'" " "" "Sum 10.tl29.172 ::;",:0.00001' ! ;;1 4270686.000~ If the repeated units are included in the sample then an estimate of the average real estate farm loans in the United States is given by _ 1 n 10429.172 Y =- l: y. = =521.4586 n n i= 1 I 20 and an estimate of the finite population variance is given by s 2 =_1_ f( ,_- )2 = 4270686 =224772.9474 . y I y, Yn 20 1n - ,~I - Clearly estimates of the average and finite population variance remain under estimate if repeated units are included in the sample. For details on distinct units one can refer to Raj and Khamis (1958), Pathak (1962) and Pathak (1966) . Some comparisons of SRSWR and SRSWOR sampling schemes have also been considered by several other researchers viz. Deshpande (1980) , Ramakrishnan (1969) and Seth and Rao (1964), and Basu (1958) . 143. 118 Advanced sampling theory with applications 2.6'ESTIMkTION OF K,POPULA Sometime we are interested in estimating the total or mean value of a variable of interest within a subgroup or part of the population. 
Such a part or subgroup of a population is called the domain of interest. For example, in a statewide survey a district may be considered as a domain. After completing the survey sampling process for the whole population, one may be interested in estimating the mean or total of a particular subgroup of the population. For example:

Population                  Domain
United States population    Employed
New York                    Unemployed
Retailers                   Supermarkets
All workers in a firm       Part-time workers

Let $D$ be the domain of interest and $N_D$ be the number of units in this domain. Let
$$Y_D = \sum_{i=1}^{N_D}Y_i \quad \text{and} \quad \bar{Y}_D = \frac{1}{N_D}\sum_{i=1}^{N_D}Y_i$$
be the total and mean for the domain $D$, respectively. Suppose we select an SRSWOR sample $s$ of $n$ units from the entire population $\Omega$, and $n_D \le n$ of the selected units belong to the domain $D$ of interest. In certain situations the value of $N_D$ is known, and in other situations it is unknown. We shall discuss both situations as follows. Define a variable
$$y_i^* = \begin{cases} Y_i & \text{if } i \in D, \\ 0 & \text{if } i \notin D. \end{cases}  \tag{2.6.1}$$
Then we have $\sum_{i=1}^{N}y_i^* = Y_D = Y$ (say).

Case I. When $N_D$ is unknown, we have the following theorems.

Theorem 2.6.1. Under SRS sampling an unbiased estimator of the total $Y_D$ of the subgroup (or domain $D$) is given by
$$\hat{Y}_D = \frac{N}{n}\sum_{i=1}^{n}y_i^* .  \tag{2.6.2}$$
Proof. Taking expected values on both sides of (2.6.2), we have
$$E(\hat{Y}_D) = \frac{N}{n}\sum_{i=1}^{n}E(y_i^*) = \frac{N}{n}\sum_{i=1}^{n}\sum_{j=1}^{N}y_j^*\Pr(y_i = y_j) = \frac{N}{n}\sum_{i=1}^{n}\frac{1}{N}\sum_{j=1}^{N}y_j^* = \sum_{j=1}^{N}y_j^* = Y_D .$$
Hence the theorem.
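Theorem 2.6.1's unbiasedness can also be verified numerically by enumerating all SRSWOR samples of a small population; the values and the domain below are hypothetical, chosen only for illustration:

```python
from itertools import combinations

# Hypothetical toy population of N = 6 values; domain D = units 0, 2, 4.
y = [12, 7, 9, 15, 4, 11]
D = {0, 2, 4}
N, n = len(y), 2
Y_D = sum(y[i] for i in D)  # true domain total: 12 + 9 + 4 = 25

y_star = [y[i] if i in D else 0 for i in range(N)]  # y*_i of (2.6.1)
ests = [N / n * sum(y_star[i] for i in s) for s in combinations(range(N), n)]
print(sum(ests) / len(ests))  # → 25.0, i.e. E(Y_D-hat) = Y_D (Theorem 2.6.1)
```

Averaging the estimator $\hat{Y}_D = (N/n)\sum y_i^*$ over all $\binom{6}{2} = 15$ equally likely samples returns the domain total exactly, as the theorem asserts, even though individual samples may contain no domain units at all.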
An unbiased estimator of the V(YD ) under SRSWOR sampling is given by .(y' )_N 2 (1_ f) 2 v D - SD, n Proof. Obvious. Case II. When ND is known. Here we need the following lemma. (2.6.5) Lemma 2.6.1. If i E SD indicates that the t" unit is in the sample sub group of D then, we have (. ) noPrf E SDInD> 0 = - . ND Proof. We have Pr(i E SD InD > 0) Pr(i E SD, 1I1D) Number of samples of sizes nDwith i E SD = Pr(nD) = Number of samples of sizes nD Number of ways (nD - I) can be chosen from (ND- I) and (n- nD)from (N- ND) Number of ways nD can be chosen from NDand (n- nD) from (N- ND) Then we have the following theorems Theorem 2.6.4. Under SRS sampling a biased estimator of YD is given by YD = lo~: ;~y;' if no > 0, (2.6.6) otherwise. Its relative bias is equal to the negative of the probability that nD is equal to zero. Proof. Here nD is not fixed before taking the sample and therefore is a random variable. Thus we have E(YD)= E1[E2(YDInD)] 145. 120 Advanced sampling theory with applications Now when nD > 0 then we have E(YD) = E2[(YDIno > 0)] = E2[:~ i~/t* Ino > 0] = E2[ ND Ili InD > 0]' where SD indicates sample subgroup of D nt: iESD [ ND ] {I if iESO' =E2 -- "i.JiYi InD > 0 , where Ii = . nDiED 0 otherwise, = ND [ I E2 (Ii inD > 0)Yi] = ND [ I Yi Pr(Ii E SD inD > 0)] no iED no iED = ND [IYiX nD] = Z:Yi =Yo . no iED ND ieO Therefore (, )_!E(Yo Ino >0)=Yo , E Yo - (, ) E Yo Ino = 0 = O. Thus we have E(YD) =E,[E2(YDInD)] = YD Pr[nD > 0]+ OPr(nD =0) =Yo Pr[no > OJ Thus the relative bias in the estimator YD ,when No is unknown, is given by RB(Yo )=E(Yo)- Yo =Yo Pr(no > 0)- Yo = Pr(no > 0)-1 =-Pr(no =0). Yo Yo Hence the theorem. To find the variance of the estimator Yo we need the following lemma. Lemma 2.6.2. Show that E( 1 I oj 1 1- Po where Po __ No .- no> =-+-2-2' no nPo n Po N Proof. We have 1 1 -;;; = n(n: ) = n(n: - Po ) + nPo (2.6.7) 146. 
Writing
$$\frac{1}{n_D} = \frac{1}{nP_D}\left[1 + \frac{n_D/n - P_D}{P_D}\right]^{-1} \approx \frac{1}{nP_D}\left[1 - \frac{n_D/n - P_D}{P_D} + \frac{(n_D/n - P_D)^2}{P_D^2}\right]$$
and taking expected values on both sides, with $E(n_D/n) \approx P_D$ and
$$E\left(\frac{n_D}{n} - P_D\right)^2 \approx \frac{P_D(1-P_D)}{n},  \tag{2.6.8}$$
we obtain
$$E\left(\frac{1}{n_D}\,\middle|\, n_D > 0\right) \approx \frac{1}{nP_D} + \frac{1-P_D}{n^2P_D^2} .$$
Hence the lemma. For more details about the expected value of an inverse random variable one can refer to Stephan (1945).

Theorem 2.6.5. Show that the variance of the estimator $\hat{Y}_D$, when $N_D$ is known, is
$$V(\hat{Y}_D) \approx \Pr(n_D > 0)\left[\left\{\frac{P_DN^2(1-f)}{n} + \frac{N^2}{n^2}(1-P_D)\right\}S_D^2 + \Pr(n_D = 0)Y_D^2\right].  \tag{2.6.9}$$
Proof. We have
$$V(\hat{Y}_D) = E_1[V_2(\hat{Y}_D \mid n_D)] + V_1[E_2(\hat{Y}_D \mid n_D)] .  \tag{2.6.10}$$
Now
$$E_2(\hat{Y}_D \mid n_D) = \begin{cases} Y_D & \text{if } n_D > 0, \\ 0 & \text{if } n_D = 0, \end{cases} \quad \text{and} \quad V_2(\hat{Y}_D \mid n_D) = \begin{cases} N_D^2\left(\dfrac{1}{n_D} - \dfrac{1}{N_D}\right)S_D^2 & \text{if } n_D > 0, \\ 0 & \text{if } n_D = 0. \end{cases}  \tag{2.6.11}$$
For the second component of (2.6.10),
$$V_1[E_2(\hat{Y}_D \mid n_D)] = V_1[Y_D I(n_D > 0)] = Y_D^2\Pr(n_D > 0)\{1 - \Pr(n_D > 0)\} = Y_D^2\Pr(n_D > 0)\Pr(n_D = 0).  \tag{2.6.12}$$
For the first component,
$$E_1[V_2(\hat{Y}_D \mid n_D)] = \sum_{j=1}^{\min(N_D, n)}\Pr(n_D = j)\,N_D^2\left(\frac{1}{j} - \frac{1}{N_D}\right)S_D^2 = \Pr(n_D > 0)\,N_D^2S_D^2\left\{E\left(\frac{1}{n_D}\,\middle|\, n_D > 0\right) - \frac{1}{N_D}\right\}$$
$$\approx \Pr(n_D > 0)\,N_D^2S_D^2\left\{\frac{1}{nP_D} + \frac{1-P_D}{n^2P_D^2} - \frac{1}{N_D}\right\} = \Pr(n_D > 0)\left\{\frac{P_DN^2(1-f)}{n} + \frac{N^2(1-P_D)}{n^2}\right\}S_D^2 ,  \tag{2.6.13}$$
where $P_D = N_D/N$ and $f = n/N$. On using (2.6.12) and (2.6.13) in (2.6.10), we obtain the stated result.
Hence the theorem.

Theorem 2.6.6. An estimator for estima