stratification case study to illustrate alternative methods to stratify a sampling frame


Upload: masao

Post on 09-Feb-2016


TRANSCRIPT

Page 1: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-1

Stratification
Case study to illustrate alternative methods to stratify a sampling frame
Dr. Will Yancey, CPA

This material is the property of the presenter and cannot be reproduced or used without the expressed written consent of the presenter.

Page 2: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-2

Outline
A. Why stratify?
B. Coefficient of Variation (CV)
C. High and Low Thresholds
D. Number of strata
E. Strata Boundary Determination

Case study data for this presentation: 185,083 rows of purchase invoice line items.

Page 3: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-3

A. Why stratify?
Parable of the Footballs and the Fish

You are asked to determine the weight of 1,000 footballs. You know they are identical in weight. You can weigh only one football at a time. How many must you weigh?

You are asked to determine the weight of 1,000 different fish taken from a lake. They are highly variable in weight. You can weigh only one fish at a time. How many must you weigh?

Page 4: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-4

Parable continued

How could we organize the fish so we could get a reasonable estimate without weighing them all?

What feature would we use to organize the fish?

What features would probably not be useful for estimating total weight?

How many piles should we have?

Page 5: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-5

Effective Stratification

Effective stratification: If possible, what we are measuring is similar within each stratum and different between strata.

Stratifying (grouping, categorization, segmenting, etc.)
  Grouping by account, type, division, or other attribute.
  Stratifying by dollar amount within group.

A sales and use tax audit goal is to estimate total tax dollar error.

Correlation of invoice line amounts with taxability or errors:
  If an error occurs, it is proportional to invoice line amount.
  The relative frequency of error occurrence might or might not be correlated with invoice amount.

Page 6: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-6

Accounts Payable Case Study Data
185,083 rows of invoice line items
Range $0.01 to $26,763,476
$493 million total population base
4% of items with amount ≥ $10,000

[Chart: Distribution of Item Count. X-axis: Range in $1,000 (bins 1 to 10). Y-axis: Count (0 to 160,000).]

Page 7: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-7

A/P Case Study: Distribution of $

[Chart: Distribution of Base Dollars. X-axis: Range in $1,000 (bins 1 to 11). Y-axis: Base dollars (0 to 400,000,000). Annotation: > $10K.]

The 4% of items with amount ≥ $10,000 contain $376 million of the $493 million in population base = 76%.

Page 8: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-8

B. Coefficient of Variation (CV)

CV is a relative measure of the dispersion around the mean.

Dollar stratification results in lower CV within each stratum than in the combined unstratified sampling frame.

Caution: When the mean is close to zero, CV is very sensitive to small changes.

CV = standard deviation / mean = σ / μ
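As a minimal sketch of the CV definition on this slide, the Python below divides the standard deviation by the mean. The function name, the toy weights, and the use of the population standard deviation (rather than the sample version) are assumptions for illustration; the slides do not say which was used.

```python
from statistics import mean, pstdev

def coefficient_of_variation(amounts):
    """Return CV = sigma / mu, using the population standard deviation (an assumption)."""
    mu = mean(amounts)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(amounts) / mu

# Hypothetical weights echoing the parable: identical footballs, variable fish.
footballs = [14.5] * 5
fish = [0.5, 2.0, 4.0, 9.0, 30.0]

print(coefficient_of_variation(footballs))
print(coefficient_of_variation(fish))
```

Identical items give a CV of zero, which is why one football is enough, while the widely spread fish weights give a CV above 1.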

Page 9: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-9

CV, stratification, and precision

Reducing CV usually improves precision. (Remember Parable of Footballs and Fish.)

For each stratum compute the CV of the items' invoice line amounts.

For a specific total sample size and stratified random sampling, the best precision usually occurs when the CV are relatively constant across the strata.
  Consider adjusting strata boundaries or adding more strata to adjust CV across the strata.
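The per-stratum CV computation described above could be sketched as follows. The function name and the toy amounts are hypothetical, and the population standard deviation is assumed. Each stratum includes its lower bound and excludes its upper bound, matching the "Lower Bound ≥, Upper Bound <" convention used in the case study tables.

```python
from bisect import bisect_right
from statistics import mean, pstdev

def cv_by_stratum(amounts, boundaries):
    """Assign each amount to a stratum and return the CV of each stratum.

    boundaries: ascending cut points. An amount equal to a cut point
    falls in the stratum above it (lower bound inclusive).
    """
    strata = [[] for _ in range(len(boundaries) + 1)]
    for x in amounts:
        strata[bisect_right(boundaries, x)].append(x)
    result = []
    for items in strata:
        if not items or mean(items) == 0:
            result.append(None)  # CV is undefined for an empty or zero-mean stratum
        else:
            result.append(pstdev(items) / mean(items))
    return result

# Hypothetical invoice line amounts, cut at L = 100, 1,000, and H = 100,000.
amounts = [5, 50, 150, 600, 900, 2500, 40000, 250000]
print(cv_by_stratum(amounts, [100, 1000, 100000]))
```

On this toy data every stratum CV is below the unstratified CV, which is the effect the slide describes.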

Page 10: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-10

Case Study: Coefficient of Variation

Lower Bound ≥ | Upper Bound < | Size (count items) | Standard Deviation σ | Mean μ | Coefficient of Variation CV = σ / μ
0 | $27 million (Unstratified) | 185,083 | 86,331 | 2,663 | 3242%
0 | 1,000 | 146,425 | 260 | 234 | 111%
1,000 | 2,000 | 19,300 | 256 | 1,281 | 20%
2,000 | 3,000 | 3,897 | 286 | 2,435 | 12%

Page 11: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-11

C. High and Low Thresholds

All items with dollar amount greater than High Threshold (H) will be detailed (actual basis exam) rather than sampled.

"This removal of the extremes from the main body of the population reduces the skewness and improves the normal approximation." Cochran, Sampling Techniques, 3rd Edition, p. 44.

Page 12: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-12

Setting High Threshold (H)
also known as ceiling, detail threshold

Approximately top 0.1% to 0.2% of items (or some other %).

Greater than 3 standard deviations from the unstratified population mean.

As H decreases, the number of items in the detail stratum increases.

Items above H are from relatively few major vendors or major projects.

Page 13: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-13

Case Study: High Threshold

If H is: | Count ≥ H | % Population Size ≥ H | Base $ ≥ H | % Base $ ≥ H
1,000,000 | 48 | 0.03% | 128,545,014 | 26%
500,000 | 158 | 0.09% | 197,766,292 | 40%
250,000 | 230 | 0.12% | 223,256,301 | 45%
100,000 | 370 | 0.20% | 242,946,614 | 49%
50,000 | 638 | 0.34% | 261,323,166 | 53%

Population Size = 185,083. Population Base = $492,953,742.

Exhibits in this presentation: H = $100,000.
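A table like the one above could be produced from the raw amounts along these lines. This is a sketch with a hypothetical function name and toy data, not the presenter's actual procedure.

```python
def high_threshold_summary(amounts, thresholds):
    """For each candidate H, count items >= H and their share of items and dollars."""
    n, base = len(amounts), sum(amounts)
    rows = []
    for h in thresholds:
        above = [x for x in amounts if x >= h]
        rows.append({
            "H": h,
            "count": len(above),
            "pct_count": 100 * len(above) / n,
            "base": sum(above),
            "pct_base": 100 * sum(above) / base,
        })
    return rows

# Hypothetical amounts; the real frame has 185,083 rows.
for row in high_threshold_summary([50, 150, 400, 400], [400, 100]):
    print(row)
```

The same percentages can be checked against the slide: 370 / 185,083 is about 0.20% of items, and 242,946,614 / 492,953,742 is about 49% of base dollars.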

Page 14: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-14

Low Threshold (L)
also known as Floor or Basement

Accounting transaction data files have many small dollar items, particularly for purchases with invoice line items.
  Delivery charges, processing fees, etc.

Some sampling plans set a Low Threshold (L) such that the items below L are:
a. Excluded (no change), or
b. Sampled at a minimum sample size, or
c. Estimated by projecting results from other sampled strata onto the stratum below L.

Page 15: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-15

Low Threshold (L) - criteria

Policy for setting L depends on what will be done with items below L.

Possible criteria for setting a value for L:
a. Less than 1% or 2% of population dollars are below L (or some other %).
b. Greater than 3 standard deviations below the unstratified population mean.
c. Divide H by 1,000.
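Criteria a and c can be sketched as below; the function name and toy data are hypothetical. Criterion b cannot produce a usable value for the case study data, because 2,663 − 3 × 86,331 is negative. Criterion c with H = 100,000 gives L = 100.

```python
def low_threshold_by_dollar_share(amounts, max_share=0.01):
    """Largest observed amount L such that items strictly below L
    hold at most max_share of total dollars (criterion a)."""
    total = sum(amounts)
    below = 0.0          # dollars in items strictly below the current candidate
    best = min(amounts)  # nothing lies below the smallest amount
    for x in sorted(amounts):
        if x > best:     # x is a new, larger candidate value
            if below <= max_share * total:
                best = x
            else:
                break
        below += x
    return best

# Criterion c: divide H by 1,000.
low_from_high = 100_000 / 1_000

# Hypothetical amounts: items below 10 hold 9 of 1,119 dollars (under 1%).
print(low_threshold_by_dollar_share([1, 1, 2, 5, 10, 100, 1000], 0.01))
print(low_from_high)
```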

Page 16: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-16

Case Study: Low Threshold

If L is: | Count < L | % Population Count < L | Base $ < L | % Base $ < L
10 | 7,320 | 4% | 40,159 | 0.01%
25 | 19,472 | 11% | 248,458 | 0.05%
50 | 37,231 | 20% | 887,503 | 0.18%
100 | 65,128 | 35% | 2,792,319 | 0.57%
200 | 90,275 | 49% | 6,441,517 | 1.31%

Exhibits in this presentation: L = $100.

Page 17: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-17

D. Number of Sampled Strata

Adding more strata
  Reduces CV within stratum.
  Minimum sample size per stratum may result in total sample that exceeds budget.
  More than 6 strata probably does not improve precision [Neter and Loebbecke, Behavior of Major Statistical Estimators in Sampling Accounting Populations (AICPA, 1975)].

Pragmatic approach: Start with 3 strata and then add or delete strata as needed to achieve desired precision, CV, or other criteria.

Page 18: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-18

E. Strata Boundary Determination

Precision is a function of strata boundaries combined with other attributes in population and the sampling plan.

Unless otherwise stated, the following case study shows:
  Five strata = 3 sampled strata + Low + High
  Low Threshold (L) = 100
  High Threshold (H) = 100,000

Page 19: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-19

Equal Population Size
Nearly equal population size in sampled strata 2, 3, and 4

Stratum | Lower Bound ≥ | Upper Bound < | % Pop. Size | % Pop. Base $ | CV
1 | 0 | L = 100 | 35.2% | 0.6% | 61.6%
2 | L = 100 | 288.38 | 21.5% | 1.5% | 29.6%
3 | 288.38 | 988.36 | 21.8% | 4.7% | 37.6%
4 | 988.36 | H = 100,000 | 21.3% | 44.0% | 152.7%
5 | H = 100,000 | 27,000,000 | 0.2% | 49.3% | 276.6%

Observe: CV varies greatly across strata 2, 3, and 4.
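One way to obtain equal-count boundaries like those in this table is to sort the amounts between L and H and cut at equally spaced positions. The sketch below assumes that approach; the function name and the toy data are hypothetical, and the presenter's exact tie-handling is not stated.

```python
def equal_count_boundaries(amounts, low, high, n_strata=3):
    """Cut points that split the items in [low, high) into n_strata
    groups of nearly equal item count."""
    middle = sorted(x for x in amounts if low <= x < high)
    cuts = [middle[j * len(middle) // n_strata] for j in range(1, n_strata)]
    return [low] + cuts + [high]

# Hypothetical frame: amounts 1..1000, with L = 100 and H = 700.
print(equal_count_boundaries(list(range(1, 1001)), 100, 700))
```

Because invoice amounts are heavily right-skewed, the top sampled stratum from this method spans a very wide dollar range, which is consistent with the high CV in stratum 4.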

Page 20: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-20

Equal Population Base $
Nearly equal population base $ in sampled strata 2, 3, and 4

Stratum | Lower Bound ≥ | Upper Bound < | % Pop. Size | % Pop. Base $ | CV
1 | 0 | L = 100 | 35.2% | 0.6% | 61.6%
2 | L = 100 | 5,791.79 | 58.6% | 16.7% | 115.1%
3 | 5,791.79 | 15,845.95 | 4.3% | 16.7% | 27.0%
4 | 15,845.95 | H = 100,000 | 1.6% | 16.7% | 56.5%
5 | H = 100,000 | 27,000,000 | 0.2% | 49.3% | 276.6%

Observe: CV varies greatly across strata 2, 3, and 4.
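Equal-dollar boundaries can be approximated by walking the sorted amounts and cutting each time the running dollar total passes another equal share. This is a sketch under that assumption; the function name and toy data are hypothetical, and heavy ties at a cut value would make the split only approximately equal.

```python
def equal_dollar_boundaries(amounts, low, high, n_strata=3):
    """Cut points that split the items in [low, high) into n_strata
    groups holding nearly equal total dollars."""
    middle = sorted(x for x in amounts if low <= x < high)
    target = sum(middle) / n_strata  # dollars per stratum
    cuts, running = [], 0.0
    for x in middle:
        # Start a new stratum at x once the dollars below x fill the current share.
        if len(cuts) < n_strata - 1 and running >= target * (len(cuts) + 1):
            cuts.append(x)
        running += x
    return [low] + cuts + [high]

# Hypothetical amounts: six items of 100, three of 200, one of 600 (600 dollars each group).
amounts = [100] * 6 + [200] * 3 + [600]
print(equal_dollar_boundaries(amounts, 100, 1000))
```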

Page 21: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-21

Cumulative Square Root (CSR) Method

Developed by Tore Dalenius, a Swedish statistician, in the 1950s with the warning that it will not do well with all distributions.

See numerical example in New York State CAA Manual, Publication 132, www.tax.state.ny.us/pdf/publications/sales/pub132_1001.pdf, pages 17-19.

Cumulative square root method can be distorted when the boundaries begin from zero and there are lots of small $ items (such as under $10).
  Mitigate by setting L threshold greater than zero.
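The slides do not spell out the computation, so the sketch below follows the usual cumulative square root of frequency rule: bin the amounts into equal-width intervals, accumulate the square root of each bin's frequency, and cut where the cumulative total crosses equal shares. The function name, the number of bins, and the toy data are assumptions; results depend on the bin width chosen.

```python
from math import sqrt

def csr_boundaries(amounts, low, high, n_strata=3, n_bins=100):
    """Strata boundaries in [low, high) by the cumulative sqrt(frequency) rule."""
    width = (high - low) / n_bins
    freq = [0] * n_bins
    for x in amounts:
        if low <= x < high:
            freq[min(int((x - low) / width), n_bins - 1)] += 1
    cum, running = [], 0.0
    for f in freq:
        running += sqrt(f)
        cum.append(running)
    total = cum[-1]
    cuts = []
    for k in range(1, n_strata):
        target = k * total / n_strata
        # First bin whose cumulative sqrt(f) reaches the k-th equal share.
        i = next(i for i, c in enumerate(cum) if c >= target)
        cuts.append(low + (i + 1) * width)  # cut at that bin's upper edge
    return [low] + cuts + [high]

# Hypothetical uniform amounts 0..999 split into 2 strata using 10 bins.
print(csr_boundaries(list(range(1000)), 0, 1000, n_strata=2, n_bins=10))
```

With many tiny items and a lower edge of zero, the first few bins dominate the cumulative total, which illustrates the distortion the slide warns about.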

Page 22: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-22

Cumulative Square Root with Zero Low Threshold
L = zero. 4 sampled strata. 1 detail stratum.

Stratum | Lower Bound ≥ | Upper Bound < | % Pop. Size | % Pop. Base $ | CV
1 | L = 0 | 873.24 | 75.2% | 5.6% | 105.6%
2 | 873.24 | 5,221.52 | 18.3% | 11.1% | 57.6%
3 | 5,221.52 | 16,194.02 | 4.7% | 17.8% | 29.7%
4 | 16,194.02 | H = 100,000 | 1.6% | 16.2% | 56.0%
5 | H = 100,000 | 27,000,000 | 0.2% | 49.3% | 276.6%

Observe: CV varies greatly across strata 1, 2, 3, and 4.

Page 23: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-23

Cumulative Square Root with $100 Low Threshold
L = 100. Between L and H has 3 sampled strata

Stratum | Lower Bound ≥ | Upper Bound < | % Pop. Size | % Pop. Base $ | CV
1 | 0.01 | L = 100 | 35.2% | 0.6% | 61.6%
2 | L = 100 | 1,986.90 | 54.3% | 11.4% | 78.2%
3 | 1,986.90 | 13,036.45 | 7.8% | 17.2% | 56.6%
4 | 13,036.45 | H = 100,000 | 2.5% | 21.5% | 60.7%
5 | H = 100,000 | 27,000,000 | 0.2% | 49.3% | 276.6%

Observe: CV is closer across strata 2, 3, and 4.
Setting an appropriate L has improved the stratification.

Page 24: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-24

Geometric Ratio Method

Developed by Will Yancey with co-authors Jane Horgan and Patricia Gunning at Dublin City University in Ireland in 2003.

Assumes population distribution declines at a relatively constant rate.

Requires setting thresholds L and H.
  R = H / L = 100,000 / 100 = 1,000
  For J=3 strata: r = R ^ (1/J) = 1,000 ^ (1/3) = 10.0
  For J=4 strata: r = R ^ (1/J) = 1,000 ^ (1/4) = 5.623
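The formulas above translate directly into code: each boundary is the previous one multiplied by r. The function name is hypothetical; the inputs L = 100 and H = 100,000 are the case study thresholds.

```python
def geometric_boundaries(low, high, n_strata):
    """Boundaries low * r**j for j = 0..n_strata, with r = (high / low) ** (1 / n_strata)."""
    r = (high / low) ** (1 / n_strata)
    return [low * r ** j for j in range(n_strata + 1)]

# Case study thresholds with J = 3 and J = 4 sampled strata, rounded to whole dollars.
print([round(b) for b in geometric_boundaries(100, 100_000, 3)])
print([round(b) for b in geometric_boundaries(100, 100_000, 4)])
```

Rounded to whole dollars, J = 3 gives 100, 1,000, 10,000, 100,000 and J = 4 gives 100, 562, 3,162, 17,783, 100,000.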

Page 25: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-25

Geometric Ratio with 3 sampled strata
Ratio upper to lower boundary is r=10 in strata 2, 3, and 4.

Stratum | Lower Bound ≥ | Upper Bound < | % Pop. Size | % Pop. Base $ | CV
1 | 0 | L = 100 | 35.2% | 0.6% | 61.6%
2 | L = 100 | 1,000 | 43.9% | 6.4% | 67.6%
3 | 1,000 | 10,000 | 16.9% | 16.7% | 87.2%
4 | 10,000 | H = 100,000 | 3.8% | 27.1% | 65.5%
5 | H = 100,000 | 27,000,000 | 0.2% | 49.3% | 276.6%

Observe: CV is relatively similar across strata 2, 3, and 4.

Page 26: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-26

Geometric Ratio with 4 sampled strata
Ratio upper to lower boundary is r=5.623 in strata 2, 3, 4, and 5.

Stratum | Lower Bound ≥ | Upper Bound < | % Pop. Size | % Pop. Base $ | CV
1 | 0 | L = 100 | 35.2% | 0.6% | 61.6%
2 | L = 100 | 562 | 33.4% | 3.3% | 48.2%
3 | 562 | 3,162 | 23.3% | 10.4% | 46.0%
4 | 3,162 | 17,783 | 6.7% | 22.5% | 45.5%
5 | 17,783 | H = 100,000 | 1.2% | 14.1% | 53.4%
6 | H = 100,000 | 27,000,000 | 0.2% | 49.3% | 276.6%

Observe: Adding more strata lowers the CV.

Page 27: Stratification Case study to illustrate alternative methods to stratify a sampling frame

VII-27

Summary of Stratification Procedures

1. Set a High Threshold (H).
2. Set a Low Threshold (L).
3. Choose number of strata.
4. Set boundaries with a method.
5. Compute CV in each stratum.
6. Adjust by changing L, H, boundaries, adding or deleting strata.
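For steps 5 and 6, one simple way to compare boundary methods is the ratio of the largest to the smallest CV across the sampled strata; a ratio near 1 means the CVs are nearly constant. This metric is an assumption made for illustration and does not come from the presentation. The CV values below are the sampled-strata figures from the case study tables (with L = 100 and H = 100,000).

```python
def cv_spread(cvs):
    """Ratio of largest to smallest CV; closer to 1 means more uniform CVs."""
    return max(cvs) / min(cvs)

# Sampled-strata CVs (in %) from the case study tables.
case_study = {
    "Equal population size": [29.6, 37.6, 152.7],
    "Equal population base $": [115.1, 27.0, 56.5],
    "Cumulative square root, L = 100": [78.2, 56.6, 60.7],
    "Geometric ratio, 3 strata": [67.6, 87.2, 65.5],
    "Geometric ratio, 4 strata": [48.2, 46.0, 45.5, 53.4],
}

for method, cvs in sorted(case_study.items(), key=lambda kv: cv_spread(kv[1])):
    print(f"{method}: {cv_spread(cvs):.2f}")
```

By this measure the geometric ratio and cumulative square root boundaries give much more uniform CVs than the equal-size or equal-dollar splits, in line with the "Observe" notes on the slides.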