How to Spot Bad Data: Benford’s Law

How to Spot Bad Data Peter O’Reilly


Post on 21-Apr-2022


Page 1: How to Spot Bad Data Benford’s Law

How to Spot Bad Data

Peter O’Reilly

Page 2: How to Spot Bad Data Benford’s Law
Page 3: How to Spot Bad Data Benford’s Law

Theory & Application

Page 4: How to Spot Bad Data Benford’s Law
Page 5: How to Spot Bad Data Benford’s Law
Page 6: How to Spot Bad Data Benford’s Law

Simon Newcomb

Page 7: How to Spot Bad Data Benford’s Law

Frank Benford

Page 8: How to Spot Bad Data Benford’s Law

“mere curious observation”

Page 9: How to Spot Bad Data Benford’s Law
Page 10: How to Spot Bad Data Benford’s Law
Page 11: How to Spot Bad Data Benford’s Law

Chuck Zlotnick/Warner Brothers Pictures

Page 12: How to Spot Bad Data Benford’s Law

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \quad d = 1, 2, \dots, 9$$

Page 13: How to Spot Bad Data Benford’s Law

First non-zero digit, d    Probability according to Benford’s Law, P(d)
1                          0.3010
2                          0.1761
3                          0.1249
4                          0.0969
5                          0.0792
6                          0.0669
7                          0.0580
8                          0.0512
9                          0.0458
Total                      1.0000
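The table values follow directly from the formula P(d) = log10(1 + 1/d). A minimal Python sketch that reproduces them (the function name `benford_probability` is illustrative, not from the slides):

```python
import math


def benford_probability(d: int) -> float:
    """Expected share of values whose first significant digit is d (1 to 9)."""
    if not 1 <= d <= 9:
        raise ValueError("first significant digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Build the full table for digits 1 through 9.
benford_table = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford_table.items():
    print(f"{d}  {p:.4f}")
print(f"Total  {sum(benford_table.values()):.4f}")
```

The probabilities sum to 1 because the nine terms telescope to log10(10).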

Page 14: How to Spot Bad Data Benford’s Law

First Significant Digit Law

a.k.a. Benford's Law

Page 15: How to Spot Bad Data Benford’s Law

SIGNIFICANT DIGIT

• All non-zero digits are significant: 1, 2, 3, 4, 5, 6, 7, 8, 9

• Zeros between non-zero digits are significant: 305, 6002, 70008

• Leading zeros are never significant: 0.01, 0.000424

• In a number with a decimal point, trailing zeros are significant: 1.01000, 2.200, 36.5400

Page 16: How to Spot Bad Data Benford’s Law

Red digits are significant

4210

505

2190.30

0.09

0.23
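For Benford’s Law only the first significant digit matters, which by the rules above is simply the first non-zero digit. A small Python sketch, assuming values arrive as numbers or numeric strings (the helper name `first_significant_digit` is hypothetical):

```python
def first_significant_digit(value) -> int:
    """Return the first non-zero digit, skipping sign, leading zeros and the decimal point."""
    for ch in str(value):
        if ch in "123456789":
            return int(ch)
    raise ValueError(f"{value!r} has no non-zero digit")


# The example values from the slides.
for example in ["4210", "505", "2190.30", "0.09", "0.23"]:
    print(example, "->", first_significant_digit(example))
```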

Page 17: How to Spot Bad Data Benford’s Law

Benford’s Law is best applied to data with:

• Random sampling

• Large sample size

• Sufficient variability

• No bounded maximum value

• Numbers based on counting or measuring

Page 18: How to Spot Bad Data Benford’s Law

No-go for:

• Sequentially assigned numbers: e.g. check numbers, invoice numbers, purchase order numbers

• Numbers influenced by human thought: e.g. psychological price-setting thresholds ($9.99)

• Accounts with a large number of firm-specific numbers: e.g. accounts set up to record $10 refunds

• Accounts with a minimum or maximum
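To see why sequentially assigned numbers are a poor fit, consider a hypothetical run of check numbers from 1001 to 5000 (the range is made up for illustration). A short Python sketch of their leading-digit shares:

```python
from collections import Counter

# Hypothetical check numbers issued in sequence, 1001 through 5000.
check_numbers = range(1001, 5001)

# Tally the leading digit of each check number.
leading_digits = Counter(int(str(n)[0]) for n in check_numbers)

# Share of each leading digit; digits that never occur get 0.
shares = {d: leading_digits.get(d, 0) / len(check_numbers) for d in range(1, 10)}

for d, share in shares.items():
    print(f"{d}  {share:.2%}")
```

Digits 1 to 4 each take roughly a quarter of the total and digits 6 to 9 never appear, so a mismatch with Benford’s Law here says nothing about data quality.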

Page 19: How to Spot Bad Data Benford’s Law
Page 20: How to Spot Bad Data Benford’s Law

=LEFT(text,[num_chars])

LEFT returns the first character or characters in a text string, based on the number of characters you specify.

Page 21: How to Spot Bad Data Benford’s Law
Page 22: How to Spot Bad Data Benford’s Law

=COUNTIF(range, criteria)

The COUNTIF function counts the number of cells within a range that meet a single criterion that you specify.
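Outside Excel, the same two steps can be sketched in Python: slicing plays the role of LEFT and a `Counter` plays the role of COUNTIF. The amounts below are hypothetical and assumed to be stored as text with no leading zeros or signs:

```python
from collections import Counter

# Hypothetical amounts standing in for a worksheet column.
amounts = ["1250.00", "312.40", "18.75", "940.10", "1020.00", "27.30"]

# Equivalent of =LEFT(text, 1): take the first character of each entry.
first_chars = [text[:1] for text in amounts]

# Equivalent of one =COUNTIF(range, criteria) per digit "1" to "9".
tally = Counter(first_chars)
digit_counts = {d: tally.get(str(d), 0) for d in range(1, 10)}

print(digit_counts)
```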

Page 23: How to Spot Bad Data Benford’s Law
Page 24: How to Spot Bad Data Benford’s Law

First non-zero digit, d    Probability according to Benford’s Law, P(d)
1                          0.3010
2                          0.1761
3                          0.1249
4                          0.0969
5                          0.0792
6                          0.0669
7                          0.0580
8                          0.0512
9                          0.0458
Total                      1.0000

Page 25: How to Spot Bad Data Benford’s Law

Digit     Count    Actual Frequency    Expected Frequency
1         1,402    29.40%              30.10%
2           909    19.06%              17.61%
3           587    12.31%              12.49%
4           459     9.63%               9.69%
5           382     8.01%               7.92%
6           285     5.98%               6.69%
7           281     5.89%               5.80%
8           258     5.41%               5.12%
9           205     4.30%               4.58%
Totals    4,768    100%                100%
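The Actual Frequency column is each count divided by the total, and the Expected Frequency column is the Benford probability. A Python sketch that recomputes both from the counts in the table and reports the gap between them (variable names are illustrative):

```python
import math

# Digit counts taken from the table above.
observed = {1: 1402, 2: 909, 3: 587, 4: 459, 5: 382, 6: 285, 7: 281, 8: 258, 9: 205}
total = sum(observed.values())

# Gap between actual and expected frequency for each digit, in percentage points.
gaps = {}
for d, count in observed.items():
    actual = 100 * count / total
    expected = 100 * math.log10(1 + 1 / d)
    gaps[d] = actual - expected
    print(f"{d}  {count:>5}  {actual:6.2f}%  {expected:6.2f}%  {gaps[d]:+.2f}")

# The digit whose actual frequency is furthest from the expected one.
largest_gap_digit = max(gaps, key=lambda d: abs(gaps[d]))
print("Largest gap at digit", largest_gap_digit)
```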

Page 26: How to Spot Bad Data Benford’s Law
Page 27: How to Spot Bad Data Benford’s Law

values generated using Excel’s RAND() function
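The chart for this slide is not preserved in the transcript. As a rough illustration of how uniformly random values behave under the first-digit test, the sketch below uses Python’s `random` module as a stand-in for Excel’s RAND() (an assumption; the seed and sample size are arbitrary):

```python
import random
from collections import Counter


def first_significant_digit(value: float) -> int:
    """Return the first non-zero digit of a number's text form."""
    for ch in repr(value):
        if ch in "123456789":
            return int(ch)
    raise ValueError("value has no non-zero digit")


# Uniform values in [0, 1), analogous to RAND(); exact zeros are skipped.
rng = random.Random(2022)
values = [v for v in (rng.random() for _ in range(50_000)) if v != 0.0]

counts = Counter(first_significant_digit(v) for v in values)
shares = {d: counts[d] / len(values) for d in range(1, 10)}

for d, share in shares.items():
    print(f"{d}  {share:.2%}")
```

Each digit lands near 1/9 (about 11%), far from the 30.1% that Benford’s Law expects for a leading 1, so uniformly random values stand out clearly under this test.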

Page 28: How to Spot Bad Data Benford’s Law

SQL Example (database query)

SELECT
    LEFT(deposit_amount, 1) AS Digit,
    COUNT(LEFT(deposit_amount, 1)) AS Digit_Count
FROM revenue_tax_collection
GROUP BY LEFT(deposit_amount, 1)
ORDER BY 1;
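To try the query without a database server, it can be run against an in-memory SQLite table from Python. This is a sketch under assumptions: SQLite has no LEFT() function, so `substr()` on the amount cast to text is swapped in, and the sample deposits are hypothetical values of at least 1 (amounts below 1 would report a leading "0").

```python
import sqlite3

# Hypothetical deposits; the table and column names come from the query above.
sample_deposits = [1250.0, 312.4, 18.75, 940.1, 1020.0, 27.3]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revenue_tax_collection (deposit_amount REAL)")
conn.executemany(
    "INSERT INTO revenue_tax_collection (deposit_amount) VALUES (?)",
    [(amount,) for amount in sample_deposits],
)

# substr(text, 1, 1) is used here in place of LEFT(text, 1).
rows = conn.execute(
    """
    SELECT substr(CAST(deposit_amount AS TEXT), 1, 1) AS Digit,
           COUNT(*) AS Digit_Count
    FROM revenue_tax_collection
    GROUP BY substr(CAST(deposit_amount AS TEXT), 1, 1)
    ORDER BY 1
    """
).fetchall()
conn.close()

for digit, digit_count in rows:
    print(digit, digit_count)
```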

Page 29: How to Spot Bad Data Benford’s Law

Recap

1. =LEFT()

2. =COUNTIF()

3. Plot bar chart
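The three recap steps can be sketched end to end in Python. The data here are the first 1,000 powers of 2, chosen only because they are a well-known sequence whose leading digits track Benford’s Law closely; a text bar chart stands in for the plotted one.

```python
import math
from collections import Counter

# Illustrative data set: 2^1 through 2^1000.
values = [2 ** n for n in range(1, 1001)]

# Step 1, like =LEFT(): take the leading digit of each value.
leading = [int(str(v)[0]) for v in values]

# Step 2, like =COUNTIF(): count how often each digit leads.
counts = Counter(leading)
actual = {d: counts[d] / len(values) for d in range(1, 10)}

# Step 3: a text bar chart of actual versus expected frequency.
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    bar = "#" * round(actual[d] * 100)
    print(f"{d}  {actual[d]:6.1%}  (expected {expected:6.1%})  {bar}")
```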

Page 30: How to Spot Bad Data Benford’s Law

Further considerations

• 2nd significant digit

• Chi-Square Test (χ²)

• Not absolute proof
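Two of these considerations can be sketched in Python. The first part applies Pearson’s chi-square goodness-of-fit statistic to the digit counts from the earlier frequency table and compares it with 15.507, the standard 5% critical value for 8 degrees of freedom. The second part computes the expected second-digit distribution using the standard generalization of Benford’s formula, which is not given in the slides. Function and variable names are illustrative.

```python
import math

# Digit counts from the actual-versus-expected frequency table.
observed = {1: 1402, 2: 909, 3: 587, 4: 459, 5: 382, 6: 285, 7: 281, 8: 258, 9: 205}
total = sum(observed.values())

# Pearson chi-square: sum of (observed - expected)^2 / expected over the nine digits.
chi_square = 0.0
for d, count in observed.items():
    expected_count = total * math.log10(1 + 1 / d)
    chi_square += (count - expected_count) ** 2 / expected_count

# 5% critical value of the chi-square distribution with 9 - 1 = 8 degrees of freedom.
CRITICAL_5PCT_DF8 = 15.507
print(f"chi-square = {chi_square:.2f}, critical value = {CRITICAL_5PCT_DF8}")
print("Consistent with Benford at 5%:", chi_square < CRITICAL_5PCT_DF8)


def second_digit_probability(d: int) -> float:
    """Expected share of values whose second significant digit is d (0 to 9)."""
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))


for d in range(10):
    print(f"second digit {d}: {second_digit_probability(d):.4f}")
```

For these counts the statistic falls below the critical value, so the test does not flag the data at the 5% level. In keeping with the last bullet, either outcome is a prompt for closer review, not proof.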

Page 31: How to Spot Bad Data Benford’s Law

Peter O’Reilly, MBA, CMFO, CTC, QPA

Red Bank CFO, former Jersey City Treasurer, Pension Actuary, Finance I.T.

Definitive Guide to Local Public Finance in New Jersey, 2019 publication, available at: njcmfo.com

[email protected]

Page 32: How to Spot Bad Data Benford’s Law
Page 33: How to Spot Bad Data Benford’s Law

References for copyrighted and public domain images, to comply with the respective terms of public use:

{source, image description (slide deck page)}

• pixabay.com

• “FAKE” (2), Fraud (2), file cabinets (4), thumbs up (17, 25), thumbs down (18, 26), curved arrow (28), left arrow (19), finger counting (21)

• wikipedia.com

• Islamic Republic of Iran flags and presidential candidates (8), Simon Newcomb (6)

• s9.com

• Frank Benford (7)

• Microsoft.com (https://www.microsoft.com/en-us/legal/intellectualproperty/permissions/default.aspx)

• Microsoft Excel logo (5, 19, 21)

• Chuck Zlotnick/Warner Brothers Pictures, https://www.thewrap.com/accountant-adds-up-real-review-ben-affleck/

• The Accountant movie screen shot (11)