How to Fake Data, if you must
Department of Statistics
Rachel Fewster
Who wants to fake data?
• Electoral finance returns…
• Toxic emissions reports…
• Business tax returns…
Land areas of world countries: real or fake?
[Figure: two tally charts of first digits 1 to 9 for land areas of world countries, one real and one fake]
This one seems more even…
This one has as many 1s as 5-9s put together!
This one is right!
Real land areas of world countries
[Figure: tally chart of first digits 1 to 9 for the real land areas]
11 of them begin with digits 1 – 4…
Only 5 begin with digits 5 – 9…
Friday’s Newspaper:
[Figure: tally chart of first digits 1 to 9 for 34 numbers from the newspaper]
10 out of 34 numbers began with a 1…
None out of 34 began with a 9!
The Curious Case of the Grimy Log-books
• In 1881, American astronomer Simon Newcomb noticed something funny about books of logarithm tables…
The books always seemed grubby on the first pages…
… but clean on the last pages
The first pages are for numbers beginning with digits 1 and 2…
The last pages are for numbers beginning with digits 8 and 9…
People seemed to look up numbers beginning with 1 and 2 more often than they looked up numbers beginning with 8 and 9.
Why?
Because numbers beginning with 1 and 2 are MORE COMMON than numbers beginning with 8 and 9!!
Newcomb’s Law
American Journal of Mathematics, 1881
30% of numbers begin with a 1 !!
< 5% of numbers begin with a 9 !!
The First Digits…
Over 30% of numbers begin with a 1
Only about 5% of numbers begin with a 9
Numbers beginning with a 1
Numbers beginning with a 9
There is the same “opportunity” for numbers to begin with 9 as with 1 …
but for some reason they don’t!
Chance of a number starting with digit d = log10((d + 1)/d)
0.301 = log10(2/1)
0.176 = log10(3/2)
0.125 = log10(4/3)
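The logarithmic rule is easy to tabulate. The following is a minimal sketch, not part of the talk; the function names benford_probability and expected_counts are made up for illustration. It lists the chance of each first digit and then the counts the rule would predict for 34 numbers (the size of the newspaper sample) and for 16 numbers (the size of the land-area sample).

```python
import math

def benford_probability(d: int) -> float:
    """Chance that a number starts with digit d (1 to 9) under the logarithmic rule."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10((d + 1) / d)

def expected_counts(n: int) -> dict[int, float]:
    """Expected number of values with each first digit among n numbers."""
    return {d: n * benford_probability(d) for d in range(1, 10)}

# Table of first-digit chances, from about 0.301 for d = 1 down to about 0.046 for d = 9.
for d in range(1, 10):
    print(d, round(benford_probability(d), 3))

# 34 numbers: expected count of leading 1s and leading 9s.
newspaper = expected_counts(34)
print(round(newspaper[1], 1), round(newspaper[9], 1))

# 16 numbers: expected count with first digit 1, 2, 3 or 4.
land_areas = expected_counts(16)
print(round(sum(land_areas[d] for d in (1, 2, 3, 4)), 1))
```

Under the rule, about 10 of 34 numbers should begin with a 1 and only 1 or 2 with a 9, and about 11 of 16 should begin with digits 1 to 4, which is in line with the tallies quoted earlier.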
Reactions to Newcomb’s law
Nothing!
…for 57 years!
Enter Frank Benford: 1938
Physicist with the General Electric Company
Assembled over 20,000 numbers and counted their first digits!
‘A study as wide as time and energy permitted.’
Populations
Numbers from newspapers
Drainage rates of rivers
Numbers from Reader’s Digest articles
Street addresses of American Men of Science
About 30% begin with a 1
About 5% begin with a 9
Benford gave the ‘law’ its name…
…but no explanation.
Anomalous numbers!!
“…The logarithmic law applies to outlaw numbers that are without known relationship,
rather than to those that follow an orderly course;
and so the logarithmic relation is essentially a Law of Anomalous Numbers.”
Explanations for Benford’s Law
• Numbers from a wide range of data sources have about 30% of 1’s, down to only 5% of 9’s.
• Benford called these ‘outlaw’ or ‘anomalous’ numbers. They include street addresses of American Men of Science, populations, areas, numbers from magazines and newspapers.
• Benford’s ‘orderly’ numbers don’t follow the law – like atomic weights and physical constants
What is the explanation?
Popular Explanations
• Scale Invariance
• Base Invariance
• Complicated Measure Theory
• Divine choice
• Mystery of Nature
These two (Scale Invariance and Base Invariance) say that IF there is a universal law, it must be Benford’s.
They don’t explain why there should be a law to start with!
Complicated Measure Theory
In a nutshell … If you grab numbers from all over the place (a random mix of distributions), their digit frequencies ultimately converge to Benford’s Law
That’s why THIS works well
It doesn’t explain why street addresses of American Men of Science work well!
It doesn’t really explain WHAT will work well, nor why
The Key Idea…
If a hat is covered evenly in red and white stripes…
… it will be half red and half white.
Photo - Eric Pouhier http://commons.wikimedia.org/wiki/Napoleon
The red stripes and the white stripes even out over the shape of the hat
If the red stripes cover half the base, they’ll cover about half the hat
What if the red stripes cover 30% of the base?
[Figure: hat over a base running from 0 to 6; red stripes cover 0 to 0.3, 1 to 1.3, 2 to 2.3, and so on up to 5 to 5.3]
Then they’ll cover about 30% of the hat.
What if the red stripes cover precisely fraction 0.301 of the base?
0.301 = log10(2/1)
[Figure: hat over a base running from 0 to 6; red stripes cover 0 to 0.301, 1 to 1.301, 2 to 2.301, and so on up to 5 to 5.301]
Then they’ll cover fraction ~0.301 of the hat.
Think of X as a random number…
We want the probability that X has first digit = 1
Let the ‘hat’ be a probability density curve for X
Then AREAS on the hat give PROBABILITIES for X
Pr(1 < X < 5) = 0.95
Area = 0.95 from 1 to 5
Total area = 1
In the same way ….
If the red stripes somehow represent the X values with first digit = 1,
and the red stripes have area ~ 0.301,
then Pr(X has first digit 1) ~ 0.301.
So X values with first digit = 1 somehow lie on a set of evenly spaced stripes?
Write X in Scientific Notation:
X = r × 10^n
r is between 1 and 10
n is an integer
For example…
124 = 1.24 × 10^2 (1 ≤ r < 2)
76 = 7.6 × 10^1 (r > 2)
For the first digit of X, only r matters!
X has first digit 1 exactly when 1 ≤ r < 2
Take logs to base 10…
log(X) = log(r × 10^n)
Or in other words…
log(X) = log(r) + n
r is between 1 and 10
n is an integer
X has first digit 1 when 1 ≤ r < 2
i.e. when log(1) ≤ log(r) < log(2)
i.e. when 0 ≤ log(r) < 0.301
X has first digit 1 precisely when log(X) is between n and n + 0.301 for any integer n
X values with first digit = 1 satisfy:
n = 0 : 0 ≤ log(X) < 0.301 (X from 1 to 2)
n = 1 : 1 ≤ log(X) < 1.301 (X from 10 to 20)
n = 2 : 2 ≤ log(X) < 2.301 (X from 100 to 200)
and so on!
STRIPES!!
The ‘hat’ is the probability density curve for log(X)
[Figure: density curve for log(X) over a base running from 0 to 6, with red stripes from n to n + 0.301 for each integer n]
So X values with first digit=1 DO lie on evenly spaced stripes, on the log scale!
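The stripe rule can be checked numerically. This is a small sketch under assumptions, not code from the talk: first_digit and on_red_stripe are illustrative names, the first digit is read from Python's printed form of the number, and the test values are chosen away from stripe edges such as exactly 2 or 200, where floating-point rounding of log10 could tip the comparison either way.

```python
import math

LOG10_2 = math.log10(2)  # about 0.301, the width of each red stripe

def first_digit(x: float) -> int:
    """First significant digit of a non-zero number, read from its printed form."""
    for ch in repr(float(x)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("x must be non-zero")

def on_red_stripe(x: float) -> bool:
    """True when log10(x) lies between n and n + 0.301 for some integer n (x > 0)."""
    log_x = math.log10(x)
    return log_x - math.floor(log_x) < LOG10_2

# Compare the two ways of deciding whether the first digit is 1.
for x in [1.24, 124, 76, 0.0019, 1999, 2001, 150000]:
    print(x, first_digit(x), on_red_stripe(x))
```

For every value in the list, the number sits on a red stripe exactly when its first digit is 1.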
The PROBABILITY of getting first digit 1 is the AREA of the red stripes, which is approximately the fraction of the base that they cover: 0.301.
We’ve done it!
We’ve shown that we really should expect the first digit to be 1 about 30% of the time!
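To see the areas add up, here is a rough sketch that assumes a particular hat: a normal density for log(X) with mean 3 and standard deviation 1. That choice is purely illustrative and is not taken from the talk; the names normal_cdf and red_stripe_area are also made up. The code sums the area under the hat over every stripe from n to n + 0.301.

```python
import math

def normal_cdf(z: float, mean: float, sd: float) -> float:
    """Cumulative probability of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((z - mean) / (sd * math.sqrt(2.0))))

def red_stripe_area(mean: float, sd: float, stripe_width: float = math.log10(2)) -> float:
    """Area under a normal hat for log10(X) lying on stripes from n to n + stripe_width."""
    lowest = math.floor(mean - 10 * sd)   # stripes further out carry negligible area
    highest = math.ceil(mean + 10 * sd)
    return sum(
        normal_cdf(n + stripe_width, mean, sd) - normal_cdf(n, mean, sd)
        for n in range(lowest, highest + 1)
    )

# Assumed hat: log10(X) is normal with mean 3 and standard deviation 1.
print(round(red_stripe_area(mean=3.0, sd=1.0), 4))
```

With this assumed hat the red area agrees with 0.301 to several decimal places, even though no single stripe sits under a flat part of the curve.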
Intuitively…
The log scale distorts: small numbers (e.g. 100) are stretched out; larger numbers (e.g. 900) are bunched up.
The first digit corresponds to regularly spaced stripes on the log scale.
So the smallest numbers (first digit = 1) are stretched out, and get the highest probability!
When is this going to work?
We need a lot of stripes to balance out big ones and little ones!
We get one stripe every integer… So we need a lot of integers!
The distribution of X needs to be WIDE on the log scale!
X ranges from 0 to 6 on the log scale… So it ranges from 1 to 10^6 on the usual scale!
1 .. 2 .. Miss a few ... 999,999 .. 1,000,000
These are Benford’s ‘Outlaw Numbers’!
All we need is a distribution that is:
• WIDE (4 – 6 orders of magnitude or more)
• Reasonably SMOOTH …
Then the red stripes will even out to cover about 30% of the total area.
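The effect of width can be tried out by simulation. The sketch below is an illustration under stated assumptions rather than anything from the talk: log(X) is drawn from a normal distribution centred at 5.8 (an arbitrary choice), the spread sd_log is varied from very narrow to wide, and share_starting_with_1 is a made-up helper name.

```python
import math
import random

def share_starting_with_1(mean_log: float, sd_log: float,
                          n: int = 100_000, seed: int = 1) -> float:
    """Simulated share of X values with first digit 1 when log10(X) is normal."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same answer
    hits = 0
    for _ in range(n):
        log_x = rng.gauss(mean_log, sd_log)
        # First digit is 1 exactly when log10(X) falls on a red stripe.
        if log_x - math.floor(log_x) < math.log10(2):
            hits += 1
    return hits / n

# Narrow hats miss the stripes or catch them unevenly; wide hats even out.
for sd in (0.05, 0.3, 1.0, 2.0):
    print(sd, round(share_starting_with_1(5.8, sd), 3))
```

The narrowest hat sits almost entirely between two stripes, so hardly any values start with 1, while the wide hats land close to the 30% that Benford's Law predicts.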
In Real Life…
World Populations:
From 50 for the Pitcairn Islands …
To 1.3 x 10^9 for China…
Wide (9 integers => 9 stripes)
First digits very good fit to Benford!
Electorate populations? From 583,000 to 773,000 in California:
Of course not! All the first digits are 5, 6, or 7…
The hat has less than one stripe! Benford doesn’t work here.
But naturally occurring populations are a different story!
Cities in California:
- from 94 in the city of Vernon…
- to 3.9 million in Los Angeles…
Yes! It’s Benford!
Wide enough (5 integers => 5 stripes)
Powerball Jackpots?
- from $10 million to $365 million…
Not bad!
Orders of magnitude only 1.5 …
… but sometimes you just hit lucky!
Data with kind permission from www.lottostrategies.com
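The width of each data set on the log scale follows directly from its smallest and largest values. This short sketch uses only the ranges quoted in the talk; the helper name orders_of_magnitude and the dictionary labels are illustrative.

```python
import math

# (smallest, largest) values as quoted in the talk
ranges = {
    "World populations": (50, 1.3e9),
    "California electorates": (583_000, 773_000),
    "California cities": (94, 3.9e6),
    "Powerball jackpots ($)": (10e6, 365e6),
}

def orders_of_magnitude(smallest: float, largest: float) -> float:
    """Width of the range from smallest to largest on the log10 scale."""
    return math.log10(largest / smallest)

for name, (smallest, largest) in ranges.items():
    print(f"{name}: {orders_of_magnitude(smallest, largest):.2f}")
```

The electorates span only a small fraction of one order of magnitude, far less than the others, which is why their first digits cannot spread across 1 to 9.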
Your tax return….?
If you plan to fake data, you should first check whether it ought to be Benford!
BUT the IRD has a few other tricks up its sleeve too….
To find out more:
• A Simple Explanation of Benford’s Law by R. M. Fewster, The American Statistician, to appear. PDF from www.stat.auckland.ac.nz/~fewster/benford.html
• Judy Paterson’s CMCT course, Term 1 2009: Centre for Mathematical Content in Teaching
Thanks for listening!