APPLYING BENFORD’S LAW OF LEADING DIGITS TO LARGE, NATURAL DATA SETS

Upload: thomas-stone

Post on 22-Dec-2015




BACKGROUND OF BENFORD’S LAW

Discovered by Simon Newcomb in 1881 and again by Frank Benford in 1938, Benford’s Law of Leading Digits suggests that in a majority of real-life data sets, the leading digits of data entries are logarithmically, rather than uniformly, distributed.

As such, in a base 10 number system we would expect to observe a leading digit of 1 about 30.1% of the time, whereas we would expect to observe a leading digit of 9 only about 4.6% of the time.


INTRODUCTION TO BENFORD’S LAW

For any positive real number x, we can represent x in scientific notation as

$x = M_B(x) \cdot B^{k(x)}$

where $M_B(x) \in [1, B)$ is called the mantissa of x, and the integer $k(x)$ represents the exponent value.

Benford’s Law of Leading Digits provides us with an expected distribution of the mantissas in a natural data set. According to the law, the probability of observing a data entry beginning with digit d in base B is

$\mathrm{Prob}(\text{first digit is } d) = \log_B(1 + 1/d)$.
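The first-digit formula can be evaluated directly. The following is a minimal sketch; the function name `first_digit_prob` is our own choice for illustration and does not come from the study.

```python
import math

def first_digit_prob(d: int, base: int = 10) -> float:
    """Benford probability that the leading digit equals d in the given base."""
    if not 1 <= d < base:
        raise ValueError("leading digit must satisfy 1 <= d < base")
    # log_B(1 + 1/d), computed with the two-argument form of math.log
    return math.log(1 + 1 / d, base)

# Expected first-digit distribution in base 10
for d in range(1, 10):
    print(d, f"{first_digit_prob(d):.5f}")
```

The nine probabilities sum to 1 because the product of the terms (1 + 1/d) for d = 1 to 9 telescopes to 10.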


BENFORD BASE-10 DISTRIBUTION

For a base 10 number system, we expect to see the following distribution of digits in a data set that satisfies Benford’s Law.

Benford Base-10 Probability by Digit Position

Digit   1st Digit   2nd Digit   3rd Digit   4th Digit
0       -------     0.11968     0.10178     0.10018
1       0.30103     0.11389     0.10138     0.10014
2       0.17609     0.10882     0.10097     0.10010
3       0.12494     0.10433     0.10057     0.10006
4       0.09691     0.10031     0.10018     0.10002
5       0.07918     0.09668     0.09979     0.09998
6       0.06695     0.09337     0.09940     0.09994
7       0.05799     0.09035     0.09902     0.09990
8       0.05115     0.08757     0.09864     0.09986
9       0.04576     0.08500     0.09827     0.09982
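The later-digit columns follow from the leading-digit law by summing over every possible block of preceding digits. The sketch below shows one way to compute them; the helper name `nth_digit_prob` is ours, and the approach is the standard marginalization rather than anything specified in the slides.

```python
import math

def nth_digit_prob(d: int, n: int) -> float:
    """Benford base-10 probability that the n-th significant digit equals d."""
    if n < 1 or not 0 <= d <= 9:
        raise ValueError("need n >= 1 and 0 <= d <= 9")
    if n == 1:
        if d == 0:
            raise ValueError("the first significant digit cannot be 0")
        return math.log10(1 + 1 / d)
    # Sum over every possible block k of the n-1 preceding digits.
    lo, hi = 10 ** (n - 2), 10 ** (n - 1)
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(lo, hi))

# Print the 2nd, 3rd and 4th digit columns
for d in range(10):
    print(d, " ".join(f"{nth_digit_prob(d, n):.5f}" for n in (2, 3, 4)))
```

As n grows, each column approaches the uniform value 1/10, which is why ending digits are compared to a uniform distribution.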


APPLICATIONS OF BENFORD’S LAW

Benford’s Law has been a useful tool for detecting fraud and data irregularities because humans are notoriously bad random number generators. Discrepancies from the Benford distribution may suggest issues in data validity, such as inconsistencies in data collection methods, rounding errors, or even nefarious activities such as fraud or data distortion.

Knowing that the leading digits of the mantissas should be logarithmically distributed (becoming gradually more uniform the further out we progress), we can compare combinations of the first digits and last digits to the Benford and Uniform distributions, respectively, to judge conformity to the expected digit distribution.


COMPARISON OF NATURAL DATA SETS

In this study, we compare two natural data sets and their conformity to the Benford distribution.

The first data set is an example of a natural data set with an extremely good fit for the Benford distribution. This data set is made up of hydrology and streamflow statistics from the U.S. Geological Survey.

The second is a much more controversial data set that demonstrates considerable discrepancies from the digit distribution we would expect if it were truly Benford. This data set was derived from a paper published by Phil Jones and Michael Mann, two of the researchers accused of data distortion in the 2009 Climategate Scandal.


ISSUES ARISING FROM BENFORD ANALYSIS

It is still an open question as to which data sets should conform to Benford’s Law. In general, we expect conformance when the data set is large, spans multiple orders of magnitude, and has a sufficient number of significant digits. However, it is still possible for the data to fail to be Benford without any nefarious activity despite having met these conditions.

Though the Chi-square statistic is the most popular and well-documented measure of fit, it is extremely sensitive when applied to large data sets that have few degrees of freedom: with hundreds of thousands of entries, even very small deviations from the expected proportions produce a significant value, so the statistic tends to overstate the error. For comparison’s sake, we include these values, but rely primarily on the mean absolute deviation (the average absolute difference between the observed and expected proportions, read as a percent deviation from the intended distribution) for our analysis.
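To make the contrast between the two statistics concrete, here is a small sketch. The definition of mean absolute deviation used (the mean over categories of the absolute difference between observed and expected proportions) is our reading of the description above, and the counts are hypothetical, chosen only to show how each statistic scales with sample size.

```python
def chi_square(observed, expected_probs):
    """Pearson chi-square statistic for observed counts against expected proportions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_probs))

def mean_abs_deviation(observed, expected_probs):
    """Mean over categories of |observed proportion - expected proportion|."""
    n = sum(observed)
    return sum(abs(o / n - p) for o, p in zip(observed, expected_probs)) / len(observed)

expected = [0.5, 0.5]
small = [510, 490]        # hypothetical counts, 1,000 entries
large = [51000, 49000]    # same proportions, 100,000 entries

for counts in (small, large):
    print(sum(counts), chi_square(counts, expected), mean_abs_deviation(counts, expected))
```

Both samples are one percentage point away from the expected proportions, so the mean absolute deviation is identical, while the Chi-square value grows in proportion to the number of entries.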


INTENTIONS

The primary goal of this study is to get a sense of when Benford’s Law should hold for natural data sets. As such, discrepancies from Benford’s Law need not indicate fraud or nefarious activity, and it is not our intent to accuse anyone of such behavior; our goal is to see whether or not certain data sets follow Benford’s Law, and comment on the results.


HYDROLOGY STATISTICS: BACKGROUND AND DATA DESCRIPTION

This data set consists of streamflow statistics from the U.S. Geological Survey.

The characteristics of this data make it ideal for a Benford analysis:
- The data spans a time period of 130 years.
- The data set is the largest analyzed in Benford literature to date.
- The data set spans nine orders of magnitude.
- The methods employed to measure stream flow have not changed at all during the time period, suggesting that there will be no distortions due to data collection method changes.


HYDROLOGY STATISTICS: DESCRIPTION OF BENFORD TESTS

In a previous study by Miller-Nigrini (2008), a first two-digit analysis was performed on this data set, displaying a very close fit to the Benford distribution.

In this study, we analyzed the distribution of the first three digits. The statement of Benford’s Law may be revised as follows to predict the probability of observing a data entry beginning with a given three-digit combination:

$\mathrm{Prob}(\text{first three digits are } d_1 d_2 d_3) = \log_{10}\left[1 + \frac{1}{100 d_1 + 10 d_2 + d_3}\right]$
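A first-three-digits test needs two pieces: the expected probability for each block from 100 to 999, and a way to pull the leading block out of each entry. The sketch below assumes entries are available as decimal strings; the helper names are hypothetical and the slides do not describe the actual tooling used.

```python
import math
from decimal import Decimal

def leading_block_prob(block: int) -> float:
    """Benford base-10 probability that a value begins with the digit block, e.g. 314."""
    if block < 1:
        raise ValueError("block must be a positive integer without a leading zero")
    return math.log10(1 + 1 / block)

def first_k_digits(value: str, k: int = 3):
    """First k significant digits of a decimal string, or None if fewer are recorded."""
    digits = Decimal(value).as_tuple().digits  # leading zeros are not included
    if len(digits) < k or digits[0] == 0:
        return None
    return int("".join(str(d) for d in digits[:k]))

print(first_k_digits("0.03141"), first_k_digits("1234.5"), first_k_digits("12"))
print(f"{leading_block_prob(314):.6f}")
```

Summing `leading_block_prob` over all blocks from 100 to 999 gives 1, so the 900 three-digit combinations form a complete distribution.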


HYDROLOGY STATISTICS: RESTRICTING THE DATA SET

Due to potential rounding discrepancies, we initially wanted to include only numbers with at least four significant digits so that the first three would be unaltered by rounding. However, by pruning our data set to include only values for which we could trust the first three digits, we limited ourselves to a mere 16.1% of our original 457,440 data entries, all within a single order of magnitude. This resulted in a strange, non-Benford distribution.

Having restricted our data set so thoroughly, we could not conclude that our data set was truly non-Benford. Therefore, we decided to ignore the limitation on significant digits and perform a Benford analysis on a larger portion of the data set under the assumption that any rounding errors would “smooth out” over such a large data set. This enabled us to look at a data set with over 400,000 entries spanning six orders of magnitude.

We compare and contrast these two subsets of the hydrology statistics in the following slides.
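The restriction step can be sketched as a filter on the number of recorded significant digits, followed by a count of the orders of magnitude that remain. This is an illustration only: it assumes values are stored as decimal strings, treats trailing zeros as significant, and uses hypothetical sample values rather than the streamflow data.

```python
from decimal import Decimal

def significant_digits(value: str) -> int:
    """Number of recorded significant digits (trailing zeros counted, zero itself gives 0)."""
    d = Decimal(value)
    return 0 if d.is_zero() else len(d.as_tuple().digits)

def order_of_magnitude(value: str) -> int:
    """Exponent of the leading digit, e.g. 2 for 123.4 and -2 for 0.05."""
    return Decimal(value).adjusted()

def restrict(values, min_sig: int = 4):
    """Keep only the values with at least min_sig significant digits."""
    return [v for v in values if significant_digits(v) >= min_sig]

def distinct_magnitudes(values) -> int:
    """Count the distinct orders of magnitude among the nonzero values."""
    return len({order_of_magnitude(v) for v in values if not Decimal(v).is_zero()})

sample = ["1234", "12.5", "0.873", "56780", "3.141"]  # hypothetical entries
kept = restrict(sample)
print(kept, distinct_magnitudes(sample), distinct_magnitudes(kept))
```

Even in this tiny sample, the filter removes whole orders of magnitude along with the low-precision entries, which is the effect described above.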


HYDROLOGY STATISTICS: COMPARISON OF RESTRICTED VS. UNRESTRICTED DATA

[Figure panels: Restricted First 3-Digits; Unrestricted First 3-Digits]

In this comparison, we can see that the unrestricted data set displays a much better conformance to the Benford distribution (shown in pink). This may be attributed in part to the difference in size, but is primarily due to the presence of data entries spanning an extra five orders of magnitude in the unrestricted data as opposed to the restricted set.


HYDROLOGY STATISTICS: MEASURING CONFORMANCE TO BENFORD’S LAW

The following table reports the Chi-square and mean absolute deviation values for Benford tests of the first, first two, and first three (both restricted and unrestricted) digits. Again, we treat the significant Chi-square values with caution, as several of our data sets contain over 400,000 values.

Benford Test                        Chi-Square   Mean Absolute Deviation
First Digit                         45.82        0.00086
First Two Digits                    178.74       0.00017
First Three Digits (Restricted)     12054.70     0.00039
First Three Digits (Unrestricted)   23345.30     0.00020


HYDROLOGY STATISTICS: CONCLUSIONS

Though both data sets display a relatively good fit for the Benford distribution, we notice that ignoring our initial limitation on the number of significant digits (thereby giving us a much larger data set spanning five additional orders of magnitude) gives us a better value for the mean absolute deviation. Our unrestricted data set is a mere 0.02% away from what we would expect in a Benford distributed data set.

                        Restricted   Unrestricted
Size of Data Set        73,828       446,055
Orders of Magnitude     1            6
# Significant Digits    ≥4           ≥3
Mean Abs. Deviation     0.00039      0.00020


CLIMATE DATA: BACKGROUND AND DATA DESCRIPTION

A massive email leak at the Climatic Research Unit in November 2009 led to allegations of scientific misconduct and data distortion in the climate science community. The scandal soon earned the title “Climategate”.

The data set analyzed consisted of data from a paper published by Phil D. Jones and Michael Mann (two of the researchers accused in the scandal) in 2004, titled “Global Surface Temperatures Over the Past Two Millennia”.

The data set contains a total of 32,451 observations (measured as deviations from an average temperature); this data set was further broken down into 30 data subsets (ranging in size from 335 to 1991 entries) covering different regions of the world.


CLIMATE DATA: DESCRIPTION OF BENFORD TESTS

Because these data entries were measured as deviations from an average temperature, the option of a first-digit analysis was discarded (due to the presence of so many data entries beginning with 0).

Instead, the last two digits were analyzed in four different Benford tests:
- Endings: Compares each of the 100 possible last two-digit combinations to the expected uniform probability, 1/100.
- Non/Doubles: Compares the proportion of total non-double endings to 9/10 and the proportion of total double-digit endings to 1/10.
- Non/Doubles(Split): Compares the proportion of total non-double endings to 9/10 and the proportion of each double-digit ending to 1/100.
- Doubles(Conditional): Evaluates the double-digit ending proportions conditionally (given that a double occurs), comparing each double-digit ending combination to 1/10 of the total double-digit endings.
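The four tests differ only in how the 100 ending counts are grouped and which expected proportions they are compared to. A minimal sketch of that grouping follows, assuming each entry is available as a decimal string with at least two recorded digits; the function names and sample values are hypothetical.

```python
from collections import Counter

def ending(value: str) -> int:
    """Last two recorded digits of a decimal string, ignoring sign and decimal point."""
    digits = [c for c in value if c.isdigit()]
    if len(digits) < 2:
        raise ValueError("need at least two recorded digits")
    return int("".join(digits[-2:]))

def last_two_digit_tests(values):
    """Return {test name: (observed counts, expected proportions)} for the four tests."""
    counts = Counter(ending(v) for v in values)
    endings = [counts[e] for e in range(100)]
    doubles = [endings[11 * d] for d in range(10)]  # 00, 11, ..., 99
    non_doubles = sum(endings) - sum(doubles)
    return {
        "Endings": (endings, [1 / 100] * 100),
        "Non/Doubles": ([non_doubles, sum(doubles)], [9 / 10, 1 / 10]),
        "Non/Doubles(Split)": ([non_doubles] + doubles, [9 / 10] + [1 / 100] * 10),
        "Doubles(Conditional)": (doubles, [1 / 10] * 10),
    }

sample = ["0.477", "-1.200", "0.311", "2.045", "0.123"]  # hypothetical entries
tests = last_two_digit_tests(sample)
print(tests["Non/Doubles"][0], tests["Doubles(Conditional)"][0])
```

Each (observed, expected) pair can then be passed to whichever goodness-of-fit statistic is being reported, such as Chi-square or mean absolute deviation.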


CLIMATE DATA: DISTRIBUTION OF DOUBLE-DIGIT ENDING COMBINATIONS

In an amalgamation of all 30 data subsets, we observe a significant spike of values ending in the double digit ending combination 77, and a deficit of values ending in 00.


CLIMATE DATA: ANALYSIS OF CLIMATE DATA AMALGAMATION

In an analysis of all 32,451 data entries, we see a 3.93% deviation from our expected Uniform distribution in the Doubles (Conditional) test.

An issue that arises in the climate data analysis is the fact that with only three significant digits, we would not expect the last two digit distribution to be entirely uniform, as we have not progressed far enough out in the mantissa to ensure uniformity. We should see a distribution that is slightly biased toward lower values, though less so than a first digit Benford distribution.

Benford Last Two-Digits Test

Statistic        Endings    Non/Doubles   Non/Doubles(S)   Doubles(C)
Chi-Square       7288.75    6.34          594.49           613.89
Mean Abs. Dev.   0.00424    0.00419       0.00387          0.03927
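The expected bias toward lower endings can be quantified under one simplifying assumption: if every entry had exactly three significant digits, the last two digits would be the second and third significant digits, and their Benford probability is obtained by summing the three-digit law over the nine possible first digits. The sketch below works under that assumption, which the climate data may not satisfy exactly; the function name is ours.

```python
import math

def second_third_digit_prob(e: int) -> float:
    """Benford base-10 probability that significant digits 2 and 3 form the pair e (0..99)."""
    if not 0 <= e <= 99:
        raise ValueError("e must be a two-digit ending between 0 and 99")
    # Marginalize over the nine possible first digits.
    return sum(math.log10(1 + 1 / (100 * d1 + e)) for d1 in range(1, 10))

probs = [second_third_digit_prob(e) for e in range(100)]
print(f"P(00) = {probs[0]:.5f}, P(99) = {probs[99]:.5f}, uniform = {1 / 100:.5f}")
```

The probabilities decrease steadily from 00 to 99 and straddle 1/100, so a strictly uniform reference slightly understates the expected share of low endings and overstates the share of high ones.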


CLIMATE DATA: ANALYSIS OF INDIVIDUAL DATA SUBSETS

Due to large discrepancies in the number of times that particular ending digit combinations were observed, we chose to analyze each of the thirty data subsets individually. This uncovered a number of subsets with ending digit distributions that seemed to be outside the realm of random chance. We include the double digit ending statistics for two of these strange subsets (the Western US Unsmoothed and Tasmania Unsmoothed data sets) below:

Data Set   #Entries   00   11   22   33   44   55   66   77   88   99   #Doubles
West. US   1781       4    6    4    5    1    8    0    38   0    24   90
Tasmania   1991       57   80   64   57   0    0    0    0    0    0    258

In our next few slides, we provide an example of the analysis that was performed on each of the thirty data subsets, using the data from these two subsets.
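The Non/Doubles and Doubles(Conditional) statistics for these two subsets can be recomputed from the counts in the table above. The sketch assumes the mean absolute deviation is the mean over categories of the absolute difference between observed and expected proportions; with that definition, the values it produces agree with the ones reported for the two subsets.

```python
def chi_square(observed, expected_probs):
    """Pearson chi-square statistic for observed counts against expected proportions."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_probs))

def mean_abs_deviation(observed, expected_probs):
    """Mean over categories of |observed proportion - expected proportion|."""
    n = sum(observed)
    return sum(abs(o / n - p) for o, p in zip(observed, expected_probs)) / len(observed)

# Double-digit ending counts (00, 11, ..., 99) and subset sizes from the table
subsets = {
    "West. US": (1781, [4, 6, 4, 5, 1, 8, 0, 38, 0, 24]),
    "Tasmania": (1991, [57, 80, 64, 57, 0, 0, 0, 0, 0, 0]),
}

results = {}
for name, (n_entries, doubles) in subsets.items():
    non_doubles = [n_entries - sum(doubles), sum(doubles)]
    results[name] = {
        "Non/Doubles": (chi_square(non_doubles, [0.9, 0.1]),
                        mean_abs_deviation(non_doubles, [0.9, 0.1])),
        "Doubles(C)": (chi_square(doubles, [0.1] * 10),
                       mean_abs_deviation(doubles, [0.1] * 10)),
    }
    for test, (chi2, mad) in results[name].items():
        print(f"{name:9s} {test:12s} chi-square={chi2:8.2f}  MAD={mad:.5f}")
```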


CLIMATE DATA: ANALYSIS OF SUBSETS – WESTERN US UNSMOOTHED

If this data subset were truly Benford, we would expect to see a slight bias toward the lower double-digit combinations, and a more uniform decrease as the ending digit combinations increase. When the last two-digit analysis was expanded to include all 100 possible ending combinations, we observed a random scattering of large numbers of occurrences interspersed with 17 ending combinations that did not occur a single time.

Benford Last Two-Digits Test

Statistic        Endings    Non/Doubles   Non/Doubles(S)   Doubles(C)
Chi-Square       1399.07    48.42         125.23           152.00
Mean Abs. Dev.   0.00771    0.04947       0.01169          0.09778

In addition to the non-Benford pattern of ending digit combinations, we have a 9.78% deviation from the distribution of double-digit endings (Doubles(C)) that we would expect if this subset were Benford.


CLIMATE DATA: ANALYSIS OF SUBSETS – TASMANIA UNSMOOTHED

As seen in the previous table, this data subset demonstrates a strong bias towards lower double-digit ending combinations. Originally, we suspected that this anomaly might be due to a lack of range (i.e. if our range covered only the interval [0,0.4], we would not expect to observe any ending combinations above 40). However, our range covers the interval [-4.43,3.59], and includes ending digit combinations ranging from 00 to 99.

An expansion of the analysis to include all 100 possible two-digit ending combinations demonstrates an impressive lack of pattern, with 46 ending combinations being observed not a single time, while others are observed as many as 80 times. A sample of this data is included below:

Ending   00   01   02   03   04   05   06   07   08   09   10   11   12   13   14
Count    57   0    0    72   2    0    79   0    49   2    0    80   6    0    66


CLIMATE DATA: ANALYSIS OF SUBSETS – TASMANIA UNSMOOTHED (CONT.)

Though the majority of the last two-digit tests reveal only a 1 or 2% discrepancy from the expected distribution, our Doubles (Conditional) test reports a deviation of 12.00% from the Uniform distribution that we would expect if this data set were truly Benford.

Benford Last Two-Digits Test

Statistic        Endings    Non/Doubles   Non/Doubles(S)   Doubles(C)
Chi-Square       3261.49    19.36         538.58           400.68
Mean Abs. Dev.   0.01134    0.02958       0.01629          0.12000


CLIMATE DATA: CONCLUSIONS

An identical analysis performed on all thirty data subsets revealed multiple cases of disparities from the uniform last two-digit distribution to which they were compared. Even with significant deviations in several of the subsets, we would expect that if the data were truly Benford, these discrepancies would smooth out in an amalgamation of all 32,451 values.

As mentioned previously, we currently have no way of determining if this data set should conform to Benford’s Law. Though the data set is large, spans multiple orders of magnitude, and reports data values to three significant digits, it is still entirely possible that the deviations are due to non-nefarious factors, such as rounding errors, discrepancies in data collection methods, or simply non-Benford behavior.


CONCLUSIONS

In this study we have seen two natural data sets whose conformance to Benford’s Law varies drastically: a set of hydrology statistics whose leading digit distribution is a very close fit, and a set of controversial climate statistics whose ending digit distribution reveals many discrepancies from a Benford data set.

It is still an open question as to which data sets should conform to Benford’s Law; though many researchers believe that this law is a characteristic intrinsic to our number system, there is no set of criteria that guarantees conformance to the expected leading digit distribution.

It has been our goal to provide an in-depth Benford analysis of several large natural data sets, to demonstrate Benford techniques, and to address common issues that arise in a Benford analysis.