One method that can sometimes be applied to a set of numbers to see if they are genuine or made up is Benford’s Law. I came across it some years ago, and Tim Harford mentioned it this weekend in relation to the data the Greek government supplied for their submission to join the eurozone:
In the late 1990s, eurozone wannabes squeezed and stretched to meet the criteria for accession, including low inflation and government deficits, and moderate levels of debt. The criteria were somewhat irksome, especially for an economy such as Greece, but nevertheless the Greeks seemed to comply…Eventually, it became clear that the Greek numbers did not quite add up.
The law applies to sets of numbers (e.g., lengths of rivers in a country, line items in a company’s cost accounts) that don’t follow any particular statistical distribution, and ideally span several orders of magnitude (e.g., from 5km to 5,000km, or £0.50 to £500,000). It also presumes that the sample you take has no selection effects (see here for a recent catchy BBC headline regarding the London riots that I’m fairly sure contains some stunning selection bias).
Specifically, Benford’s Law has something to say about how many times the digits 1 to 9 should be the leading digit of the numbers in your set. So, items costing £4.00, £412.99, and £4,999.00 all start with the digit 4, and you’d expect around 10% of your set to start with a 4.
That doesn’t seem so remarkable. With just 9 leading digits to pick from, of course you’d expect each to appear about one time in nine, or roughly 11% of the time. But in fact no: the lower the digit, the more often you should see it. So, numbers in your set should begin with a 1 around 30% of the time, falling to around 10% for a 4, with 9s showing up only around 5% of the time. Here is a chart I lifted from Wikipedia, regarding the population of the world’s countries. The red bars are the data, and the black dots are what Benford’s Law predicts:
Source: Wikipedia (here is a thought about the number of Wikipedia articles from an earlier post)
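To make those percentages concrete, here is a minimal Python sketch. It uses the usual statement of Benford’s Law, that a leading digit d should appear with probability log10(1 + 1/d); the post itself doesn’t quote that formula. The function names are my own, and the powers of 2 used as a test set are only an illustration (they are well known to follow the law closely), not data from anywhere above.

```python
import math
from collections import Counter


def benford_expected():
    """Predicted share of each leading digit d = 1..9: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First non-zero digit of a non-zero number (412.99 -> 4, 0.05 -> 5)."""
    # Scientific notation always starts with the leading digit;
    # 17 significant digits avoids rounding 9.99... up to 10.
    return int(f"{abs(x):.16e}"[0])


def observed_shares(values):
    """Share of the values that start with each digit 1..9 (zeros skipped)."""
    digits = [leading_digit(v) for v in values if v != 0]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


if __name__ == "__main__":
    # Illustrative test set (an assumption, not from the post): 2, 4, 8, ...
    sample = [2 ** n for n in range(1, 101)]
    expected = benford_expected()
    observed = observed_shares(sample)
    for d in range(1, 10):
        print(f"{d}: expected {expected[d]:.1%}, observed {observed[d]:.1%}")
```

Pointing observed_shares at a real set of figures and comparing the result with benford_expected is the crude version of the check. A large gap is a reason to look harder at the numbers, not proof that anyone made them up.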
Using ideas like this to investigate sets of numbers reminds me of a project, at my first employer, from many years ago. A bank had participated in some collaboration with another consultancy, whereby many banks submitted their trade data (number of trades, fees charged, etc.). The consultant then calculated some statistics for each bank, such as average cost per trade, and shared them anonymously back with all the participants. That should have been the end of it, except that the consultant had left more information in his spreadsheet than intended: the data was given to many decimal places, albeit formatted to show just one. One of the banks came to my company with that, and asked what could be done.
Now, if you’ve got an average cost per trade of £2.08375638201, you can write a VBA programme that cycles through all the possibilities and back-calculates which two numbers would have produced that answer: the total revenue (of the format £xxxxx.xx, as shown in one of the participants’ annual reports) divided by the number of trades (an integer). Cute.
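The original was VBA and is long gone, so what follows is only a Python sketch of the same idea, not the programme we actually wrote. It assumes the leaked average is the revenue (in whole pence) divided by the number of trades, rounded half-up to however many decimals were leaked. The function name, the rounding rule, and the upper limit on trades are all my own assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP


def back_calculate(average, max_trades):
    """Find (trades, revenue) pairs whose ratio rounds to the leaked average.

    `average` is passed as a string, e.g. "2.08375638201", so that no
    digits are lost to binary floating point.
    """
    target = Decimal(average)
    pence = Decimal("0.01")
    matches = []
    for trades in range(1, max_trades + 1):
        # The only revenue in whole pence that could give this average.
        revenue = (target * trades).quantize(pence, rounding=ROUND_HALF_UP)
        # Recompute the average to the same number of decimals as the
        # target (quantize copies the exponent of its argument).
        implied = (revenue / trades).quantize(target, rounding=ROUND_HALF_UP)
        if implied == target:
            matches.append((trades, revenue))
    return matches
```

Any genuine pair turns up along with its multiples (twice the trades and twice the revenue give the same average), so the list still has to be matched against a revenue figure from an annual report to pick out the bank. The more decimal places that leak, the fewer coincidental matches survive.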