Wednesday, January 30, 2008

Benford's Law

While updating my page of DSP links, I noticed a new chapter in The Scientist and Engineer's Guide to Digital Signal Processing about the solution to Benford's law. This phenomenon was first noticed in 1881 by Simon Newcomb, but the rediscovery by Frank Benford in 1938 gave it its name. Both noticed something odd about books containing tables of logarithms: pages of data with 1 as the first digit were more worn than pages starting with other digits. Benford found the same pattern in a wide variety of data sets, such as numbers in magazine articles and baseball statistics. It seems the data should be random, with the first digit equally likely to be any digit from 1 to 9, or 11.1% for each. In fact, the pattern of leading digits fits a logarithmic distribution, with 1 having a 30.1% probability.
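That logarithmic distribution is easy to compute directly: the probability that the leading digit is d is log10(1 + 1/d). A quick sketch in Python (my own, not from Smith's chapter):

```python
import math

def benford_probability(d):
    """Probability that d (1-9) is the leading digit under Benford's law."""
    return math.log10(1 + 1 / d)

# Print the full distribution: 1 gets 30.1%, 9 only about 4.6%.
for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")
```

The nine probabilities sum to log10(10) = 1, as a proper distribution must.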

This phenomenon was observed but not understood until recently, and that is its connection to DSP. A proof can be constructed using signal analysis techniques, in particular convolution, the Fourier transform, and homomorphic processing. See Steven Smith's new chapter for details, but you may not need the mathematics to understand the mystery. According to Smith, the key is to realize that we transform the data when we extract the first digit. Since it is so easy to see the first digit, we do not realize that we are performing any operation on the data. This operation amounts to taking the anti-logarithm of the data, and that gives the leading digits their logarithmic pattern. Wikipedia has a few other explanations as well, but I think there is a counterexample that is much easier to understand:

If we assume that a random event has a random leading digit, then we are saying that a leading 1 has an 11% probability. That would be true for a single die with 9 sides, since each side corresponds to a different leading digit. But what about a die with 20 sides? Eleven of the 20 outcomes (1 and 10 through 19) start with 1, for a 55% probability. For a 7-sided die it would be about 14%. In fact, I think you will find the probability is never less than 11% for any number of sides, while in many cases it is greater. So for data that can take any value between 1 and N, the probability of a leading 1 must be at least 11%.
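This claim is easy to check by brute force. The sketch below (my own, not from the original post) counts how many of the numbers 1 through N have a leading 1, for every die size N up to 5000, and finds the minimum probability:

```python
def leading_one_probability(n_sides):
    """Probability that a fair die with faces 1..n_sides shows a number
    whose leading digit is 1."""
    count = sum(1 for face in range(1, n_sides + 1) if str(face)[0] == "1")
    return count / n_sides

print(leading_one_probability(20))   # 0.55, as in the 20-sided example
print(leading_one_probability(7))    # about 0.143 for the 7-sided die
# The minimum over all die sizes is exactly 1/9, reached at 9, 99, 999, ...
worst = min(leading_one_probability(n) for n in range(1, 5001))
print(worst)
```

The minimum comes out to exactly 1/9 (11.1%), hit only at sizes like 9, 99, and 999 where every digit is equally represented; every other size favors 1.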

This is not a complete proof, and even if true, the counterexample does not address events like the roll of two or more dice, or fractional data, but it is enough to remove my initial disbelief in Benford's law. If books with tables of numbers are worn most on pages with data that has a leading 1, it could just be that the type of data being analyzed follows this 1-to-N pattern. I would imagine that many types of data sets have this property.
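To see how mixing many 1-to-N data sets skews the leading digits, here is a small simulation (my own sketch, with an arbitrary cap of 1000 on N): each sample first picks a random range size N, then draws a value uniformly from 1 to N, mimicking a table built from many different ranges.

```python
import random

def simulate_leading_ones(trials, max_n=1000, seed=1):
    """Fraction of samples whose leading digit is 1, when each sample is
    drawn uniformly from 1..N for a randomly chosen range size N."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        n = rng.randint(1, max_n)   # pick a range size at random
        value = rng.randint(1, n)   # draw uniformly from that range
        if str(value)[0] == "1":
            hits += 1
    return hits / trials

print(simulate_leading_ones(100_000))  # well above the naive 11.1%
```

In my runs the fraction lands around 0.22, far above the 11.1% a uniform leading digit would give, though still short of the full logarithmic distribution.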

For more information about Benford's law, I recommend the pages on MathWorld and Wikipedia. It was also featured in "The Running Man", an episode of the TV show NUMB3RS.
