You’ve Probably Been Doing Random Wrong

The random function rand() appears in many programming languages and tools, everything from Microsoft Excel to R to SQL. This post isn’t about the quality of random number generators; it’s about the distribution of the random numbers they produce. (Although if you are curious, here is a very interesting video about random number generators in C++.)

Normally (*) the random numbers are distributed evenly throughout the range. However, for most situations, I would say that the data should follow Benford’s law (see the Wikipedia article if you are curious), under which a value’s leading digit d turns up with probability log10(1 + 1/d): about 30% of values start with 1 and fewer than 5% start with 9. If you are replicating “natural” processes through a system or if, like me, you are randomly selecting from a ranked list, you want to follow a distribution like Benford’s law.

I’m a big fan of Benford’s law as a quick and dirty way to check numbers as they come into a project. As a practical first check, the spread of first-digit frequencies can tell you right away whether the data looks good. Many data sets in the wild don’t exactly follow the law; Wikipedia gives good examples of where it won’t apply.
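
For anyone who wants to try that first-digit check, here is a rough Python sketch. It compares the observed share of each leading digit in a data set with the share Benford’s law predicts, log10(1 + 1/d) for digit d. The function names are made up for illustration, and how large a gap should worry you is a judgment call this sketch does not make.

```python
import math
from collections import Counter


def benford_expected():
    """Share of values Benford's law predicts for each leading digit 1-9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Return the first non-zero digit of x, or None if there is none (0, inf, nan)."""
    for ch in repr(abs(x)):
        if ch in "123456789":
            return int(ch)
    return None


def first_digit_shares(values):
    """Observed share of each leading digit among the usable values."""
    digits = [d for d in map(first_digit, values) if d is not None]
    if not digits:
        raise ValueError("no values with a leading digit")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


def largest_gap(values):
    """Biggest absolute difference between observed and expected shares."""
    expected = benford_expected()
    observed = first_digit_shares(values)
    return max(abs(observed[d] - expected[d]) for d in range(1, 10))
```

Calling largest_gap() on a column of incoming numbers gives a single figure to eyeball. Powers of two, such as [2 ** n for n in range(1, 200)], are a classic series that tracks the law closely and make a handy sanity test.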

Some programming languages do allow you to change the distribution pattern with a function. There is, however, a quick and easy way to change any basic rand() function into a better distribution:

(10^rand()) / 10
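
Here is a minimal Python sketch of that formula, assuming rand() returns a uniform value in [0, 1) the way Python’s random.random() does. Under that assumption 10^rand() lands in [1, 10) with a Benford-distributed leading digit, so after dividing by 10 the result sits between 0.1 and 1 rather than between 0 and 1. The names benford_rand and leading_digit_shares are mine, chosen for illustration.

```python
import random
from collections import Counter


def benford_rand(rng=random):
    """(10 ** rand()) / 10: a value in [0.1, 1) that favors the low end."""
    return (10 ** rng.random()) / 10


def leading_digit_shares(samples):
    """Share of samples whose first significant digit is 1, 2, ..., 9."""
    # Each sample is in [0.1, 1), so int(x * 10) is its leading digit.
    # The clamp only guards against floating-point rounding at the edges.
    counts = Counter(min(9, max(1, int(x * 10))) for x in samples)
    return {d: counts[d] / len(samples) for d in range(1, 10)}


rng = random.Random(42)  # seeded only so repeated runs give the same numbers
samples = [benford_rand(rng) for _ in range(100_000)]
for digit, share in leading_digit_shares(samples).items():
    print(digit, round(share, 3))
```

With this many samples the share for digit 1 should come out close to 30% and the share for digit 9 a little under 5%, in line with Benford’s law.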

Back to my example, where I am pulling at random from a ranked list. By using this quick and dirty adjustment to the distribution, I’m able to pull the lower ranked entries at a frequency which is more in line with what their usage will be. This better simulates what will really happen while still choosing at random.
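
One way this could look in Python is sketched below. The step that turns the adjusted value into a list position is an assumption, since the post does not spell it out: here the [0.1, 1) range is stretched back onto [0, 1) and scaled by the list length, which makes entries near the front of the list come up most often. The names pick_ranked and ranked are hypothetical.

```python
import random
from collections import Counter


def pick_ranked(items, rng=random):
    """Pick one entry from a ranked list, favoring entries near the front."""
    x = (10 ** rng.random()) / 10          # the adjusted value, in [0.1, 1)
    fraction = (x - 0.1) / 0.9             # assumption: stretch back onto [0, 1)
    index = int(fraction * len(items))
    index = min(len(items) - 1, max(0, index))  # guard against rounding at the edges
    return items[index]


ranked = ["first", "second", "third", "fourth", "fifth"]  # hypothetical ranked list
rng = random.Random(7)  # seeded only so repeated runs give the same numbers
counts = Counter(pick_ranked(ranked, rng) for _ in range(50_000))
for name in ranked:
    print(name, counts[name])
```

If your list is ordered the other way around, reverse it first or pick from the far end instead; the adjustment itself does not care which direction counts as the top.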

 

(*) – yeah that is punny