Probability of a duplicate within a set of random values.
There are a number of situations where you need to know the probability
of a duplicate in a set of random values. The main one that comes up
for me and which was the inspiration for this page is
"How long do my identifiers need to be?"
You can avoid having
to check for duplicates when you make your identifiers long enough.
Just make them long enough that the odds of a duplicate are acceptably
low. For instance, if you expect a maximum of a billion total records
over the lifetime of an application, and you can handle a one in a
million chance of there ever being a duplicate, then you can look it
up here. It turns out to be 88 bits.
Here are the variables involved. Given any two you can calculate the
third.
Number of Bits - How much information is in each
value. A 32 bit random integer, for instance, from a perfect
random number source, would have 32 bits.
Number of Values - The number of values in the set
we are considering. The number of records for which you are generating
unique identifiers, for instance.
Probability of a Duplicate - The chances of there being
a duplicate value in the set.
Find probability of a duplicate, given value bit length and
number of expected repetitions.
Find number of bits required for a given probability of a duplicate
and a given number of expected repetitions.
Find number of repetitions possible for a value of a given bit length
and probability of a duplicate.
This form lets you calculate any of the three values based on the other two.
Click on the arrow above the value you want to calculate and enter values
for the other two values in the boxes. You can enter a plain number (i.e. 100), a
percentage (i.e. 5%), a ratio with 1 in the numerator (1:10),
e notation (2.5e3), or a power of 10 (i.e. 10^6).
Number of Repetitions
Probability of Duplicate
Number of Bits
Some additional notes:
"How many random bits do I really have?" - when you use 50 bits
from a low quality random number source you don't really have 50
bits of randomness. Maybe you only have 45. The number of bits
here is only equivalent to the number you are using if you have a
perfect random number source, approximations of which can be found
at random.org and elsewhere. Simply using the high security random
number generator one probably gets quite close but it's important
to mention that one should always add a few bits to be safe.
People often just fall back to UUIDs because they are huge. How
much overkill do they have, exactly? They contain 122 bits, which
is enough to have less than a one in a million chance of a single
duplicate after generating a quadrillion of them. If you can afford
their size, fine. But there is a certain joy in knowing you are only using
the bits you actually need.