Probability of a duplicate within a set of random values.

There are a number of situations where you need to know the probability of a duplicate in a set of random values. The main one that comes up for me and which was the inspiration for this page is "How long do my identifiers need to be?"

You can avoid having to check for duplicates when you make your identifiers long enough. Just make them long enough that the odds of a duplicate are acceptably low. For instance, if you expect a maximum of a billion total records over the lifetime of an application, and you can handle a one in a million chance of there ever being a duplicate, then you can look it up here. It turns out to be 88 bits.

Here are the variables involved. Given any two you can calculate the third.

  • Number of Bits - How much information is in each value. A 32 bit random integer, for instance, from a perfect random number source, would have 32 bits.
  • Number of Values - The number of values in the set we are considering. The number of records for which you are generating unique identifiers, for instance.
  • Probability of a Duplicate - The chances of there being a duplicate value in the set.
  • Find probability of a duplicate, given value bit length and number of expected repetitions.

    Find number of bits required for a given probability of a duplicate and a given number of expected repetitions.

    Find number of repetitions possible for a value of a given bit length and probability of a duplicate.

    This form lets you calculate any of the three values based on the other two. Click on the arrow above the value you want to calculate and enter values for the other two values in the boxes. You can enter a plain number (i.e. 100), a percentage (i.e. 5%), a ratio with 1 in the numerator (1:10), e notation (2.5e3), or a power of 10 (i.e. 10^6).

    Number of Repetitions
    Probability of Duplicate
    Number of Bits

    Some additional notes:

  • "How many random bits do I really have?" - when you use 50 bits from a low quality random number source you don't really have 50 bits of randomness. Maybe you only have 45. The number of bits here is only equivalent to the number you are using if you have a perfect random number source, approximations of which can be found at random.org and elsewhere. Simply using the high security random number generator one probably gets quite close but it's important to mention that one should always add a few bits to be safe.
  • People often just fall back to UUIDs because they are huge. How much overkill do they have, exactly? They contain 122 bits, which is enough to have less than a one in a million chance of a single duplicate after generating a quadrillion of them. If you can afford their size, fine. But there is a certain joy in knowing you are only using the bits you actually need.