In many applications, generating a unique user identifier is a common task that is frequently taken for granted. A simple, effective way to generate user identifiers is to generate a large random number, and let the properties of small probabilities all but guarantee that each generated number will be different from any previously generated identifiers (see: UUIDv4, or cryptographic nonces).

Suppose that you don’t feeling like breaking out the pencil, paper, and brain to work out the collision probabilities as the size of your user base increases. Instead, you reach for a nearby, relatively crumb-free keyboard and decide to amuse yourself by writing a simulation.

The program is simple. It executes the following:

  1. Create a map / dict / object data structure to serve as an associative array.
  2. Set the global collision counter to 0.
  3. Generate a random 53-bit random integer (in order to allow parity with Javascript, which only allows 53-bit integers).
  4. Check if this number exists in the associative array. If it does not, add it to the array. If it does, increment the global collision counter.
  5. Jump to step 3 until the program crashes due to hitting a 2 GB memory limit.

This is clearly a simple program that can be written in multiple ways (for example, by using each language’s set data structure). However, the point of this micro-benchmark is to roughly answer the following questions for a number of languages:

  1. How many unique integers can we stuff into an associative array before the program crashes?
  2. How much time does each language take to get to its crash point?

And of course, to see how many collisions are actually encountered.

The contenders are as follows. Before looking at the results, try to formulate your guesses for the answers to the questions above for each language.

Results

The graph below shows the number of unique integers put into the associative array before the program either hit the 2 GB memory limit, or crashed for other reasons.

The x-axis is time since the program was started, and the y-axis is the number of integers in the associative map in millions. In this graph, lines terminate when the program crashes. Steeper lines represent faster associative array performance, since a lookup and insertion is performed at each iteration of the loop.

As for the original intent of this expedition, the total number of collisions across all runs for all languages was precisely 0.

Code and raw results

Github

Notes