Leaky Hash Tables


In a recent project, I've had to keep a set of elements in memory and a hash table was a possible candidate.
I ran a very simple benchmark on insertion times, using a C hashtable library.
This benchmark inserts 8-byte elements in a hash table, for numeric keys ranging from 1 to 20 million while measuring the insertion time for each element.

hash table insertion times

This graph reveals the internal structure of the hash table, which expands when a certain number of items is reached. The values for different thresholds are often prime numbers, or powers of 2.

All inserts were under 30 milliseconds, except for the following values:

ElementsInsertion time (ms)Time change
127,80032
255,60860×1.875
511,18393×1.55
1,022,366194×2.08
2,044,732415×2.14
4,089,456821×1.98
8,178,8971709×2.08
16,357,7993621×2.12

Here, powers of 2 times a thousand elements. Other algorithms manage to avoid this kind of problematic behaviour.

Abstractions can be considered leaky not only when the underlying data structure is exposed in the public API, but also when the behaviour of some special cases put a strain on the whole process.

For a great read on the implementation of hash tables in C, head to chapter 18 of Beautiful Code, called “Python's Dictionary Implementation: Being All Things to All People”.