Monoidal hashing

Last modified on June 29, 2020 by Alex

Recently I had this problem – how do I efficiently hash an immutable rope data-structure as it is being constructed?

Note that due to the way equality works on ropes – ropes that have equal string representation are equal, f: Int -> Int -> Int must be associative and g: String -> Int must form a monoid homomorphism with g("") being the identity with respect to f. Why? Let’s unpack.

  • Equal ropes must have equal hashes.
  • Combine(Leaf(""), x) == x, therefore f(g(""), x.hash) == x.hash.
  • Combine(x, Leaf("")) == x, thereofre f(x.hash, g("")) == x.hash.
  • Combine(Combine(x, y), z) == Combine(x, Combine(y, z)), therefore f(f(x.hash, y.hash), z.hash) == f(x.hash, f(y.hash, z.hash)).
  • Leaf(x + y) == Combine(Leaf(x), Leaf(y)), therefore g(x + y) == f(g(x), g(y)).

Re-discovering \(SL_2\)

Once I realized that g must be a monoid homomorphism, I started searching for papers or articles on the topic of “monoidal hashes” and “monoid homomorphism hashing”, but all I could find was a single algorithm that is called SL2:

  • The original paper introducing the algorithm.
  • A video explaining some additional motivation and the algorithm.
  • An MSc thesis discussing the hash function and some attacks against it.
  • Some improvements to make it more resilient against cryptographic attacks.

It is rather complex and while there is a highly optimized version of it, I did not really want to use a cryptographic-grade hashing algorithm just to hash a data-structure into an Int. I had to find something simpler.

Designing a hash function

I set out to explore my options in a newly created Jupyter notebook. First, I defined my monoidal hash MHash:

And some simple tests for it:

Now I could implement a really simplistic “hash function” and test it:

What makes a good hash function?

Quoting wikipedia:

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability.

Furthermore, a good hash function must satisfy a so-called avalanche criterion: if an input is changes slightly (for example, flipping a single bit), the output changes significantly (e.g., half the output bits flip).

Both are pretty easy to test:

  1. Generate a random input sequence.
  2. Flip some bits, record how the output changes.
  3. Compute how many bits of the output are effectively used – given a random input, how much uncertainty do I have about the resulting hash in bits? I called this metric “Output Entropy”, the closer it is to 32 bits, the better.
  4. Compute how many bits of the output are effectively controlled by the input bits – given a random input and a random bit flip, how much uncertainty do I have about the resulting hash in bits? I called this metric “Avalanche Entropy”, the closer it is to 32 bits, the better.
  5. Compute how many bits of the output change when a single input bit flips. I called this metric “Mean Avalanche Size”. The closer it is to 16 bits, the better.

The simplistic algorithm MHash0 from above gave me:

Notice that I also tested for non-commutativity for obvious reasons. Non-commutativity requirement gave me a pause for a bit. I knew of only one way to make a non-commutative monoid - use matrix multiplication.

I tried that:

But was not entirely satisfied with the results:

A careful choice of lift function is necessary for good results here. Perhaps a carefully chosen translation table for each byte value would make it work even better…

I decided to pursue a different route. Searching for “non-commutative integer monoid” on Google and Math Overflow eventually led me to this simple monoid:

I could not believe at first that it was a non-commutative monoid, but it passed all my tests.

Unfortunately, it was still lousy:

Well, if you think about it, multiplication mod 2 ** 32 is not a particularly good idea. Factors of 2 will eventually accumulate, making the whole thing zero out.

Suddenly, I realized that any monoid with annihilators will not do. For example,

Which means that if we find a string that hashes to (0, 0), we can add arbitrary suffices to it to get easy hash collisions. We could keep the second component always odd, ensuring that no annihilators will ever arise, but we lose that one bit of output entirely.

What if instead of making the right component odd, we multiply them by 2 and add 1 right before our monoid operation?

Surprisingly, this hash function does really well:

It has no obvious annihilators, and although it’s avalanche diagram has some obvious structure (afterall, we are just adding and multiplying integers), it still has amazing statistics.

I did some experiments with general linear group as well (computing elements mod 251 instead of mod 256, very useful trick ensuring that matrices don’t end up being degenerate), but even though the resulting hash-function has better statistics,

it is much more complex:

To be continued…

The problem turned out to be mostly theoretical because I realized that by computing the hash ahead of time I would lose precious ticks when constructing large ropes just to e.g. print them out.