Monoidal hashing

Recently I had this problem – how do I efficiently hash an immutable rope data-structure as it is being constructed?

sealed trait Rope extends Product with Serializable {
def hash: Int
def equals(obj: Any): Boolean = obj match {
case that: Rope => this.toString == that.toString
case _ => false
}
}
final case class Combine(left: Rope, right: Rope) extends Rope {
val hash = f(left.hash, right.hash)
def toString = left.toString + right.toString
}
final case class Leaf(value: String) extends Rope {
val hash = g(value)
def toString = value
}

Note that due to the way equality works on ropes – ropes that have equal string representation are equal, f: Int -> Int -> Int must be associative and g: String -> Int must form a monoid homomorphism with g("") being the identity with respect to f. Why? Let’s unpack.

• Equal ropes must have equal hashes.
• Combine(Leaf(""), x) == x, therefore f(g(""), x.hash) == x.hash.
• Combine(x, Leaf("")) == x, thereofre f(x.hash, g("")) == x.hash.
• Combine(Combine(x, y), z) == Combine(x, Combine(y, z)), therefore f(f(x.hash, y.hash), z.hash) == f(x.hash, f(y.hash, z.hash)).
• Leaf(x + y) == Combine(Leaf(x), Leaf(y)), therefore g(x + y) == f(g(x), g(y)).

Re-discovering $$SL_2$$

Once I realized that g must be a monoid homomorphism, I started searching for papers or articles on the topic of “monoidal hashes” and “monoid homomorphism hashing”, but all I could find was a single algorithm that is called SL2:

• The original paper introducing the algorithm.
• A video explaining some additional motivation and the algorithm.
• An MSc thesis discussing the hash function and some attacks against it.
• Some improvements to make it more resilient against cryptographic attacks.

It is rather complex and while there is a highly optimized version of it, I did not really want to use a cryptographic-grade hashing algorithm just to hash a data-structure into an Int. I had to find something simpler.

Designing a hash function

I set out to explore my options in a newly created Jupyter notebook. First, I defined my monoidal hash MHash:

class MHash:
bits: int                           # The number of bits in our hash output.
empty: int                          # Identity element w.r.t. combine.
def combine(self, left, right): int # Combines two hashes, associative.
def lift(self, byte): int           # Monoid homomorphism from a single byte.

def random(self):
return randbits(self.bits)

def compute(self, bytes):
r = self.empty
for b in bytes:
r = self.combine(r, self.lift(b))
return r

And some simple tests for it:

def test_properties(m):
# Test associativity.
for _ in range(100):
h1, h2, h3 = randbits(m.bits), randbits(m.bits), randbits(m.bits)
assert m.combine(m.combine(h1, h2), h3) == m.combine(h1, m.combine(h2, h3))
...

Now I could implement a really simplistic “hash function” and test it:

class MHash0(MHash):
bits  = 32
mask  = 2 ** bits - 1
empty = 1
def combine(self, x, y): return (x * y) & self.mask
def lift(self, byte):    return byte

test_properties(MHash0())

What makes a good hash function?

Quoting wikipedia:

A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same probability.

Furthermore, a good hash function must satisfy a so-called avalanche criterion: if an input is changes slightly (for example, flipping a single bit), the output changes significantly (e.g., half the output bits flip).

Both are pretty easy to test:

1. Generate a random input sequence.
2. Flip some bits, record how the output changes.
3. Compute how many bits of the output are effectively used – given a random input, how much uncertainty do I have about the resulting hash in bits? I called this metric “Output Entropy”, the closer it is to 32 bits, the better.
4. Compute how many bits of the output are effectively controlled by the input bits – given a random input and a random bit flip, how much uncertainty do I have about the resulting hash in bits? I called this metric “Avalanche Entropy”, the closer it is to 32 bits, the better.
5. Compute how many bits of the output change when a single input bit flips. I called this metric “Mean Avalanche Size”. The closer it is to 16 bits, the better.

The simplistic algorithm MHash0 from above gave me:

Monoid Homomorphism  = True
Non-commutative      = False
Output Entropy       = 28.637 bits
Avalanche Entropy    = 22.791 bits
Mean Avalanche Size  = 10.134 bits

Notice that I also tested for non-commutativity for obvious reasons. Non-commutativity requirement gave me a pause for a bit. I knew of only one way to make a non-commutative monoid - use matrix multiplication.

I tried that:

class MHash2(MHash):
bits  = 32
mask  = 2 ** bits - 1
empty = (1 << 24) + (0x00 << 16) + (0x00 << 8) + (1 << 0)

def combine(self, x, y):
assert 0 <= x < 2**self.bits
assert 0 <= y < 2**self.bits
x00, x01, x10, x11 = (x >> 24) & 0xFF, (x >> 16) & 0xFF, (x >> 8) & 0xFF, (x >> 0) & 0xFF
y00, y01, y10, y11 = (y >> 24) & 0xFF, (y >> 16) & 0xFF, (y >> 8) & 0xFF, (y >> 0) & 0xFF

mul = lambda x, y: x * y
add = lambda x, y: x + y
z00 = add(mul(x00, y00), mul(x01, y10)) & 0xFF
z01 = add(mul(x00, y01), mul(x01, y11)) & 0xFF
z10 = add(mul(x10, y00), mul(x11, y10)) & 0xFF
z11 = add(mul(x10, y01), mul(x11, y11)) & 0xFF

return (z00 << 24) + (z01 << 16) + (z10 << 8) + (z11 << 0)

# Just mixes the 256 values of the input byte up,
# "diffusing" them randomly among 2**32 output
# values. This operation is bijective.
def lift(self, x):
x = (x * 0xfffffd + 0xe6546b64) & self.mask
x = x ^ (x >> 16)
x = (x * 1729) & self.mask
x = x ^ (x >> 10)
x = (x * 1729) & self.mask
x = x ^ (x >> 16)
return x

But was not entirely satisfied with the results:

Monoid Homomorphism  = True
Non-commutative      = True
Output Entropy       = 30.957 bits
Avalanche Entropy    = 30.724 bits
Mean Avalanche Size  = 13.884 bits

A careful choice of lift function is necessary for good results here. Perhaps a carefully chosen translation table for each byte value would make it work even better…

I decided to pursue a different route. Searching for “non-commutative integer monoid” on Google and Math Overflow eventually led me to this simple monoid:

(a, b) x (c, d) = (a + b * c, b * d)

I could not believe at first that it was a non-commutative monoid, but it passed all my tests.

class MHash4(MHash):
bits  = 32
mask  = 2 ** bits - 1
empty = 1

def combine(self, x, y):
x, x1 = (x >> 16) & 0xFFFF, (x >> 0) & 0xFFFF
y, y1 = (y >> 16) & 0xFFFF, (y >> 0) & 0xFFFF
z, z1 = (x + x1 * y) & 0xFFFF, (x1 * y1) & 0xFFFF
return (z << 16) + z1
...

Unfortunately, it was still lousy:

Monoid Homomorphism  = True
Non-commutative      = True
Output Entropy       = 30.678 bits
Avalanche Entropy    = 29.438 bits
Mean Avalanche Size  = 13.607 bits

Well, if you think about it, multiplication mod 2 ** 32 is not a particularly good idea. Factors of 2 will eventually accumulate, making the whole thing zero out.

Suddenly, I realized that any monoid with annihilators will not do. For example,

(0, 0) x (c, d) = (0 + 0 * c, 0 * d) = (0, 0)

Which means that if we find a string that hashes to (0, 0), we can add arbitrary suffices to it to get easy hash collisions. We could keep the second component always odd, ensuring that no annihilators will ever arise, but we lose that one bit of output entirely.

class MHash6(MHash):
...

def random(self):
while True:
x = randbits(32)
if (x & 1) != 1: continue
return x

def combine(self, x, y):
x, x1 = (x >> 16) & 0xFFFF, (x >> 0) & 0xFFFF
y, y1 = (y >> 16) & 0xFFFF, (y >> 0) & 0xFFFF
z, z1 = (x + x1 * y) & 0xFFFF, (x1 * y1) & 0xFFFF
return (z << 16) + z1

def lift(self, x):
...
if (x & 1) == 0: x = x ^ 1
return x
Monoid Homomorphism  = True
Non-commutative      = True
Output Entropy       = 31.641 bits
Avalanche Entropy    = 30.962 bits
Mean Avalanche Size  = 15.508 bits

What if instead of making the right component odd, we multiply them by 2 and add 1 right before our monoid operation?

class MHash7(MHash):
...

def combine(self, x, y):
x, x1 = (x >> 16) & 0xFFFF, x & 0xFFFF
y, y1 = (y >> 16) & 0xFFFF, y & 0xFFFF
x1 = (x1 << 1) + 1
y1 = (y1 << 1) + 1
z, z1 = (x + x1 * y) & 0xFFFF, (x1 * y1) & 0x1FFFF
z1 = z1 >> 1
return (z << 16) + z1

...

Surprisingly, this hash function does really well:

Monoid Homomorphism  = True
Non-commutative      = True
Output Entropy       = 32.000 bits
Avalanche Entropy    = 31.942 bits
Mean Avalanche Size  = 15.956 bits

It has no obvious annihilators, and although it’s avalanche diagram has some obvious structure (afterall, we are just adding and multiplying integers), it still has amazing statistics.

I did some experiments with general linear group as well (computing elements mod 251 instead of mod 256, very useful trick ensuring that matrices don’t end up being degenerate), but even though the resulting hash-function has better statistics,

Monoid Homomorphism  = True
Non-commutative      = True
Output Entropy       = 31.999 bits
Avalanche Entropy    = 31.999 bits
Mean Avalanche Size  = 15.997 bits

it is much more complex:

class MHash5(MHash):
bits  = 32
mask  = 2 ** bits - 1
empty = (1 << 24) + (0x00 << 16) + (0x00 << 8) + (1 << 0)

table = ...

def combine(self, x, y):
x00, x01, x10, x11 = (x >> 24) & 0xFF, (x >> 16) & 0xFF, (x >> 8) & 0xFF, (x >> 0) & 0xFF
y00, y01, y10, y11 = (y >> 24) & 0xFF, (y >> 16) & 0xFF, (y >> 8) & 0xFF, (y >> 0) & 0xFF

mul = lambda x, y: x * y
add = lambda x, y: x + y

z00 = (add(mul(x00, y00), mul(x01, y10)) % 251) & 0xFF
z01 = (add(mul(x00, y01), mul(x01, y11)) % 251) & 0xFF
z10 = (add(mul(x10, y00), mul(x11, y10)) % 251) & 0xFF
z11 = (add(mul(x10, y01), mul(x11, y11)) % 251) & 0xFF

return (z00 << 24) + (z01 << 16) + (z10 << 8) + (z11 << 0)

def lift(self, x):
# All elements of the table have det != 0.
return self.table[x]

To be continued…

The problem turned out to be mostly theoretical because I realized that by computing the hash ahead of time I would lose precious ticks when constructing large ropes just to e.g. print them out.