Saturday, 14 January 2017

About a 40 Character Compression

"SHA-1 is an algorithm and what it does is: it takes some data as input and generates a unique 40 character string from it." (source)

That is just so horribly incorrect I had to take a note - a hash function typically does not return unique results.

Wikipedia says "A hash function is any function that can be used to map data of arbitrary size to data of fixed size."

With this definition the function has no chance to do that.

Just like you'd fail to assign a 8 bit (2 ^ 8 states) unique ID to anyone in a sizeable country, you cannot assign even a 160 bit unique value to any input of typically well over 160 byte (>> bit) size.

Obviously if one could assign a unique 160 bit identifier to data of any size, that would mean a universal 160 bit (40 character) compression. Possibly in a huge number of steps though, a decompression algorithm could just step through all the valid inputs spiralling through 1 bit, 2 bit etc. candidates, calculating the hash of each, checking when it matches up the desired result.

And otherwise this would mean a bijection between sets of different number of elements - 2^160 vs. infinite.

Anyway, the conclusion is even Git doesn't have the powers to make SHA-1 do the job. OMG :)

No comments:

Post a Comment