What is the hash value of a file, and what does it mean if two files have the same hash value?
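Hash values can be thought of as fingerprints for files. The contents of a file are processed through a cryptographic hash algorithm, which produces a fixed-length numerical value, the hash value, that identifies the contents of the file. The value is not guaranteed to be unique, but with a well-designed algorithm it is so unlikely to be shared by chance that it is treated as unique in practice. If the contents are modified in any way, even by a single bit, the hash value changes substantially. MD5 and SHA-1 have long been the two most widely used algorithms for this; both are now known to be vulnerable to deliberately constructed collisions, so SHA-256 is generally preferred where security matters.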
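As a rough illustration of the fingerprint idea, here is a minimal Python sketch that hashes a file with the standard `hashlib` module. The function name `file_digest`, the chunk size, and the default algorithm are choices made for this example, not something the answer above prescribes; passing `"md5"` or `"sha1"` selects the algorithms named above.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hexadecimal hash value of the file at `path`.

    `algorithm` is any name accepted by hashlib.new(), e.g. "md5" or "sha1".
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two files with identical contents give the same digest regardless of their names, while changing a single byte in one of them should give a completely different digest.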
Hashing calculates a fixed-size bit string from a file. A file is essentially a sequence of blocks of data; the hash algorithm transforms this data into a much shorter, fixed-length value that represents the original contents. The hash value can be thought of as a compact summary of everything in the file, although the file cannot be reconstructed from it.
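The fixed-length property can be checked directly. The sketch below, using Python's `hashlib`, hashes inputs of very different sizes with MD5 and reports the digest length each time; the helper name `md5_hex` is just a label for this example.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()

# Inputs of 0 bytes, 1 byte, and 1,000,000 bytes.
for data in (b"", b"a", b"a" * 1_000_000):
    digest = md5_hex(data)
    # Each hex character encodes 4 bits, so 32 characters = 128 bits.
    print(f"input: {len(data):>9} bytes -> digest: {len(digest) * 4} bits")
```

Whatever the size of the input, the digest has the same number of bits.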
An MD5 sum is 128 bits (16 bytes) long. Since the number of possible file contents is effectively unlimited while the number of possible MD5 sums is finite (2^128), collisions must exist, although the probability of hitting one by accident is tiny. In other words, two different files can produce the same sum when hashed with MD5. So if two files have the same hash value, they are almost certainly identical, but this is not an absolute guarantee.
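Accidental collisions in a full 128-bit MD5 sum are far too rare to stumble on, but the same counting argument can be watched in action by shrinking the output. The sketch below is a toy: it keeps only the first 2 bytes (16 bits) of each MD5 digest, an artificial truncation chosen purely for illustration, and searches simple counter strings until two different inputs share the same truncated value. The names `truncated_md5` and `find_collision` are hypothetical helpers.

```python
import hashlib

def truncated_md5(data: bytes, n_bytes: int = 2) -> bytes:
    """Return only the first `n_bytes` bytes of the MD5 digest of `data`."""
    return hashlib.md5(data).digest()[:n_bytes]

def find_collision(n_bytes: int = 2):
    """Find two different inputs with the same truncated hash.

    With n_bytes=2 there are only 65,536 possible outputs, so a repeat
    is guaranteed within 65,537 distinct inputs.
    """
    seen = {}  # truncated hash -> first input that produced it
    i = 0
    while True:
        msg = str(i).encode()
        tag = truncated_md5(msg, n_bytes)
        if tag in seen:
            return seen[tag], msg, tag
        seen[tag] = msg
        i += 1

first, second, tag = find_collision()
print(first, second, tag.hex())
```

The two inputs returned are different, yet they map to the same 16-bit value. The full 128-bit digest behaves the same way in principle, only with a vastly larger number of possible outputs.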
Because of this, it is better in some cases to use a hash with more output bits (and therefore more possible values). This reduces the already low probability of an accidental collision and makes it harder to create a deliberate collision by brute force.
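To get a feel for how much a longer hash helps, the sketch below uses the standard birthday-problem approximation, p ≈ 1 − exp(−n(n−1) / 2N), where n is the number of files and N = 2^bits is the number of possible hash values. It assumes the hash outputs are uniformly distributed and says nothing about deliberate attacks; the function name and the file count in the usage lines are picked for this example.

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of `n_items` random inputs
    share a hash value, for a hash with `hash_bits` output bits."""
    space = 2 ** hash_bits
    exponent = -n_items * (n_items - 1) / (2 * space)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)

# Compare a 128-bit hash (e.g. MD5) with a 256-bit hash for a billion files.
for bits in (128, 256):
    print(bits, collision_probability(10**9, bits))
```

Under these assumptions, going from 128 to 256 bits makes an accidental collision many orders of magnitude less likely for the same number of files.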