Hash collisions in computer security

Last week, Robin of the YouTube channel 8-bit Show and Tell wondered out loud on Twitter why Chrome flags Netracer 1.1, a modern indie Commodore 64 game, as malware. I think this is a classic case of hashing algorithms having gone wrong. In this blog post, I’ll explain what a hash collision is, using this collision of my hobby of retro computing and my day job of information security as an example.

What a hashing algorithm is

Hash collision example in computer security
According to Google’s Safe Browsing service, this indie Commodore 64 game from 2008 is malicious software. It’s a false positive, almost certainly a result of a SHA1 hash collision.

Hashing algorithms are one of the more difficult concepts in computer security to grasp. I think it’s because we tend to start with the math rather than starting with what we use them for. Hashing algorithms take a stream of bytes and return a fixed-length hexadecimal code after doing a ton of cryptographic math on it.

For example, here’s what the word password looks like after you hash it with MD5:

5f4dcc3b5aa765d61d8327deb882cf99

But if you change that letter o to a zero, it looks like this after you hash it with MD5:

bed128365216c019988915ed3add75fb

It’s completely different, even though I changed a single character.

So what’s it good for?

The concept is that there are times we need to generate a unique signature for a file or a string. Let’s look at some common use cases for them. I can think of at least four problems file hashes can solve.

Use of file hashes in antivirus and antimalware

In the case of antivirus or anti-malware, we need to be able to see if a file is safe without storing a copy of every malicious file ever created and comparing the file in question against each of them. Doing that would be slow, would require more disk space than anyone wants to dedicate to antivirus, and it’s also not something anyone wants to keep laying around.

Similarly, what a simple antivirus or anti-malware product does is calculate a hash for each file it scans, and then compares that hash against a list of hashes for known viruses war malware.

Use of file hashes to avoid storing passwords

Another application for it is passwords. It is poor practice to save passwords in human readable or even machine readable form. You don’t want someone stealing your password database and then decoding it. Hashing algorithms are ideal for this, because they are not reversible. When you go to change your password, a properly implemented system doesn’t store your password. Instead, it calculates the hash of your password and stores the hash. When you log back in again, it takes the password you entered, calculates the hash, then compares that hash against the stored hash. It solves a real problem rather elegantly. Dictionary attacks against hashed passwords are still possible, but more difficult and expensive.

Using file hashes to check the integrity of downloads

A third use for hashing algorithms is to make sure that a file you downloaded didn’t get corrupted or tampered with. I’m sure you’ve seen this on web pages sometimes. The author of the file will provide a file hash and the algorithm. Then, after you download the file, you calculate the hash using the same algorithm, and if it matches, the file is good. If it doesn’t match, the file is corrupt, either intentionally or unintentionally.

I wrote about how to check the integrity of downloaded files and your Windows directory way back in 2011. How time flies.

Use of file hashes in threat intelligence

Computer security professionals sometimes use a fourth case. It is possible to subscribe to threat intelligence services. Many of these services provide file hashes of suspicious files that turned up in breach investigations. The theory being that if you have a system that you suspect is compromised, and you find a match for one or more of those file hashes, that can give insight into who attacked you and what their motives were.

Hashing algorithms people use today

A large number of hashing algorithms exist, but the three most common are MD5, SHA-1, and SHA-2, sometimes also called SHA-256.

Getting back to the original problem, if each hash is supposed to be exactly one of a kind, why did Chrome flag an indie Commodore 64 game as malware?

The problem of hash collisions

The problem is that in theory, every hash algorithm is imperfect. In practice, we know the three algorithms in common use today are imperfect. The chances of two different files or two different strings producing the same hash are low, but they are not zero. In the case of MD5, we’re talking approximately one in 18 million.

The problem is, the more imperfect the algorithm is, the faster it runs. We’ve known since 1996 that MD5 was prone to collisions and since 2004 how to exploit it. But it’s still in widespread use because it is extremely fast, at least if you aren’t using an 8 bit processor to compute it. Google demonstrated an example of a collision in SHA1 in 2017. Theoretical flaws in SHA-2 first surfaced in 2008, but as far as anyone knows, attacking those flaws is not practical yet.

Either in theory or in practice, the word “password” with its MD5 of 5f4dcc3b5aa765d61d8327deb882cf99 has a twin that also computes the same MD5 in spite of being completely different.

Diagnosing the Netracer false positive as a hash collision

When Robin brought up the problem with the Netracer binary, I downloaded the same file using Firefox. Firefox also flagged it as malicious. The antivirus on my computer didn’t care. Then I uploaded the file to Virustotal, and none of its tools flagged it as malicious.

The anti-malware built into your web browser and antivirus programs are not looking for exactly the same things. Google Safe Browsing, which both Chrome and Firefox use, utilizes technology licensed from ESET with Google’s own set of signatures. ESET, from what I can tell, still utilizes SHA1 at the time of this writing.

I think it’s a SHA1 hash collision between something in Google’s list and the Netracer 1.1 C-64 binary.

SHA1 hasn’t been completely phased out, but NIST has set its EOL date for 2030. Of potential interest to retro PC enthusiasts, the reason you can’t download updates for older versions of Windows anymore, at least from Microsoft, is because Microsoft discontinued use of SHA1 in Windows Update in August 2020, and pre-Vista Windows versions use SHA1 rather than SHA2.

The problem with signature based antivirus

If you are interviewing for an entry level security job, it is a pretty safe bet that someone during your interview process will ask you how to evade antivirus. It’s one of those things security professionals who don’t know how to conduct a technical interview like to ask.

The answer is that you change the file in a way that still allows it to run. Any change that you make to that malware will change its hash. And traditional antivirus that relies on file hashes will miss it.

My fellow retro enthusiasts might be amused to hear that security professionals regard UPX, the compression tool, as a hacking tool. Retro enthusiasts love UPX because it allows us to shrink binaries, including binaries for some vintage operating systems, so they take less space on a disk. But compressing malware with UPX can also evade some security tools.

So if Leif Bloomquist, the file author, is concerned that this binary might be scaring people, options exist. Changing one byte in the executable file will do the trick. Inserting a single hexadecimal EA, the 6502 NOP instruction, is the easy option, if increasing the size of the binary would be OK. If every byte in the 12K executable is in use and it needs to remain 12K, any other change that still allows the program to run will do. Changing a color, changing the design of one of the sprites, or reordering two steps of the initialization routine would do the job.

Modern enterprise-grade antivirus and EDR and XDR tools analyze software behavior rather than relying on signatures. It’s much more difficult to code, which is why large companies change out these tools about as often as Elizabeth Taylor changed husbands. But a big part of their appeal is that they don’t have to deal with hash collisions.

If you found this post informative or helpful, please share it!