# Collision attacks

In choosing a given key size or hash width, we need to consider the possibility of collision attacks. Collision attacks essentially stem from the observation that if you take two sets of random values from some range, then the size of each set need only be around the square root of the number of potential values for there to be a good chance of at least one collision, or value present in both groups. For example, let's say you have a random number generator that generates numbers fairly between 1 and 65536 (=216). Now, you generate 256 numbers and call them Group A. And then you generate another 256 numbers and call them Group B. There'll be a greater than 60% chance that at least one number in Group A is also in Group B.

At present, collision attacks are not generally practical with 128-bit keys and hashes, but are arguably going to become practical quite soon, certainly within the required "confidentially lifetime" of many people's data.

## Birthday attacks

A common variant is called a birthday attack or example of the birthday problem. It usually refers to observing repetitions that are likely to occur given enough applications of a particular function or algorithm. It stems from the observation that if you pick 23 people at random, there's a better than 50% chance that some pair will share the same birthday.

The phenomenon extends most obviously to hash functions. Let's say we use a 128-bit MD5 hash to verify the integrity of a message. There are 2128 possible hash codes. But after we have hashed "only" 264 random messages (the square root of 2128), then just by chance it is more likely than not that some pair of those messages will have the same hash code. With SHA-1, a 160-bit hash, then a pair only becomes likely after around 280 messages. (Note that we're just talking about collisions that naturally would occur by random chance; we're not referring to active attacks against these particular hash functions.)

A similar phenomenon occurs if we encrypt a large number of blocks using certain block cipher modes. In OFB (output feedback) mode, then after encrypting in the order of 264 blocks, it's likely that we'll re-use part of the previous cipher stream. At present, this is an inconceivably large amount of data; in 20 or 30 years, it may well be a realistic amount of data to encrypt with a given key.

## Meet-in-the-middle attacks

In a meet-in-the-middle attack, Group A is typically some pre-calculated set of encrypted blocks, and Group B is a set of "real" encrypted blocks. The idea is to use the pre-calculated set of blocks to try and find at least one "real" encrypted block that you can decrypt. A typical version of the attack is as follows:

• we pre-calculate the encrypted version of some known or suspected block of plaintext (for example, some known file header, known part of a connection protocol etc) for about 2n/2 possible keys, where the key length is 2n— note that this is the square root of the possible number of keys, and much much fewer than half the keys that we'd expect for a brute-force attack;
• then, we sit and make lots of observations: scan lots of encrypted conversations, scan lots of encrypted files etc; after an average of 2n/2 observations, we'd expect a collision— i.e. for one of our pre-calculated encrypted blocks to occur— with over 60% certainty (assuming our prediction of the plaintext was correct, of course);
• you can trade more pre-calculated cipher texts for fewer required observations and vice versa, or just trade fewer of either for a smaller percentage chance of getting a collision.

Collision attacks are particularly deadly for DES, with its 56-bit key size and small block size. If you pre-calculate 230 keys (about a billion, or about 8 GB of data), then after 220 observations (about a million), there is around a 1% chance of a collision; pre-calculating 232 keys gives roughly that chance after 218 observations (around 200,000). These are very feasible calculations and not an unreasonable number of encryptions in some applications, and are part of the reason why you shouldn't entrust data of any value to DES.

At present, scaling these calculations up to 128-bit key sizes (and a 128-bit block size) sounds infeasible. Working with 262 precalculations and 262 observations gives us around a 6% chance of a collision, which might be just about the cut-off point for "worth an expensive attack" for some adversaries. But storing 262 blocks requires about 50 million terabytes (and we can't compute them fast enough each time to not have to store them). And we're talking about 4 billion billion observations. For now, this doesn't constitute a practical attack. But how far-fetched will these numbers appear in 10 years, 20 years, 30 years...?

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants.