Bloom Filters: the false positive rate

In the previous section, we looked at a Java implementation of a Bloom filter to represent a set of strings. In this section, we look more closely at how the bloom filter performs as we change two key parameters:

• the number of bits per item: that is, how many times more bits there are in the filter than the maximum number of items we want to represent in the set;
• the number of hash codes, and hence the number of bits that we actually set for each object that we add to the set.

Intuitively, we can envisage the following general principles coming into play:

• the fuller the bit set is (or put another way, the fewer bits we allow per item), the greater the chance of false positives as we "accidentally" marking items as present as we add more items to the set;
• when the bit set is relatively empty (with more bits per item), then the more bits we require to be marked as set in order to mark an item as "present"— i.e. the more hash codes per item— the lower the chance of false positives, because for a given item potentially in the set, there's less chance of some random combination of bits from other items also accidentally marking that item as present when it isn't;
• for a given filter size, there's a point of no return, at which having more hash codes simply means that we fill up the bit set too quickly as we add items— and hence get more false positives— than with fewer hash codes.

To see what this all adds up to in practice, we look at the example of adding 65,536 (=214) random strings to a Bloom filter whose size varies from 214 to 223 bits, and where the number of hash codes varies from 1 to 8. For each of these combinations of size and number of hash codes, we create the corresponding Bloom filter then add 65,536 items to it. We also instantiate a regular HashSet and add the strings to this, so that we can determine what strings we "really" added to the Bloom filter and thus detect false positives. Then, the final step is to generate a number (in this case 1 million) random strings. For each of these not in the set (which of course we expect to be the vast majority), we count the number reported to be present by the Bloom filter: these are the false positives. The code looks essentially like this:

```Random r = new SecureRandom();
final int noItems = 1 << 14;

for (int log2bits = 14; log2bits <= 23; log2bits++) {
for (int noHashes = 1; noHashes <= 8; noHashes++) {
double noFalsePositives = 0;
int noNotIn = 0;

BloomFilter bf = new BloomFilter(log2bits, noHashes);
// Add items fo Bloom filter
for (int itemNo = 0; itemNo < noItems; itemNo++) {
String s = randomString(r);
}
// Now test for false positives
for (int n = 0; n < NO_FALSE_POSITIVE_TESTS; n++) {
String s = randomString(r);
noNotIn++;
if (bf.contains(s)) noFalsePositives++;
}
}
double falsePositiveRate = noNotIn == 0 ? 0d :
noFalsePositives / noNotIn;
}
}
```

Notice that we use SecureRandom rather than the regular java.lang.Random class. Due to weaknesses in the LCG algorithm used by java.lang.Random, the latter is not suitable for this kind of simulation where we need to generate a large numbe of highly random combinations.

To create our random strings, we use the following method: we arbitrarily choose to create strings whose length centres around 5 (a "normal word length") and whose characters are randomly chosen from "regular letters" plus a few accented characters. In other words, we aim to create "vaguely typical" strings without going to too much effort1.

```private static final String LETTERS =
"abcdefghijklmnopqrstuvexyABCDEFGHIJKLMNOPQRSTUVWYXZzéèêàôû";
private static String randomString(Random r) {
int wordLen;
do {
wordLen = 5 + 2 * (int) (r.nextGaussian() + 0.5d);
} while (wordLen < 1 || wordLen > 12);
StringBuilder sb = new StringBuilder(wordLen);
for (int i = 0; i < wordLen; i++) {
char ch = LETTERS.charAt(r.nextInt(LETTERS.length()));
sb.append(ch);
}
return new String(sb);
}
```

Next: graph of results

On the next page, we show the result of this experiment, and hence how false positive rate varies as a function of filter size and number of hash codes.

1. Of course in reality, typical words of a given language behave in a much more complex manner than this: the probability of a given letter is highly dependent on the previous letter, for example.

If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants.