Advanced use of hash codes in Java: duplicate elimination
A common use of the HashSet is for various types of duplicate elimination.
The hash set, like the case of hash map keys, stores items in their entirety. Let's say we're collecting
some information about the different clients that connect to our web server. The first time
that we see a particular client (identified by IP address for this example), we want to log some
information about the client to a database. But if the same client subsequently makes another request, we
don't want to log the information again.
Ideally, we don't want to have to hit the database unnecessarily. So as a first attempt, we can use a HashSet to
store which IP addresses we've already seen. Let's say we want to log data from about 50,000 clients:
Set<String> ipsSeen = new HashSet<String>(50000);
public void logInformation(String ipAddress, String info) {
boolean isNew;
synchronized (ipsSeen) {
isNew = ipsSeen.add(ipAddress);
}
if (isNew) {
// persist (ipAddress,info) to the database
}
}
This will reduce the number of database hits because we don't have to hit the database to see
if we've already saved information on a particular client. But it's also quite a memory-hungry way
of achieving that purpose.
Now, supposing we don't actually store the IP addresses in memory, but just
store the hash codes of those we've already seen. (An alert reader will have spotted
that in 'raw bytes', a 'normal' IP address (i.e. not an IPv6 address) is actually 4 bytes, i.e. on today's Internet, a numeric
IP address can be its own hash code.) So we could use a HashSet<Integer>, calculating the
hash code of ipAddress each time before adding it to the set:
Set<Integer> ipsSeen = new HashSet<Integer>(50000);
...
isNew = ipsSeen.add(ipAddress.hashCode());
Putting just the hsah code in the HashSet is a bit better than the entire string,
but it's still quite heavy-handed. On the next page, we'll consider an alternative technique. In effect,
we can allocate a hash table that has a very high load factor– i.e. the number of
buckets is several times the number of items– and store only one bit per item.
This allows us to store a hash set as a BitSet,
with some caveats.
If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants.
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.