Finding an item in a collection is one of the most common operations in everyday programming. The operation must be efficient, reliable, and, of course, simple to implement. Efficiency is not easy to achieve even when the problem seems trivial, so a number of techniques have been devised to address it, each appropriate under particular circumstances. There is no single formula that is efficient in general, only adequate ones, but in the hands of a knowledgeable programmer they can be excellent tools. Hashing, the subject of this article, is one such technique, and Java supports it directly through the hashing classes in its API library. This article walks through the techniques of hashing before briefly covering the Java hashing classes.
Overview of Hashing
The idea of hashing begins with using a key of an item, such as the Social Security number of an employee, the bank account number of a bank customer, or the batch number of a product, to determine the exact location of the item in a table (the hash table). The original key does not have to be a number; it may be a string, a date, and so forth, but it is ultimately converted into a number (the hash code) by a method called the hash function.
Without Hashing
Suppose we are given a list of items, a specific item to search for, and the instruction to insert the item in an appropriate location if it is not found. Normally, we search the list items one by one (a sequential search) and retrieve the item if it is found; otherwise, we insert it into the next empty location. This may be acceptable when there is no other option or the list is harmlessly small; the technique is effective, no doubt, but rudimentary, and far from efficient. If the items are arranged in a certain order (ascending or descending), we can apply a very efficient searching technique called binary search. But then insertion becomes difficult, because maintaining the order is a big overhead. For example, consider a list containing the numbers 1, 6, 9, 20, 45, 67. To insert 19, we need to move 20, 45, and 67 one place over, as the sketch below shows. (Another approach is to insert at the end of the list and sort it again, but this is still inefficient.) Now imagine that 0 arrives and we must move every item; in the worst case, the list contains thousands of items or more. Searching may be fine, but how efficient would the overall performance be if the operation also includes insertion and deletion? Clearly, we need better ideas.
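To make that overhead concrete, here is a minimal Java sketch (the class name and values are ours) of ordered insertion: binary search locates the slot quickly, but every larger element must then be shifted one place to the right.

import java.util.Arrays;

public class OrderedInsertDemo {
    public static void main(String[] args) {
        int[] list = {1, 6, 9, 20, 45, 67, 0}; // last slot kept free for the insert
        int count = 6;                          // number of items currently in use
        int key = 19;

        int pos = Arrays.binarySearch(list, 0, count, key); // fast: O(log n)
        if (pos < 0) {
            pos = -(pos + 1);                   // insertion point when not found
            // The costly part: shift 20, 45, and 67 one place over.
            System.arraycopy(list, pos, list, pos + 1, count - pos);
            list[pos] = key;
            count++;
        }
        System.out.println(Arrays.toString(list)); // [1, 6, 9, 19, 20, 45, 67]
    }
}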
With Hashing
With hashing techniques, a data structure called a hash table is used, and keys are mapped to array locations by a hash function; the mapped values are simply array indices. An element stored in the hash table is located directly through the hash function, which applies a mathematical formula to the key to produce an integer. This integer is then used as the index of the key in the hash table. The formula can be derived from a variety of mathematical expressions; one of the simplest is the division method.
h(k, i) = (k % s + i) % s
where:
k is the key,
s is the size of the hash table,
and i is the probe variable, beginning with i=0. (The use of this variable will be clear once you walk through the following example.)
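In Java, the formula translates directly into a small helper (a sketch; the method name is ours):

// Division-method hash with a linear-probe offset i (explained below).
// Assumes non-negative keys, as in the example that follows.
static int hash(int k, int i, int s) {
    return (k % s + i) % s;
}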
To illustrate, let's walk through an example. Consider a hash table of size s=10. The value -1 in the hash table denotes that a location is initially empty.
Hash Table, ht[10] | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Step 1: k=23
Therefore,
h(23, 0) = (23 % 10 + 0) % 10
h(23, 0) = (3 + 0) % 10
h(23, 0) = 3
Because ht[3] is empty, we can insert 23 into this location.
Hash Table, ht[10] | -1 | -1 | -1 | 23 | -1 | -1 | -1 | -1 | -1 | -1 |
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Step 2: k=82
Therefore,
h(82, 0) = (82 % 10 + 0) % 10
h(82, 0) = (2 + 0) % 10
h(82, 0) = 2
Similarly, ht[2] is also empty. We can insert 82 at index 2.
Hash Table, ht[10] | -1 | -1 | 82 | 23 | -1 | -1 | -1 | -1 | -1 | -1 |
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
In much the same way, we can insert 49, 67, 28, 55, and 96.
Hash Table, ht[10] | -1 | -1 | 82 | 23 | -1 | 55 | 96 | 67 | 28 | 49 |
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
But when we try to insert, say, 32, a collision occurs.
h(32, 0) = (32 % 10 + 0) % 10
h(32, 0) = (2 + 0) % 10
h(32, 0) = 2
The hash table location ht[2] is already occupied by 82, and we cannot overwrite it with the new key, 32. One way to overcome this problem is linear probing (recall the probe variable, i, we used in the hash function). We increment the probe variable and recalculate the hash function as follows.
h(32, 1) = (32 % 10 + 1) % 10
h(32, 1) = (2 + 1) % 10
h(32, 1) = 3
But ht[3] is also occupied, so we increment the probe variable again and recalculate the hash function.
h(32, 2) = (32 % 10 + 2) % 10
h(32, 2) = (2 + 2) % 10
h(32, 2) = 4
Now, the hash table location ht[4] is empty. We can safely insert the key there.
Hash Table, ht[10] | -1 | -1 | 82 | 23 | 32 | 55 | 96 | 67 | 28 | 49 |
index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
The idea behind linear probing, therefore, is to keep probing until an empty location is found or the hash table fills up: we increment the probe variable until we either find an empty location or, treating the array as circular, arrive back where we started.
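Putting the walkthrough into code, here is a minimal sketch (the class and method names are ours) that reproduces the table above. It assumes non-negative integer keys and reuses -1 as the empty-slot marker.

public class LinearProbingDemo {
    private static final int EMPTY = -1;    // matches the -1 marker used above
    private final int[] ht;

    public LinearProbingDemo(int s) {
        ht = new int[s];
        java.util.Arrays.fill(ht, EMPTY);
    }

    // Inserts k using h(k, i) = (k % s + i) % s; returns false if the table is full.
    public boolean insert(int k) {
        int s = ht.length;
        for (int i = 0; i < s; i++) {       // probe at most s locations
            int index = (k % s + i) % s;
            if (ht[index] == EMPTY) {
                ht[index] = k;
                return true;
            }
        }
        return false;                       // wrapped all the way around: table full
    }

    // Returns the index of k, or -1 if it is not in the table.
    public int search(int k) {
        int s = ht.length;
        for (int i = 0; i < s; i++) {
            int index = (k % s + i) % s;
            if (ht[index] == k) return index;
            if (ht[index] == EMPTY) return -1; // an empty slot ends the probe chain
        }
        return -1;
    }

    public static void main(String[] args) {
        LinearProbingDemo demo = new LinearProbingDemo(10);
        for (int k : new int[] {23, 82, 49, 67, 28, 55, 96, 32})
            demo.insert(k);
        System.out.println(demo.search(32)); // prints 4, as in the walkthrough
    }
}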
Note that there are other collision-resolution techniques, and a mathematical formula for a hash function can be derived in many different ways. Although this is a rudimentary explanation of just one of them, it gives a fairly good starting point for understanding why searching through hashing is efficient. Like all techniques, this one has its drawbacks, especially when the hash table fills up and collisions become frequent. To delve deeper, pick up any textbook on algorithms or data structures; you will find more details there.
Java Hashing Classes
The best part of the Java API is that it makes life simple. Creating a hash table from scratch is complex, so Java provides classes such as HashSet, HashMap, and Hashtable that let programmers use hashing without the hassle of implementing the mechanism themselves. All three are generic classes.
HashSet, for example, implements the Set interface and adds a hashing mechanism to the Set collection. This means the implementation cannot contain duplicates; any duplicates in the source Collection are removed when the set is constructed. Observe in the output of the following example how duplicate values are removed from the HashSet even though the array of Strings contains duplicates.
package org.mano.example;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class HashSetDemo {
    public static void main(String[] args) {
        String[] colors = {"purple", "green", "red", "magenta",
                           "blue", "red", "purple"};
        Set<String> set = new HashSet<>(Arrays.asList(colors));
        for (String s : set) {
            System.out.println(s);
        }
    }
}
Output:
red
magenta
green
blue
purple
The HashMap class implements the Map interface. The principle of the Map interface is that it stores key-value pairs and cannot contain duplicate keys; each key maps to at most one value. The main difference between Map and Set is that Map maintains key-value pairs while Set does not. The Hashtable class also implements the Map interface, but there is an important distinction between Hashtable and HashMap even though they are similar in functionality. HashMap is unsynchronized and allows null keys and null values. Because it is unsynchronized, multiple threads can modify it concurrently, so it must be synchronized explicitly. As the Java API documentation suggests, to prevent accidental unsynchronized access to the map, it is best to wrap it at creation time.
Map<String, Integer> map = Collections.synchronizedMap(new HashMap<>());
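To see the duplicate-key and null-key behavior described above, here is a minimal sketch (the keys and values are purely illustrative):

import java.util.HashMap;
import java.util.Map;

public class HashMapDemo {
    public static void main(String[] args) {
        Map<String, Integer> ages = new HashMap<>();
        ages.put("alice", 30);
        ages.put("bob", 25);
        ages.put("alice", 31);     // duplicate key: the old value is replaced
        ages.put(null, 0);         // HashMap permits a null key

        System.out.println(ages.get("alice")); // prints 31
        System.out.println(ages.size());       // prints 3
    }
}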
Hashtable does not allow null keys or values. The class is synchronized by default, but the Java API documentation recommends ConcurrentHashMap in place of Hashtable when a thread-safe, highly concurrent implementation is needed; Hashtable was merely retrofitted to implement the Map interface so that it could become part of the Java Collections Framework.
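For example, a thread-safe alternative can be created as follows (ConcurrentHashMap lives in java.util.concurrent):

Map<String, Integer> map = new ConcurrentHashMap<>();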
Java resolves hash table collisions with a very popular technique in which each cell of the table is treated as a "bucket" represented as a linked list of all the key-value pairs that hash to that cell. Because every hashing class is backed by a hash table, a couple of parameters affect its performance: the initial capacity and the load factor. The initial capacity determines the number of buckets in the hash table at the time of creation, and the load factor is a measure of how full the table is allowed to get before its capacity is increased. The default load factor in Java is 0.75, which is a good trade-off between time and space costs. As this ratio approaches 1.0 or beyond, the chance of collision increases and performance degrades.
Intriguingly, the load factor is a double-edged sword: if it increases, memory utilization improves, but the program is bound to run slower; if it decreases, the program speeds up thanks to fewer collisions, at the expense of poor memory utilization (most of the hash table remains empty). The default load factor of 0.75 seems to be the magic number that balances the two.
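Both parameters can be set explicitly through the constructors; for example (the values here are illustrative):

Map<String, Integer> map = new HashMap<>(32, 0.75f); // 32 buckets, resize when 75% full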
There are many other classes in Java that implement a hashing mechanism; these three are the basic ones.
Conclusion
The hashing mechanism is an innovative concept in its own right and, without doubt, a complex one. This article tried to simplify it, and, as with all simplification, many intricate and crucial details were overlooked. Still, it should provide a clue to the overall understanding of the technique, because to appreciate the Java API one must know the basics and how it works internally. Implementation is not that difficult when the foundation is strong.