HashMap and HashSet
The HashMap and HashSet classes provide two different but similar ways to organize a set of objects using the same underlying data structure.
The HashSet class is used to manage a set of objects that have unique values. Passing an object to a HashSet object’s add method will add the object to the HashSet unless there is already an object in the HashSet that equals the object passed to the add method. There is also a contains method to find an object in a set and a remove method to remove an object from a set.
The HashMap class organizes objects into pairs. There are two roles in each pair. These roles are key and value. A HashMap object organizes the pairs of objects so that no two pairs have equal keys. This means that, for each key in a HashMap, there is exactly one corresponding value.
The HashMap class has a put method that is passed a key and value pair. If the key is not already in the HashMap, the pair is added to the HashMap. If the key is already present in the HashMap, the given value replaces the previous value associated with the key. There is also a get method that returns the value associated with a given key and a remove method that removes the key-value pair for a given key.
A HashMap is like a HashSet with each object in the set having an associated value. The HashSet class is implemented as a wrapper around HashMap that uses the same object for both the key and value of each object pair.
The internal data structure used to implement a HashMap allows an object pair to be found very quickly by looking for the object pair’s key. With exceptions explained later in this article, the amount of time a HashMap takes to find an object pair by its key object is about the same for all object pairs in a HashMap and independent of the number of object pairs in the HashMap.
One reason to use a HashMap or HashSet is because search, add, and remove operations are fast. A possible reason not to use a HashMap or HashSet is that they do not keep objects in any particular order. If you need objects kept in a particular order, you should use another collection class.
Although a HashMap object makes finding an object pair by its key object a fast operation, finding an object pair by its value is not so fast. A HashMap object’s containsValue method returns true if it can find an object pair with a value that is equal to the object passed as a parameter to the containsValue method. To find an object pair with a specified value, the containsValue method looks at each of the object pairs in the HashMap until it finds one with the given value. The containsValue method must check every object pair in the HashMap before it can conclude that there is no object pair with the specified value.
The internal data structure used in a HashMap that allows a HashMap to quickly find object pairs by their key is called a chained hash table. I will explain this data structure by explaining what a hash table is; then, I will explain the chained part. For the hash table to function properly, the objects in the hash table must be instances of classes whose equals and hashCode methods satisfy certain properties. So, before I explain what a hash table is, I will explain the equals and hashCode methods.
equals and hashCode Methods
The central idea of a hash table is that you can make an accurate guess, based on the value of a key object, where it belongs or where it will be found in the hash table. The guess is based on the value returned by the key object’s hashCode method. The hash table uses the key object’s equals method to verify the guess.
Every class inherits the equals and hashCode methods from the Object class. The implementations that the Object class provides for these methods are based on the assumption that two objects are equal if and only if they are the same object. The Object class’s equals method returns true only if its argument is == to the object it is associated with. The Object class’s hashCode method returns a number that the JVM uses to uniquely identify the object.
Instances of some classes are considered equal if they contain the same values. The java.lang.Integer and java.lang.String classes are examples of this. Classes of this sort should override the equals method with an implementation that returns true if two instances contain the same value. For example, the Integer classs equals method will return true when comparing two different Integer objects that contain the value 27.
If instances of such classes are to be kept in a hash table, the class also must override the hashCode method so that if two instances of a class are considered equal by its equals method, the class’s hashCode method returns the same result for both. Returning the same result for instances that are equal is the only requirement for what int the hashCode method should return. Deciding the best way to implement hashCode for a class may require some thought.
In a few cases, there is an obvious value for a class’s hashCode method to return. For example, the Integer class’s hashCode method returns the same int that is the value of the Integer object. Implementing the hashCode method this way means that it requires no computations and that every possible value for an Integer object is its own unique hash code value.
For classes designed to contain complex values composed of multiple values, the value returned by the hashCode method must be the result of a computation that is based on all of the values. In the case of a Point class whose instances contain an x and a y value, the value returned by the hashCode method is the exclusive-or of the two values. In the case of String, the hashCode method returns a value computed from all of the characters in the string. If part of an object’s value is in another object, the hash code computed for the object’s should be partially based on the other object’s hash code.
A hash table data structure uses an array to organize objects. When adding an object to a hash table, you divide the object’s hash code by the length of the array and use the remainder as an array index to put the object in the array. Figure 1 shows this organization.
Figure 1: Hash Table
The hash table shown in Figure 1 uses an array of length 7. When an object is added with a hash code of 21991147, the logic for the has table computes the remainder of the hash code divided by 7 (21991147^7), which is 3. Based on this computation, element 3 of the array gets the object. A similar computation is made when the other objects shown in Figure 1 are added to the hash table.
When it is time to find an object in the hash table, the hash table uses the key object’s hash code to compute the array index where it expects to find the object. If there is an object at that array index, the hash table uses the object’s equals method to determine whether the object at the array index actually is the object that it is searching for.
When a hash table is asked to add an object, it computes an array index from the object’s hash code. Sometimes, there is already a different object at the computed array index. A situation where a hash table must deal with different objects that it expects to find at the same array index is called a hash collision.
There are different variations on hash tables that differ in how they decide what to do when adding an object to a hash table creates a hash collision. For example, there is a variation called an open addressing that looks for a different place in the array to put an object when there is a hash collision.
The technique used in a HashMap to resolve a hash collision is called chained hashing. The idea of chained hashing is that more than one object can be associated with the same hash table array element. This is done by associating objects with an array element through a linked list, as shown in Figure 2.
Figure 2: Hash Collision
Figure 2 shows the hash table array connected to objects through linked list nodes. When an object to be added to the chained hash table has a hash collision with objects already in the hash table, it is added to the same linked list as the objects it collides with.
When looking for an object in a chained hash table, it is necessary to look at all of the objects on the linked list associated with the object’s hash code. The longer a linked list gets, the longer it takes to search. The more hash collisions occur, the more time it takes to search for objects.
The fundamental strategy for making chained hash map operations fast is to avoid hash collisions. There are two things to do that will keep the number of hash collisions low. One is to make it very unlikely that two different objects in a hash table will have the same hash code. The other is to make the array in the hash table large enough that it becomes unlikely that objects with different hash codes will collide. I will explain how to do these things later in this article. First, you will look at how the HashMap class exposes the underlying hash table data structure.
External Visibility of the Hash Table in HashMap
Figure 3 shows how the HashMap class uses the chained hash table data structure.
Figure 3: HashMap’s Use of a Chained Hash Table
The linked list nodes used by HashMap all refer to a key object. They also may refer to a value object and to a next node.
There are a few aspects of the HashMap class’s constructors and methods that expose the underlying hash table data structure. The HashMap class has a constructor with the signature public HashMap(int initialCapacity).
The argument to this constructor is the length of the hash table array that the HashMap will use when it is constructed. In the next section of this article, you will learn how to select a value for this parameter for good performance.
The entrySet() method returns a set of Map.Entry objects. Map.Entry is an interface implemented by objects that a HashMap uses as linked list nodes for its hash table. The linked list nodes used by a HashMap contain a reference to a key object, a reference to a value object, and a reference to a next liked list node.
The Map.Entry interface provides only limited access to the contents of a linked list node. It has methods to get the key object, to get the value object, and to set the value object for the linked list node. The Map.Entry interface does not include any methods to access other linked list nodes or objects.
Quality of hashCode()
There are two ways that hash collisions can occur. One way is that the hashCode method returns the same value for objects that have different values. If two objects with distinct values but the same hash code are added to the same HashMap, they collide.
Ideally, a class’s hashCode method should return a different value for each distinct value that the class’s instances can have. The hashCode method of some classes are implemented in this way. However, there are some classes for which this is impossible.
The hashCode method returns an int. Classes that contain a long value or complex values can have more distinct values than can be represented by an int. For example, String objects can represent many more values than an int. For such classes, the hashCode method should be implemented in a way that makes it unlikely that two objects in the same HashMap will have the same hash code.
The way to make it unlikely that two instances of the same class with distinct values will have the same hash code varies with the nature of the value. However, there are some general principles:
- If the instances of a class contain a complex value, the result of its hashCode method should be based on all the parts of the complex value.
- Similar values should have different hash code values. For example, the String class’s hashCode method considers the characters in the string and their sequence so that "tis" and "its" have different hash codes.
Capacity and Loading Factor
If the length of the hash table array is less than the number of objects in the HashMap, there must be hash collisions because there are not enough linked lists for each key-value pair to be by itself. The larger the hash table array, the less likely that objects with different hash codes will collide. There are two parameters to control this. These parameters are initial capacity and loading factor.
Initial capacity is the initial length of the hash table array. As mentioned previously, there is a constructor that allows you to specify this parameter.
Loading factor is a ratio of the number of objects in the HashMap to the length of the hash table array. If the number of objects in a HashMap exceeds this ratio, the HashMap object makes the hash table larger. For example, if the initial capacity of a HashMap object is 16 and its loading factor is 0.75, if the number of keys in the HashMap becomes larger than 12 (16×.75) the hash table array will be made larger.
When a HashMap object enlarges its hash table array, it creates a new hash table array that is approximately double the length of the old array. It then rehashes all of the objects in the HashMap to put them in the appropriate linked list for the larger array.
Because it involves all the objects in a HashMap, rehashing is an expensive operation. You can postpone or avoid rehashing by setting the initial capacity sufficiently high.
The default value for loading factor is 0.75. You can set it to a different value by using this constructor:
HashMap(int initialCapacity, float loadFactor)
The default value for initial capacity is 16. You can set it to a different value by using the constructor shown above or the constructor
If you use an iterator over a HashMap, excessively large values for initial capacity or small values for loading factor can make iterators slow. An iterator looks at everything in the entire hash table. If the hash table array is much longer than needed, the iterator will spend time looking at many empty linked lists.
Hash tables do not keep their contents in any particular order. Due the possibility of rehashing, the order that objects are kept in a hash table may change without warning.
Here are a few classes that are good alternatives to HashMap or HashSet in certain specialized circumstances.
- ConcurrentHashMap: The HashMap class is not thread safe. If two different threads want to modify a HashMap at the same time, it will be necessary to use synchronization locks to ensure that only one thread at a time accesses the HashMap. When multiple threads want to access a HashMap at the same time, they will be forced to wait for their turn. This can be a performance problem if there is a lot of concurrent access to the HashMap.
The ConcurrenthashMap class is a good alternative to HashMap in these cases. The ConcurrenthashMap class is thread safe. Multiple threads can modify a ConcurrenthashMap object at the same time without any external synchronization. The ConcurrenthashMap class does use some internal synchronization locks, but these only lock access to individual linked list nodes. Threads do not need to wait for each other unless multiples threads are concurrently trying to change the value associated with the same key or there are hash collisions.
The ConcurrenthashMap class is not a good alternative to HashMap in other circumstances, beacause it takes longer to perform the same operations when no concurrency in involved.
- IdentityHashMap: The IdentityHashMap class is similar to HashMap, but it compares objects using == and uses the System.identityHashCode method for each object’s hash code. This is useful only for the unusual situation where you are concerned only with how the object is identified by the JVM and not with the object’s contents.
- LinkedHashMap: The LinkedHashMap class combines a HashMap with a doubly linked list so that the keys in the HashMap can be kept in a particular order.
- WeakHashMap: This class differs from HashMap in that if there are no other references to key objects in a WeakHashMap, the key objects will be removed from the WeakHashMap by the garbage collector.
- WeakHashSet: This class differs from HashSet in that if there are no other references to objects in a WeakHashSet, the objects will be removed from the WeakHashSet by the garbage collector.
This article has explained how to use the HashSet and HashMap classes. It has also explained the internal organization of these classes and how to use them most efficiently.
About the Author
|Mark Grand is a consultant and book author with over 30 years of experience who specializes in Distributed Systems, Object-Oriented Design, and Java. He was the architect of the first commercial business-to-business e-commerce product to use the Internet.
Mark Grand is most widely known for his best selling design pattern books. Mark has taught for U.C. Berkeley, Sun, and other organizations. He is based in the Atlanta area and has been involved with object-oriented programming and design since 1982.