http://www.developer.com/

Back to article

The Google Collections Library


March 20, 2008

Introduction

One of the things that first attracted me to Java many years ago was the inclusion of a standard collections library in the platform. At the time, in the C++ world, the STL (Standard Template Library) had yet to catch on, and developers were either left to find a collections library that they could buy and use (Rogue Wave became very popular), or more often write their own. I have lost count of how many times I implemented a linked list of something—different primitives or objects for different purposes. Then, there were the more complex collections, self-balancing binary trees, and hash tables. Although it might have been good for staying in touch with the software engineering basics, it was not so good for productivity.

Java changed all that. Even the 1.0 and 1.1 collection classes were a huge improvement, but the introduction of the Java Collections Framework with Java 1.2 was a quantum leap forwards in productivity. Since then, the standard collections have been regularly enhanced and improved, and with the addition of Generics in Java 5, the collections were updated to take advantage of those to give (at least compile-time) type checking. Doug Lea's concurrent collections (part of java.util.concurrent since Java 5) is a welcome addition as well, giving us collections like Queue and ConcurrentMap, which are ideal for use in concurrent systems.

Despite all of this, there are holes in the standard collections, items that are often re-implemented by developers, sometimes in a less-than-optimal way. There are pain points too. The cost/benefit analysis of Generics (or at least their implementation in Java) is an ongoing discussion, but whether you like them or not, they are very verbose; for example, looking back at collections in 2003 you would have seen something like:

Map mapOfLists = new HashMap();

Now, a map of things to a list of things might look more like this:

Map<String, List<String>> mapOfLists =
   new HashMap<String, List<String>>();

This is not exactly a fair comparison; the second definition carries a lot more compile time information, allowing the Java compiler to ensure that only Strings are used as keys, and Lists of Strings are used as values in the HashMap. However, you probably can notice that there is a lot of repeated information, the type signature is duplicated for the definition, and the initialization, and let's face it, it's not very pretty.

The Google Collections Library is a newly open-sourced library donated to the Java community by Google. It is intended to make some incremental improvements in usability for the existing Java collections, and add some new collections and features of its own. In this it is not alone; comparisons with the Apache Commons Collections are inevitable and the selection of a collection augmentation library is largely down to a matter of taste. This article will concentrate on the Google collections library, a library that I have used (albeit in an internal form) for over a year on a number of projects in Google very successfully. It feels like a natural extension to the Java collections framework, has been extremely reliable and performant, and frankly, the prospect of working without it on future projects is not appealing. Fortunately, because it is now available as an open source project, that should never be an issue.

Why Use the Google Collections Library?

Of course, the reasons I will give here are subjective. That said, I believe there are a number of good reasons to select the Google collections library to augment the Java collections framework:

  1. Readiness: Although the version is (currently) 0.5 alpha, this is more so that the APIs can change if necessary to improve the library as things are learned about the way it will be used externally to Google. To read into this version number and status that the libraries are excessively buggy or not ready for use yet would be incorrect; this same code has been tried and tested on many large Google projects for some time now and it is likely that most of the edge cases have been found. It also boasts 85% test coverage. Of course, this does mean that the API could change and although that may be a valid concern, it is likely that the changes will be minor and easily handled, but there is no guarantee. If this is a showstopper, keep an eye on the status to see when the API becomes more stable.
  2. Consistency: In use, both the new collections in the library, plus the new ease-of-use features, feel like a natural extension to the Java collections framework. This is not accidental—a great deal of work has gone into making the collections consistent with the behavior of the Java collections, and has been overseen by engineers who actually worked on the Java collections when they were implemented by Sun (for example, Josh Bloch). In particular, Generics are handled in a way identical to the Java collections framework.
  3. Size: The jar file for all of the new functionality is currently around 350k. This should not be a deal-breaker for most projects.
  4. Documentation: The javadocs are pretty thorough by any standards, and especially so for a library of this kind.
  5. Performance: These same collections are used in projects at Google where performance is a priority. Lots of work has gone into optimization.
  6. New functionality: The ease-of-use improvements and functional concepts included in the library are particularly attractive for systems that use collections extensively. For example, filtering results out of collections, or applying constraints.

To be fair, I should point out some potential concerns when using the library:

  1. The convenience creators are at odds with a type-inference proposal for Java 7, and could mean that in the future developers may have to convert some of their initialization code or choose to go with two standards.
  2. Furthermore, although some of these collections, or ideas from them, may make it into a JSR or two in the Java 7 or Java 8 timeframe, others may not. In other words, by using these collections now, you may have some more work to do to make them standards compliant in the future.
  3. As mentioned above, the API could still change.

How to Get It

The Google Collections Library can be downloaded from http://code.google.com/p/google-collections/ and at the time of writing this article, is at version 0.5. Indeed, given the lack of guarantees about the API stability at this time, it is possible that there will be changes in the API that might make the examples given here out of date, but it is expected that the differences will be small and hopefully obvious to fix.

To use the library in your own projects, simply include the jar file found when you unpack the downloaded archive. Javadocs and a src zip are also included, which most IDEs can tie to the jar file when you define the library. This will make using the library easier, and also help when debugging your projects.

Easy Wins

Let me start with some of the simple features, which are nonetheless among my favorites in the library. The primary one is Convenience Creators.

Now, go back to that example of the Map of things to List of things, using generics:

Map<String, List<String>> mapOfLists =
   new HashMap<String, List<String>>();

I believe this will be a pretty common sight to a lot of developers. Indeed, I have written far more complicated collection declaration and initialization type signatures than this, almost always repeating the signature on both the left and right side of the =.

For Java 7, there is a type inference proposal that would take the above and change it to something like this:

Map<String, List<String>> mapOfLists =
   new HashMap<>();

Certainly an improvement, and clearly the compiler can still get enough information from this to make some common sense assumptions and set up the new HashMap generic type signature accordingly. You will have to wait and see whether it does make it in to the language.

In the meantime, though, the Google collections library offers another alternative to save some of that verbosity:

Map<String, List<String>> mapOfLists =
   Maps.newHashMap();

The Maps.newHashMap() is a static method on a utility class called Maps from the collections library. It exploits a loophole in the way generics are implemented in Java so that the convenience creator newHashMap does not actually need to have the type signature passed in to it. These convenience creators are provided for all of the standard Java collections, and all of the new Google collections as well. It's amazing how much cleaner and more readable this simple idea can make your code.

For example:

List<String> listOfStrings =
   new ArrayList<String>();

can be replaced by:

List<String>
listOfStrings = Lists.newArrayList();

Not really related to collections per-se, but still convenient, is the join method from the Join class. Anyone who has used a scripting language with join (like Python) will get this straight away. You can create a string of items joined together from an array, collection, iterable, or varargs list. The first parameter is a delimiter that will be placed between the joined items. For example, to create a file path from a collection of subdirectories, you could just use:

String[] subdirs = new String[] {"usr", "local", "lib"};
String directory = Join.join(PATH_SEPARATOR, subdirs);

Certainly, this is not a radical new feature, but it is probably the only join method you will ever need and it saves you writing your own.

New Collections

As you would expect from a collections library, there are a number of new collections:

Multimap

Let me go back to that first example again, the map of things to a list of things, but this time I'll mix up the types a bit:

Map<Integer, List<String>> mapOfLists =
   Maps.newHashMap();

Chances are you have written something like this before. Maps of things to lists of things is a really common pattern, making it easy to collect items into identifiable buckets. If you have written this, you probably know the sort of code you have to write when you want to add a new entry into the map:

List<String> list = mapOfLists.get(key);
if (list == null) {
   list = new ArrayList<String>();
   mapOfLists.put(key, list);
}
list.add(value);

In other words, you have to ensure there is a list to receive the new value when adding that value to the map. If you are doing things correctly, you probably should also check for empty lists when removing items, and if the list is emptied, remove that as well to save resources. It's not uncommon to see a lot of this kind of boilerplate.

Multimap is an implementation (or actually a set of implementations) of this for you. In fact, there are implementations of Multimap based on ArrayLists, LinkedLists, Trees, and more.

Multimap<Integer, String> numbers =
   Multimaps.newArrayListMultiMap();
numbers.put(1, "One");
numbers.put(1, "Uno");
numbers.put(2, "Two");
numbers.put(2, "Dos");
numbers.remove(1,"One");
numbers.removeAll(2);

As you dig further into Multimap, you will find it has lots of power, and this introduction only scratches the surface. For example, if you had a map of lists and wanted to find the maximum of any of the values held in the lists, it would require a lot of iteration and comparison code. With Multimap, you could simply use Collections.max(numbers.values()). There are plenty of other ways like this where using Multimap can save you time and effort.

Multiset

Multiset is perhaps a little less commonly used than Multimap, but very useful for histograms and counting purposes:

Multiset<String> histogram =
   Multisets.newHashMultiSet();

histogram.add("Hello");
histogram.add("World", 3);
histogram.add("Hello");
histogram.add("!");

int count;
count = histogram.count("Hello");    // 2
count = histogram.count("World");    // 3
count = histogram.count("!");        // 1
count = histogram.count("Fred");     // 0

Internally, Multiset implementations use an efficient counter implementation, giving much better performance for most operations than using a Map or List. Implementations are provided based on Enums, Hashes, LinkedHashes, Trees, and more, and there's even a ConcurrentMultiset that allows fully concurrent updates with very little locking required.

BiMap

BiMap provides bi-directional map functionality. In a bidirectional map, both keys and values are unique, and looking up a key from a value is possible. For example:

BiMap<Integer, String> numbers = Maps.newHashBiMap();
numbers.put(1, "one");
numbers.put(2, "two");
numbers.put(3, "three");
numbers.put(4, "four");
numbers.put(5, "five");

String one = numbers.get(1);                     // one
int three = numbers.inverse().get("three");      // three
Integer nine = numbers.inverse().get("nine");    // null
// NullPointerException
int oops = numbers.inverse().get("nine");

Although it's certainly not unique to using the Google Collections Library, watch out for that unboxing gotcha! Even though the get("three") call here works fine, the get("nine") line will give a null pointer exception. It is valid to ask a BiMap for a key or value it doesn't have (just like with a normal map), and it will return null. The NullPointerException occurs when the compiler tries to unbox that Integer for you into an int.

PrimitiveArrays

Java supports primitive types as well as object-oriented hierarchies descending from Object. This is another one of those oft-argued features of Java, the argument being that primitives were put in for reasons of efficiency. Collections, on the other hand, deal only in Objects.

Because auto-boxing and unboxing were added to Java 5, it is possible to make it appear that a collection is actually taking and returning primitives (as long as you are aware of some of the issues, like the NullPointerException on unboxing demonstrated above), but, in fact, behind the scenes, each primitive added to a collection is being boxed before being added. This means that the internal representation is less than optimal in some circumstances.

Say that you have a million primitives in an array, but you want to access it as a collection (perhaps to use it as a parameter to a method that requires a collection and not an array). Also, suppose that you know the data is only going to be access sparsely; perhaps only a dozen elements will be pulled out of the collection randomly, but you don't know which dozen will be chosen.

Creating a collection out of the primitive array would work, but it will be slow and very wasteful of memory:

int[] intArray = new int[1000000];

// Assume it has some data put in it here...

List<Integer> intList = Lists.newArrayList(1000000);
for (int i : intArray) {
   intList.add(i);
}

System.out.println(intList.get(5));
System.out.println(intList.get(707070));

It works, but yuk!

How about:

int[] intArray = new int[1000000];
List<Integer> intList =
   PrimitiveArrays.asList(intArray);
System.out.println(intList.get(5));
System.out.println(intList.get(707070));

In this case, intList holds an implementation that holds on to the original array and only boxes the values when they are retrieved. Values can be set in the collection as well, but it does have the restriction that the value passed in cannot be null (as is possible with an object array).

The PrimitiveArrays are an easy win if you have a large primitive array and access it only sparsely, but should not be used if you have a small array with many accesses, because the autoboxing overhead is paid on each access.

ReferenceMap

The ReferenceMap is quite a specialized class, but nonetheless an extremely useful one. If you ever write a cache implementation of some kind, ReferenceMap will be your best friend.

Caches, or weak reference maps, are an easy way to shortcut work in an application. Instead of looking up an item repeatedly from a slow or expensive source like a database, once you have found it the first time, you could put it into a WeakHashMap and for future lookups you could attempt to see whether there is a match in the WeakHashMap already. Because the keys in a WeakHashMap are weakly referenced, if there are no other strong references to the key at garbage collection time, the value can be removed from the heap and the space recovered, whereupon the entry in the map simply disappears.

ReferenceMap is quite a bit more powerful than the WeakHashMap. Instead of only the key being weak referenced, ReferenceMap allows any combination of Weak, Soft, or Strong references for either the keys or the values of the map, so it should match any conceivable caching implementation need.

It is also based on ConcurrentMap, making it ideal for use in concurrent systems. In fact, a Strong:Strong instance of ReferenceMap is semantically identical to ConcurrentMap, and a Weak:Strong instance is practically a drop-in replacement for WeakHashMap. Any other combination also can be used, and if either the key or the value is reclaimed, the item is removed from ReferenceMap.

There are also Soft references. Although the specification of Soft references is itself rather soft, the intention is that a Soft reference is a little stronger than a Weak one. In the case of a Weak reference, the entry is removed whenever a garbage collection event happens and there are no strong references to the object. In the case of a Soft reference, the object may be retained until the resources are needed; in other words, it might not be garbage collected until heap space starts to run low, holding it cached for as long as possible. This is down to the implementation of the VM in question, however, and often the VMs still seem fairly eager to collect Soft references as well.

An example initialization might look like this:

new ReferenceMap<String,
   Integer>(ReferenceType.WEAK, ReferenceType.SOFT);

Immutable Collections

When the Java collections framework was created, it introduced a way to return unmodifiable collections. That is, wrappers were put around collections to prevent unintentional modification of the contents of those collections.

However, unmodifiable collections have one flaw: They can change. Even though the recipient of the unmodifiable collection cannot change the contents, the original collection on which the unmodifiable one is based can still be changed, and this affects the unmodifiable collection because they are sharing the backing collection.

In other words, a programmer might incorrectly assume that the unmodifiable collection they have cannot change. This is not true.

The Google collections library introduces Immutable collections as an alternative. These take a copy of the original collection and then make it immutable, so that a recipient can rely on that collection never changing.

For example:

final List<String> immutableList =
   Lists.immutableList("Hello", "World");
// UnsupportedOperationException!
immutableList.add("!");

And, to contrast the Java unmodifiableList with the Google Collections Library immutableList:

final List<String> baseList =
   Lists.newArrayList("Hello", "World");
final List<String> unmodifiableList =
   Collections.unmodifiableList(baseList);
final List<String> immutableList =
   Lists.immutableList(baseList);

baseList.add("!");    // modify the original list

// prints [Hello, World, !] - changed
System.out.println(unmodifiableList);
// prints [Hello, World] - unchanged
System.out.println(immutableList);

This is often a very good combination with a final reference to the immutable collection, but there is still danger! Even though the collection itself cannot be altered (in other words, elements cannot be added, removed, or replaced), any mutable objects stored in the collection can still be modified! For this reason, it is recommended that mutable values should not be stored in immutable collections, because it could cause a great deal of confusion and much of the benefit of using an immutable collection will be lost. Of course, there are always exceptions, but if you do choose to mix mutable and immutable, you had better be pretty good at documenting the reasons and consequences.

Sprinkle in Some Functional Flavor

Hopefully, by now you are already downloading the Google Collections Library, thinking it is the answer to happy and stress-free collections. But wait: There's more!

Although Java is not a functional language, it is possible to use functional concepts in at least a limited scope fairly easily, and these can lead to big readability and efficiency gains.

Say, for example, that you have a list of cars, and each car has a name and a price. You might want to separate out all of the expensive cars into a collection to examine in your program.

The conventional way to do this would be to loop through the collection of cars and check the price. If the price is above a certain limit, you could either perform some operation involving the car, or copy that car into a new collection of expensive cars.

The Google collections library offers another alternative—predicates.

Predicates

A predicate is a way to specify some selection criteria based on an instance of a class, and then that predicate can be used to selectively filter out matching instances of that class from a collection. Here is an example that might make it clearer:

final Predicate<Car> expensiveCar = new Predicate<Car>() {
   public boolean apply(Car car) {
      return car.price > 50000;
   }
}

List<Car> cars = Lists.newArrayList();
cars.add(new Car("Ford Taurus", 20000));
cars.add(new Car("Tesla", 90000));
cars.add(new Car("Toyota Camry", 25000));
cars.add(new Car("McClaren F1", 600000));

finalList<Car> premiumCars =
   Lists.immutableList(Iterables.filter(cars, expensiveCar));

In this example, you create a predicate called expensiveCar; it returns true if the price of the car is above 50000. The apply method of the Predicate has to be filled out with your selection criteria. This is then applied to the list of cars using Iterables.filter() (or Iterators.filter()) to create an immutable list of premiumCars (that will contain only the Tesla and the McLaren F1).

This is quite an elegant functional solution, although the Java syntax for inner classes takes away some of the punch (if you wanted to define the Predicate in-line, you would have to use the anonymous inner class syntax). There are several proposals for closures in Java 7 that would make the syntax much cleaner for this example. Anyway, although this is not a particular win in this simple case, as code gets more complex this solution can often be a lot more clean and readable than if statements in loops.

Functions

As well as predicates for filtering lists, Functions give you a way to transform elements from one collection into a whole new collection, potentially of differing types. This is like the map functionality from many functional languages.

For example:

final Function<Integer, String> nameForNumber =
   new Function<Integer, String>() {
   public String apply(Integer from) {
      return numbers.get(from);
   }
}

List<Integer> sequence =
   Lists.newArrayList(new Integer[] {1,2,3,5,3,1,4});

for (String name : Iterables.transform(sequence,
   nameForNumber)) {
   System.out.println(name);
}

The generic type signature to the new Function you are creating indicates that your function takes an Integer and converts it to a String in some fashion. The apply method conforms to these generic types, taking an Integer from value, and returning a String.

In this example, you are using the numbers BiMap you created earlier, to map the Integers to their String names. You then transform an ArrayList of Integers into their string names, using the Iterables.transform function. Very easy, very elegant.

This example illustrates the use of Functions nicely, but there is actually an easier way of doing it. The Functions class defines a number of static convenience methods that provide commonly used functions, one of these is forMap(), which takes a map and creates a function to apply that map to a collection, so in the case above you could replace the Function inner class definition with:

Function<Integer, String> nameForNumber =
   Functions.forMap(numbers);

There are similar classes called Predicates and Constraints (coming up) with pre-defined convenience definitions of common predicates and constraints.

Constraints

The third functional programming concept that the library provides is Constraints. These provide extra controls over what values can be added to a collection. Constraints are undergoing a lot of change in the library at present, so rather than go into a detailed example, I will simply mention that they are there and can be used now, but that if you use them, there is a good chance you will need to update that code when upgrading to future versions of the Google Collections Library.

Status

The 0.5 version belies the maturity and reliability of this library, but it is good to remember that it is not complete yet. In particular, even though the library works with both Java 5 and Java 6, it does not yet take advantage of all Java 6 features (such as Navigable interfaces). Later versions will take advantage of the Java 6 features, but a version for Java 5 will continue to be maintained until Java 7 comes out.

Conclusion

The Google collections library has the potential to both increase your productivity and significantly clean up your code. I hope I have laid out the decision points clearly in this article for anyone thinking of using it on a project. Because the library is open source, the risks are relatively small, but the Java 7 standards may diverge from what this library does now, meaning that in the future some refactoring might be required to bring the code back into standards compliance. The alternatives would be to use another collection library, write your own, or just work with the standard libraries, and all of these may have similar problems long term.

Certainly, I have found the Google Collections Library to save me a lot of time and improve my code, and I will continue to use it for work and personal projects whenever possible.

While this article was being reviewed by various members of the Google Collections Library team, I discovered that a new version of the library is in the works. We decided to go ahead and release this article anyway, because there is no firm date for the next version yet, but I will come back and revisit any changes in a future article once the next release is available.

Thanks

I approached writing this article with a great deal of care. My intention is to spread the word about a great library, without trying to take credit for anything beyond working on an introductory article for it. As such, I would like to thank all of the great engineers at Google who have put in the work to create and improve the library over the last few years. I did originally try and pull out some names from the source code, but this does nothing to recognize the significant effort of those who improved the libraries rather than wrote the first version. Suffice to say that many Google engineers have been involved in creating and improving this library.

I would also like to specifically thank Kevin Bourrillion and Jared Levy for their work in preparing the library to be open-sourced, making sure that the open-sourcing of the library actually happened and their continuing commitment to improve it.

About the Author

Dick Wall is a developer advocate for Java technologies at Google, based in Mountain View, part of the Google Developer Program. He also co-hosts the Java Posse podcast, a regular Java-centric news and interviews show that can be found at http://javaposse.com.

Sitemap | Contact Us

Thanks for your registration, follow us on our social networks to keep up-to-date