Meet Lucene

  • By Otis Gospodnetic and Erik Hatcher
  • Understanding Lucene
  • Using the basic indexing API
  • Working with the search API
  • Considering alternative products

One of the key factors behind Lucene's popularity and success is its simplicity. The careful exposure of its indexing and searching API is a sign of well-designed software. Consequently, you don't need in-depth knowledge of how Lucene's information indexing and retrieval work in order to start using it. Moreover, Lucene's straightforward API requires you to learn only a handful of its classes.

In this chapter [From the Manning book Lucene in Action], we show you how to perform basic indexing and searching with Lucene with ready-to-use code examples. We then briefly introduce all the core elements you need to know for both of these processes. We also provide brief reviews of competing Java/non-Java, free, and commercial products.

1.1 Evolution of information organization and access

In order to make sense of the perceived complexity of the world, humans have invented categorizations, classifications, genera, species, and other types of hierarchical organizational schemes. The Dewey decimal system for categorizing items in a library collection is a classic example of a hierarchical categorization scheme. The explosion of the Internet and electronic data repositories has brought large amounts of information within our reach. Some companies, such as Yahoo!, have made organization and classification of online data their business. With time, however, the amount of data available has become so vast that we need alternative, more dynamic ways of finding information. Although we can classify data, trawling through hundreds or thousands of categories and sub-categories of data is no longer an efficient method for finding information.

The need to quickly locate information in the sea of data isn't limited to the Internet realm—desktop computers can store increasingly more data. Changing directories and expanding and collapsing hierarchies of folders isn't an effective way to access stored documents. Furthermore, we no longer use computers just for their raw computing abilities: They also serve as multimedia players and media storage devices. Those uses for computers require the ability to quickly find a specific piece of data; what's more, we need to make rich media—such as images, video, and audio files in various formats—easy to locate.

With this abundance of information, and with time being one of the most precious commodities for most people, we need to be able to make flexible, free-form, ad-hoc queries that can quickly cut across rigid category boundaries and find exactly what we're after while requiring the least effort possible.

To illustrate the pervasiveness of searching across the Internet and the desktop, figure 1.1 shows a search for lucene at Google. The figure includes a context menu that lets us use Google to search for the highlighted text. Figure 1.2 shows the Apple Mac OS X Finder (the counterpart to Microsoft's Explorer on Windows) and the search feature embedded at upper right. The Mac OS X music player, iTunes, also has embedded search capabilities, as shown in figure 1.3.

Figure 1.1 Convergence of Internet searching with Google and the web browser.

Search functionality is everywhere! All major operating systems have embedded searching. The most recent innovation is the Spotlight feature (http://www.apple.com/macosx/tiger/spotlighttech.html), announced by Steve Jobs for the next version of Mac OS X (nicknamed Tiger); it integrates indexing and searching across all file types, including rich metadata specific to each type of file, such as emails, contacts, and more.

Figure 1.2 Mac OS X Finder with its embedded search capability.

Figure 1.3 Apple's iTunes intuitively embeds search functionality.

Figure 1.4 Microsoft's newly acquired Lookout product, using Lucene.Net underneath.

Google has gone IPO. Microsoft has released a beta version of its MSN search engine; on a potentially related note, Microsoft acquired Lookout, a product leveraging the Lucene.Net port of Lucene to index and search Microsoft Outlook email and personal folders (as shown in figure 1.4). Yahoo! purchased Overture and is beefing up its custom search capabilities.

To understand what role Lucene plays in search, let's start from the basics and learn about what Lucene is and how it can help you with your search needs.

1.2 Understanding Lucene

Different people are fighting the same problem—information overload—using different approaches. Some have been working on novel user interfaces, some on intelligent agents, and others on developing sophisticated search tools like Lucene. Before we jump into action with code samples later in this chapter, we'll give you a high-level picture of what Lucene is, what it is not, and how it came to be.

This article was originally published on March 17, 2005
