- Understanding Lucene
- Using the basic indexing API
- Working with the search API
- Considering alternative products
One of the key factors behind Lucene's popularity and success is its simplicity. The careful exposure of its indexing and searching API is a sign of well-designed software. Consequently, you don't need in-depth knowledge of how Lucene's information indexing and retrieval work in order to start using it. Moreover, Lucene's straightforward API requires you to learn only a handful of its classes.
In this chapter [From the Manning book Lucene in Action], we show you how to perform basic indexing and searching with Lucene, using ready-to-use code examples. We then briefly introduce all the core elements you need to know for both of these processes. We also provide brief reviews of competing products, both Java and non-Java, free and commercial.
1.1 Evolution of information organization and access
The need to quickly locate information in the sea of data isn't limited to the Internet realm: desktop computers store ever more data. Changing directories and expanding and collapsing hierarchies of folders isn't an effective way to access stored documents. Furthermore, we no longer use computers just for their raw computing abilities: they also serve as multimedia players and media storage devices. Those uses require the ability to quickly find a specific piece of data; what's more, we need to make rich media (such as images, video, and audio files in various formats) easy to locate.
With this abundance of information, and with time being one of the most precious commodities for most people, we need to be able to make flexible, free-form, ad-hoc queries that can quickly cut across rigid category boundaries and find exactly what we're after while requiring the least effort possible.
To illustrate the pervasiveness of searching across the Internet and the desktop, figure 1.1 shows a search for lucene at Google. The figure includes a context menu that lets us use Google to search for the highlighted text. Figure 1.2 shows the Apple Mac OS X Finder (the counterpart to Microsoft's Explorer on Windows) and the search feature embedded at upper right. The Mac OS X music player, iTunes, also has embedded search capabilities, as shown in figure 1.3.
Figure 1.1 Convergence of Internet searching with Google and the web browser.
Search functionality is everywhere! All major operating systems have embedded searching. The most recent innovation is the Spotlight feature (http://www.apple.com/macosx/tiger/spotlighttech.html) announced by Steve Jobs for the next version of Mac OS X (nicknamed Tiger); it integrates indexing and searching across all file types, including rich metadata specific to each type of file, such as emails, contacts, and more.1
Figure 1.2 Mac OS X Finder with its embedded search capability.
Figure 1.3 Apple's iTunes intuitively embeds search functionality.
Figure 1.4 Microsoft's newly acquired Lookout product, using Lucene.Net underneath.
Google has gone public. Microsoft has released a beta version of its MSN search engine; on a potentially related note, Microsoft acquired Lookout, a product that uses the Lucene.Net port of Lucene to index and search Microsoft Outlook email and personal folders (as shown in figure 1.4). Yahoo! purchased Overture and is beefing up its custom search capabilities.
To understand what role Lucene plays in search, let's start from the basics and learn about what Lucene is and how it can help you with your search needs.
1.2 Understanding Lucene
Different people are fighting the same problem—information overload—using different approaches. Some have been working on novel user interfaces, some on intelligent agents, and others on developing sophisticated search tools like Lucene. Before we jump into action with code samples later in this chapter, we'll give you a high-level picture of what Lucene is, what it is not, and how it came to be.