- Understanding Lucene
- Using the basic indexing API
- Working with the search API
- Considering alternative products
One of the key factors behind Lucene’s popularity and success is its simplicity. The careful exposure of its indexing and searching API is a sign of well-designed software. Consequently, you don’t need in-depth knowledge of how Lucene’s information indexing and retrieval work in order to start using it. Moreover, Lucene’s straightforward API requires you to learn how to use only a handful of its classes.
In this chapter [From the Manning book Lucene in Action], we show you how to perform basic indexing and searching with Lucene with ready-to-use code examples. We then briefly introduce all the core elements you need to know for both of these processes. We also provide brief reviews of competing Java/non-Java, free, and commercial products.
1.1 Evolution of information organization and access
In order to make sense of the perceived complexity of the world, humans have invented categorizations, classifications, genuses, species, and other types of hierarchical organizational schemes. The Dewey decimal system for categorizing items in a library collection is a classic example of a hierarchical categorization scheme. The explosion of the Internet and electronic data repositories has brought large amounts of information within our reach. Some companies, such as Yahoo!, have made organization and classification of online data their business. With time, however, the amount of data available has become so vast that we need alternative, more dynamic ways of finding information. Although we can classify data, trawling through hundreds or thousands of categories and sub-categories of data is no longer an efficient method for finding information.
The need to quickly locate information in the sea of data isn’t limited to the Internet realm—desktop computers can store increasingly more data. Changing directories and expanding and collapsing hierarchies of folders isn’t an effective way to access stored documents. Furthermore, we no longer use computers just for their raw computing abilities: They also serve as multimedia players and media storage devices. Those uses for computers require the ability to quickly find a specific piece of data; what’s more, we need to make rich media—such as images, video, and audio files in various formats—easy to locate.
With this abundance of information, and with time being one of the most precious commodities for most people, we need to be able to make flexible, free-form, ad-hoc queries that can quickly cut across rigid category boundaries and find exactly what we’re after while requiring the least effort possible.
To illustrate the pervasiveness of searching across the Internet and the desktop, figure 1.1 shows a search for lucene at Google. The figure includes a context menu that lets us use Google to search for the highlighted text. Figure 1.2 shows the Apple Mac OS X Finder (the counterpart to Microsoft’s Explorer on Windows) and the search feature embedded at upper right. The Mac OS X music player, iTunes, also has embedded search capabilities, as shown in figure 1.3.
Figure 1.1 Convergence of Internet searching with Google and the web browser.
Search functionality is everywhere! All major operating systems have embedded searching. The most recent innovation is the Spotlight feature (http://www.apple.com/macosx/tiger/spotlighttech.html) announced by Steve Jobs for the next version of Mac OS X (nicknamed Tiger); it integrates indexing and searching across all file types including rich metadata specific to each type of file, such as emails, contacts, and more.1
Figure 1.2 Mac OS X Finder with its embedded search capability.
Figure 1.3 Apple’s iTunes intuitively embeds search functionality.
Figure 1.4 Microsoft’s newly acquired Lookout product, using Lucene.Net underneath.
Google has gone IPO. Microsoft has released a beta version of its MSN search engine; on a potentially related note, Microsoft acquired Lookout, a product leveraging the Lucene.Net port of Lucene to index and search Microsoft Outlook email and personal folders (as shown in figure 1.4). Yahoo! purchased Overture and is beefing up its custom search capabilities.
To understand what role Lucene plays in search, let’s start from the basics and learn about what Lucene is and how it can help you with your search needs.
1.2 Understanding Lucene
Different people are fighting the same problem—information overload—using different approaches. Some have been working on novel user interfaces, some on intelligent agents, and others on developing sophisticated search tools like Lucene. Before we jump into action with code samples later in this chapter, we’ll give you a high-level picture of what Lucene is, what it is not, and how it came to be.
1.2.1 What Lucene is
Lucene is a high-performance, scalable Information Retrieval (IR) library. It lets you add indexing and searching capabilities to your applications. Lucene is a mature, free, open-source project implemented in Java; it’s a member of the popular Apache Jakarta family of projects, licensed under the liberal Apache Software License. As such, Lucene is currently, and has been for a few years, the most popular free Java IR library.
Note: Throughout the book, we’ll use the term Information Retrieval (IR) to describe search tools like Lucene. People often refer to IR libraries as search engines, but you shouldn’t confuse IR libraries with web search engines.
As you’ll soon discover, Lucene provides a simple yet powerful core API that requires minimal understanding of full-text indexing and searching. You need to learn about only a handful of its classes in order to start integrating Lucene into an application. Because Lucene is a Java library, it doesn’t make assumptions about what it indexes and searches, which gives it an advantage over a number of other search applications.
People new to Lucene often mistake it for a ready-to-use application like a file-search program, a web crawler, or a web site search engine. That isn’t what Lucene is: Lucene is a software library, a toolkit if you will, not a full-featured search application. It concerns itself with text indexing and searching, and it does those things very well. Lucene lets your application deal with business rules specific to its problem domain while hiding the complexity of indexing and searching implementation behind a simple-to-use API. You can think of Lucene as a layer that applications sit on top of, as depicted in figure 1.5.
A number of full-featured search applications have been built on top of Lucene. If you’re looking for something prebuilt or a framework for crawling, document handling, and searching, consult the Lucene Wiki “powered by” page (http://wiki.apache.org/jakarta-lucene/PoweredBy) for many options: Zilverline, SearchBlox, Nutch, LARM, and jSearch, to name a few. Case studies of both Nutch and SearchBlox are included in chapter 10.
1.2.2 What Lucene can do for you
Lucene allows you to add indexing and searching capabilities to your applications (these functions are described in section 1.3). Lucene can index and make searchable any data that can be converted to a textual format. As you can see in figure 1.5, Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information.
Figure 1.5 A typical application integration with Lucene.
Similarly, with Lucene’s help you can index data stored in your databases, giving your users full-text search capabilities that many databases don’t provide. Once you integrate Lucene, users of your applications can make searches such as +George +Rice -eat -pudding, Apple pie +Tiger, animal:monkey AND food:banana, and so on. With Lucene, you can index and search email messages, mailing-list archives, instant messenger chats, your Wiki pages … the list goes on.
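To give a feel for what queries such as +George +Rice -eat -pudding mean, here is a minimal sketch of the plus/minus convention. It is a toy matcher written for this illustration, not Lucene’s QueryParser: the class name PlusMinusQuerySketch and its matches method are hypothetical, it ignores field queries such as animal:monkey, and it assumes the simplified rules that a + term must appear, a - term must not appear, and bare terms are optional unless no + term is present, in which case at least one of them must appear.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy matcher for +required / -prohibited / optional terms (hypothetical, not Lucene's QueryParser). */
final class PlusMinusQuerySketch {

    /** Returns true if the text satisfies the query under the simplified rules in the lead-in. */
    static boolean matches(String query, String text) {
        // Reduce the document to a set of lowercase words.
        Set<String> words = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
        boolean hasRequired = false;
        boolean optionalHit = false;
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (token.startsWith("+")) {
                // A required term that is missing rules the document out.
                hasRequired = true;
                if (!words.contains(normalize(token.substring(1)))) {
                    return false;
                }
            } else if (token.startsWith("-")) {
                // A prohibited term that is present rules the document out.
                if (words.contains(normalize(token.substring(1)))) {
                    return false;
                }
            } else if (words.contains(normalize(token))) {
                optionalHit = true;
            }
        }
        // With no required terms, at least one optional term has to match.
        return hasRequired || optionalHit;
    }

    private static String normalize(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(matches("+George +Rice -eat -pudding", "George Rice likes cake"));
        System.out.println(matches("+George +Rice -eat -pudding", "George Rice will eat pudding"));
        System.out.println(matches("Apple pie +Tiger", "Tiger is the next release"));
    }
}
```

The real query syntax is considerably richer than this; the sketch only shows why the first sample document would match and the second would not.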
1.2.3 History of Lucene
Lucene was originally written by Doug Cutting;2 it was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation’s Jakarta family of high-quality open source Java products in September 2001. With each release since then, the project has enjoyed increased visibility, attracting more users and developers. As of July 2004, Lucene version 1.4 has been released, with a bug fix 1.4.2 release in early October. Table 1.1 shows Lucene’s release history.
Table 1.1 Lucene’s release history
| Version | Release date | Milestones |
|---|---|---|
| 0.01 | March 2000 | First open source release (SourceForge) |
| 1.01b | July 2001 | Last SourceForge release |
| 1.2 | June 2002 | First Apache Jakarta release |
| 1.3 | December 2003 | Compound index format, QueryParser enhancements, remote searching, token positioning, extensible scoring API |
| 1.4 | July 2004 | Sorting, span queries, term vectors |
| 1.4.1 | August 2004 | Bug fix for sorting performance |
| 1.4.2 | October 2004 | IndexSearcher optimization and misc. fixes |
| 1.4.3 | Winter 2004 | Misc. fixes |
Note: Lucene’s creator, Doug Cutting, has significant theoretical and practical experience in the field of IR. He’s published a number of research papers on various IR topics and has worked for companies such as Excite, Apple, and Grand Central. Most recently, worried about the decreasing number of web search engines and a potential monopoly in that realm, he created Nutch, the first open-source World-Wide Web search engine (http://www.nutch.org); it’s designed to handle crawling, indexing, and searching of several billion frequently updated web pages. Not surprisingly, Lucene is at the core of Nutch; section 10.1 includes a case study of how Nutch leverages Lucene.
Doug Cutting remains the main force behind Lucene, but more bright minds have joined the project since Lucene’s move under the Apache Jakarta umbrella. At the time of this writing, Lucene’s core team includes about half a dozen active developers, two of whom are authors of this book. In addition to the official project developers, Lucene has a fairly large and active technical user community that frequently contributes patches, bug fixes, and new features.
1.2.4 Who uses Lucene
Who doesn’t? In addition to those organizations mentioned on the Powered by Lucene page on Lucene’s Wiki, a number of other large, well-known, multinational organizations are using Lucene. It provides searching capabilities for the Eclipse IDE, the Encyclopedia Britannica CD-ROM/DVD, FedEx, the Mayo Clinic, Hewlett-Packard, New Scientist magazine, Epiphany, MIT’s OpenCourseware and DSpace, Akamai’s EdgeComputing platform, and so on. Your name will be on this list soon, too.
1.2.5 Lucene ports: Perl, Python, C++, .NET, Ruby
One way to judge the success of open source software is by the number of times it’s been ported to other programming languages. Using this metric, Lucene is quite a success! Although the original Lucene is written in Java, as of this writing Lucene has been ported to Perl, Python, C++, and .NET, and some groundwork has been done to port it to Ruby. This is excellent news for developers who need to access Lucene indices from applications written in different languages. You can learn more about some of these ports in chapter 9.
1.3 Indexing and searching
At the heart of all search engines is the concept of indexing: processing the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching. Let’s take a quick high-level look at both the indexing and searching processes.
1.3.1 What is indexing, and why is it important?
Suppose you needed to search a large number of files, and you wanted to be able to find files that contained a certain word or a phrase. How would you go about writing a program to do this? A naïve approach would be to sequentially scan each file for the given word or phrase. This approach has a number of flaws, the most obvious of which is that it doesn’t scale to larger file sets or cases where files are very large. This is where indexing comes in: To search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index.
You can think of an index as a data structure that allows fast random access to words stored inside it. The concept behind it is analogous to an index at the end of a book, which lets you quickly locate pages that discuss certain topics. In the case of Lucene, an index is a specially designed data structure, typically stored on the file system as a set of index files. We cover the structure of index files in detail in appendix B, but for now just think of a Lucene index as a tool that allows quick word lookup.
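The idea of fast word lookup can be sketched with a toy inverted index: a map from each word to the set of documents that contain it. This is only an illustration of the concept, not Lucene’s actual index format (which appendix B covers); the class name InvertedIndexSketch, its add and lookup methods, and the sample file contents are all made up for this example.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index (hypothetical): maps each word to the names of the documents containing it. */
final class InvertedIndexSketch {
    private final Map<String, Set<String>> postings = new TreeMap<>();

    /** Splits the text into lowercase words and records the document under each word. */
    void add(String docName, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    /** Looks the word up directly in the map instead of rescanning every document. */
    Set<String> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("README.txt", "Lucene is a search library");
        index.add("todo.txt", "Write more search tests");
        System.out.println(index.lookup("search"));
        System.out.println(index.lookup("lucene"));
    }
}
```

The work of reading every document happens once, in add; each later lookup is a single map access, which is the trade-off that makes indexing worthwhile for large collections.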
1.3.2 What is searching?
Searching is the process of looking up words in an index to find documents where they appear. The quality of a search is typically described using precision and recall metrics. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out the irrelevant documents. However, you must consider a number of other factors when thinking about searching. We already mentioned speed and the ability to quickly search large quantities of text. Support for single and multiterm queries, phrase queries, wildcards, result ranking, and sorting are also important, as is a friendly syntax for entering those queries. Lucene’s powerful software library offers a number of search features, bells, and whistles—so many that we had to spread our search coverage over three chapters (chapters 3, 5, and 6).
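Precision and recall are easy to compute once you know which documents a search returned and which ones were actually relevant. The sketch below uses the standard definitions (precision is the share of returned documents that are relevant; recall is the share of relevant documents that were returned). The class name SearchQualitySketch and the document names are hypothetical, and returning 0 for an empty set is a convention chosen for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Computes precision and recall for one search result (hypothetical helper, not part of Lucene). */
final class SearchQualitySketch {

    /** Fraction of the returned documents that are relevant. */
    static double precision(Set<String> returned, Set<String> relevant) {
        if (returned.isEmpty()) {
            return 0.0; // convention for this sketch: nothing returned means precision 0
        }
        return (double) overlap(returned, relevant) / returned.size();
    }

    /** Fraction of the relevant documents that were returned. */
    static double recall(Set<String> returned, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // convention for this sketch: nothing relevant means recall 0
        }
        return (double) overlap(returned, relevant) / relevant.size();
    }

    private static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        Set<String> returned = new HashSet<>(Arrays.asList("a.txt", "b.txt", "c.txt", "d.txt"));
        Set<String> relevant = new HashSet<>(Arrays.asList("a.txt", "b.txt", "e.txt"));
        // Two of the four returned documents are relevant; two of the three relevant ones were found.
        System.out.println("precision = " + precision(returned, relevant));
        System.out.println("recall    = " + recall(returned, relevant));
    }
}
```

In this made-up case the search returns four documents, two of which are relevant, so precision is 2/4; three documents are relevant overall, so recall is 2/3.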
1.4 Lucene in action: a sample application
Let’s see Lucene in action. To do that, recall the problem of indexing and searching files, which we described in section 1.3.1. Furthermore, suppose you need to index and search files stored in a directory tree, not just in a single directory. To show you Lucene’s indexing and searching capabilities, we’ll use a pair of command-line applications: Indexer and Searcher. First we’ll index a directory tree containing text files; then we’ll search the created index.
These example applications will familiarize you with Lucene’s API, its ease of use, and its power. The code listings are complete, ready-to-use command-line programs. If file indexing/searching is the problem you need to solve, then you can copy the code listings and tweak them to suit your needs. In the chapters that follow, we’ll describe each aspect of Lucene’s use in much greater detail.
Before we can search with Lucene, we need to build an index, so we start with our Indexer application.
1.4.1 Creating an index
In this section you’ll see a single class called Indexer and its four static methods; together, they recursively traverse file system directories and index all files with a .txt extension. When Indexer completes execution it leaves behind a Lucene index for its sibling, Searcher (presented in section 1.4.2).
We don’t expect you to be familiar with the few Lucene classes and methods used in this example—we’ll explain them shortly. After the annotated code listing, we show you how to use Indexer; if it helps you to learn how Indexer is used before you see how it’s coded, go directly to the usage discussion that follows the code.
Using Indexer to index text files
Listing 1.1 shows the Indexer command-line program. It takes two arguments:
- A path to a directory where we store the Lucene index
- A path to a directory that contains the files we want to index
Listing 1.1 Indexer: traverses a file system and indexes .txt files
Interestingly, the bulk of the code performs recursive directory traversal (2). Only the creation and closing of the IndexWriter (1) and four lines in the indexFile method ( (3) (4) (5) ) of Indexer involve the Lucene API—effectively six lines of code.
This example intentionally focuses on text files with .txt extensions to keep things simple while demonstrating Lucene’s usage and power. In chapter 7, we’ll show you how to handle nontext files, and we’ll develop a small ready-to-use framework capable of parsing and indexing documents in several common formats.
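Since most of Indexer is directory traversal rather than Lucene code, that part can be sketched on its own. The following is not the book’s listing 1.1; it is a minimal stand-in, with the hypothetical class name TxtFileWalkerSketch, that recursively collects files ending in .txt and marks the spot where a real indexer would hand each file to Lucene.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Recursively collects .txt files under a directory (hypothetical stand-in, no Lucene calls). */
final class TxtFileWalkerSketch {

    /** Returns every file below dir whose name ends with ".txt". */
    static List<File> findTextFiles(File dir) {
        List<File> found = new ArrayList<>();
        collect(dir, found);
        return found;
    }

    private static void collect(File dir, List<File> found) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or it could not be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                collect(entry, found); // descend into subdirectories
            } else if (entry.getName().endsWith(".txt")) {
                found.add(entry);
            }
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File file : findTextFiles(root)) {
            // A real indexer would pass the file to Lucene at this point.
            System.out.println("Found " + file.getPath());
        }
    }
}
```

Keeping traversal separate from indexing in this way also makes it easier to swap the .txt check for another filter later.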
From the command line, we ran Indexer against a local working directory including Lucene’s own source code. We instructed Indexer to index files under the /lucene directory and store the Lucene index in the build/index directory:
% java lia.meetlucene.Indexer build/index /lucene
Indexing /lucene/build/test/TestDoc/test.txt
Indexing /lucene/build/test/TestDoc/test2.txt
Indexing /lucene/BUILD.txt
Indexing /lucene/CHANGES.txt
Indexing /lucene/LICENSE.txt
Indexing /lucene/README.txt
Indexing /lucene/src/jsp/README.txt
Indexing /lucene/src/test/org/apache/lucene/analysis/ru/stemsUnicode.txt
Indexing /lucene/src/test/org/apache/lucene/analysis/ru/test1251.txt
Indexing /lucene/src/test/org/apache/lucene/analysis/ru/testKOI8.txt
Indexing /lucene/src/test/org/apache/lucene/analysis/ru/testUnicode.txt
Indexing /lucene/src/test/org/apache/lucene/analysis/ru/wordsUnicode.txt
Indexing /lucene/todo.txt
Indexing 13 files took 2205 milliseconds
Indexer prints out the names of files it indexes, so you can see that it indexes only files with the .txt extension.
Note: If you’re running this application on a Windows platform command shell, you need to adjust the command line’s directory and path separators. The Windows command line is java lia.meetlucene.Indexer build\index c:\lucene.
When it completes indexing, Indexer prints out the number of files it indexed and the time it took to do so. Because the reported time includes both file-directory traversal and indexing, you shouldn’t consider it an official performance measure. In our example, each of the indexed files was small, but roughly two seconds to index a handful of text files is reasonably impressive.
Indexing speed is a concern, and we cover it in chapter 2. But generally, searching is of even greater importance.
1.4.2 Searching an index
Searching in Lucene is as fast and simple as indexing; the power of this functionality is astonishing, as chapters 3 and 5 will show you. For now, let’s look at Searcher, a command-line program that we’ll use to search the index created by Indexer. (Keep in mind that our Searcher serves the purpose of demonstrating the use of Lucene’s search API. Your search application could also take the form of a web or desktop application with a GUI, an EJB, and so on.)
In the previous section, we indexed a directory of text files. The index, in this example, resides in a directory of its own on the file system. We instructed Indexer to create a Lucene index in a build/index directory, relative to the directory from which we invoked Indexer. As you saw in listing 1.1, this index contains the indexed files and their absolute paths. Now we need to use Lucene to search that index in order to find files that contain a specific piece of text. For instance, we may want to find all files that contain the keyword java or lucene, or we may want to find files that include the phrase “system requirements”.
Using Searcher to implement a search
The Searcher program complements Indexer and provides command-line searching capability. Listing 1.2 shows Searcher in its entirety. It takes two command-line arguments:
- The path to the index created with Indexer
- A query to use to search the index
Listing 1.2 Searcher: searches a Lucene index for a query passed as an argument.
Searcher, like its Indexer sibling, has only a few lines of code dealing with Lucene. A couple of special things occur in the search method:
Editor’s note: The following numbered steps refer to the numbers in Listing 1.2.
- We use Lucene’s IndexSearcher and FSDirectory classes to open our index for searching.
- We use QueryParser to parse a human-readable query into Lucene’s Query class.
- Searching returns hits in the form of a Hits object.
- Note that the Hits object contains only references to the underlying documents. In other words, instead of being loaded immediately upon search, matches are loaded from the index in a lazy fashion—only when requested with the hits.doc(int) call.
Let’s run Searcher and find some documents in our index using the query 'lucene':
% java lia.meetlucene.Searcher build/index 'lucene'
Found 6 document(s) (in 66 milliseconds) that matched query 'lucene':
/lucene/README.txt
/lucene/src/jsp/README.txt
/lucene/BUILD.txt
/lucene/todo.txt
/lucene/LICENSE.txt
/lucene/CHANGES.txt
The output shows that 6 of the 13 documents we indexed with Indexer contain the word lucene and that the search took a meager 66 milliseconds. Because Indexer stores files’ absolute paths in the index, Searcher can print them out. It’s worth noting that storing the file path as a field was our decision and appropriate in this case, but from Lucene’s perspective it’s arbitrary meta-data attached to indexed documents.
Of course, you can use more sophisticated queries, such as 'lucene AND doug' or 'lucene AND NOT slow' or '+lucene +book', and so on. Chapters 3, 5, and 6 cover all different aspects of searching, including Lucene’s query syntax.
Using the xargs utility
The Searcher class is a simplistic demo of Lucene’s search features. As such, it only dumps matches to the standard output. However, Searcher has one more trick up its sleeve. Imagine that you need to find files that contain a certain keyword or phrase, and then you want to process the matching files in some way. To keep things simple, let’s imagine that you want to list each matching file using the ls UNIX command, perhaps to see the file size, permission bits, or owner. By having matching document paths written unadorned to the standard output, and having the statistical output written to standard error, you can use the nifty UNIX xargs utility to process the matched files, as shown here:
% java lia.meetlucene.Searcher build/index 'lucene AND NOT slow' | xargs ls -l
Found 6 document(s) (in 131 milliseconds) that matched query 'lucene AND NOT slow':
-rw-r--r-- 1 erik staff 4215 10 Sep 21:51 /lucene/BUILD.txt
-rw-r--r-- 1 erik staff 17889 28 Dec 10:53 /lucene/CHANGES.txt
-rw-r--r-- 1 erik staff 2670 4 Nov 2001 /lucene/LICENSE.txt
-rw-r--r-- 1 erik staff 683 4 Nov 2001 /lucene/README.txt
-rw-r--r-- 1 erik staff 370 26 Jan 2002 /lucene/src/jsp/README.txt
-rw-r--r-- 1 erik staff 943 18 Sep 21:27 /lucene/todo.txt
In this example, we chose the Boolean query 'lucene AND NOT slow', which finds all files that contain the word lucene and don’t contain the word slow. This query took 131 milliseconds and found 6 matching files. We piped Searcher’s output to the xargs command, which in turn used the ls -l command to list each matching file. In a similar fashion, the matched files could be copied, concatenated, emailed, or dumped to standard output.3
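The trick that makes the xargs pipeline work is the split between the two output streams: matching paths go to standard output, and the statistics line goes to standard error, so only the paths travel down the pipe. Here is a minimal sketch of that split. It is not the book’s Searcher; the class name PipeFriendlyOutputSketch, the report method, and the sample values in main are assumptions made for this illustration.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.List;

/** Writes match paths to one stream and statistics to another (hypothetical, no Lucene calls). */
final class PipeFriendlyOutputSketch {

    /** Prints the summary line to err and one unadorned path per line to out. */
    static void report(List<String> paths, String query, long millis,
                       PrintStream out, PrintStream err) {
        err.println("Found " + paths.size() + " document(s) (in " + millis
                + " milliseconds) that matched query '" + query + "':");
        for (String path : paths) {
            out.println(path); // only these lines reach a downstream command such as xargs
        }
    }

    public static void main(String[] args) {
        // Sample values for illustration only.
        List<String> paths = Arrays.asList("/lucene/README.txt", "/lucene/BUILD.txt");
        report(paths, "lucene", 66, System.out, System.err);
    }
}
```

Passing the streams in as parameters, rather than writing to System.out and System.err directly, is a design choice made here so the behavior is easy to check in isolation.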
Our example indexing and searching applications demonstrate Lucene in a lot of its glory. Its API usage is simple and unobtrusive. The bulk of the code (and this applies to all applications interacting with Lucene) is plumbing relating to the business purpose: in this case, Indexer’s file system crawler that looks for text files and Searcher’s code that prints matched filenames based on a query to the standard output. But don’t let this fact, or the conciseness of the examples, tempt you into complacence: There is a lot going on under the covers of Lucene, and we’ve used quite a few best practices that come from experience. To effectively leverage Lucene, it’s important to understand more about how it works and how to extend it when the need arises. The remainder of this book is dedicated to giving you these missing pieces.
More to Come
The rest of this sample chapter will appear on our website starting March 31st.
1 Erik freely admits to his fondness of all things Apple.
2 Lucene is Doug’s wife’s middle name; it’s also her maternal grandmother’s first name.
3 Neal Stephenson details this process nicely in “In the Beginning Was the Command Line”: http://www.cryptonomicon.com/beginning.html.
About the Authors
Erik Hatcher codes, writes, and speaks on technical topics that he finds fun and challenging. He has written software for a number of diverse industries using many different technologies and languages. Erik coauthored Java Development with Ant (Manning, 2002) with Steve Loughran, a book that has received wonderful industry acclaim. Since the release of Erik’s first book, he has spoken at numerous venues including the No Fluff, Just Stuff symposium circuit, JavaOne, O’Reilly’s Open Source Convention, the Open Source Content Management Conference, and many Java User Group meetings. As an Apache Software Foundation member, he is an active contributor and committer on several Apache projects including Lucene, Ant, and Tapestry. Erik currently works at the University of Virginia’s Humanities department supporting Applied Research in Patacriticism.
Otis Gospodnetic has been an active Lucene developer for four years and maintains the jGuru Lucene FAQ. He is a Software Engineer at Wireless Generation, a company that develops technology solutions for educational assessments of students and teachers. In his spare time, he develops Simpy, a Personal Web Service that uses Lucene, which he created out of his passion for knowledge, information retrieval, and management. Previous technical publications include several articles about Lucene, published by O’Reilly Network and IBM developerWorks. Otis also wrote To Choose and Be Chosen: Pursuing Education in America, a guidebook for foreigners wishing to study in the United States; it’s based on his own experience.
About the Book
Lucene in Action by Erik Hatcher and Otis Gospodnetic
Foreword by Doug Cutting, the inventor of Lucene
Published December 2004, Softbound, 456 pages
Published by Manning Publications Co.
Retail price: $44.95
Ebook price: $22.50. To purchase the ebook go to http://www.manning.com/hatcher2.
This material is from Chapter 1 of the book.