Meet Lucene
1.2.1 What Lucene is
Lucene is a high-performance, scalable Information Retrieval (IR) library. It lets you add indexing and searching capabilities to your applications. Lucene is a mature, free, open-source project implemented in Java; it's a member of the popular Apache Jakarta family of projects, licensed under the liberal Apache Software License. Lucene is currently, and has been for a few years, the most popular free Java IR library.
Note: Throughout the book, we'll use the term Information Retrieval (IR) to describe search tools like Lucene. People often refer to IR libraries as search engines, but you shouldn't confuse IR libraries with web search engines.
As you'll soon discover, Lucene provides a simple yet powerful core API that requires minimal understanding of full-text indexing and searching. You need to learn about only a handful of its classes in order to start integrating Lucene into an application. Because Lucene is a Java library, it doesn't make assumptions about what it indexes and searches, which gives it an advantage over a number of other search applications.
People new to Lucene often mistake it for a ready-to-use application like a file-search program, a web crawler, or a web site search engine. That isn't what Lucene is: Lucene is a software library, a toolkit if you will, not a full-featured search application. It concerns itself with text indexing and searching, and it does those things very well. Lucene lets your application deal with business rules specific to its problem domain while hiding the complexity of indexing and searching implementation behind a simple-to-use API. You can think of Lucene as a layer that applications sit on top of, as depicted in figure 1.5.
A number of full-featured search applications have been built on top of Lucene. If you're looking for something prebuilt or a framework for crawling, document handling, and searching, consult the Lucene Wiki "powered by" page (http://wiki.apache.org/jakarta-lucene/PoweredBy) for many options: Zilverline, SearchBlox, Nutch, LARM, and jSearch, to name a few. Case studies of both Nutch and SearchBlox are included in chapter 10.
1.2.2 What Lucene can do for you
Lucene allows you to add indexing and searching capabilities to your applications (these functions are described in section 1.3). Lucene can index and make searchable any data that can be converted to a textual format. As you can see in figure 1.5, Lucene doesn't care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information.
Figure 1.5 A typical application integration with Lucene.
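To make the idea of indexing anything that can be converted to text more concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene's API or implementation; the class name ToyIndex, its methods, and the document IDs are hypothetical and chosen only for illustration. The application is assumed to have already extracted plain text from each source (a PDF, a web page, a database row) before handing it to the index.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index (illustrative only, not Lucene): maps each term to the documents containing it. */
final class ToyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexes text the caller has already extracted from a file, web page, database row, and so on. */
    void add(String docId, String text) {
        // Lowercase the text and split on anything that is not a letter or digit.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns a sorted copy of the IDs of all documents that contain the term. */
    Set<String> search(String term) {
        return new TreeSet<>(postings.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of()));
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        // The index never sees the original format, only the extracted text.
        index.add("report.pdf", "Quarterly sales report");
        index.add("index.html", "Sales team home page");
        index.add("db:row42", "Customer complaint about late delivery");
        System.out.println(index.search("sales")); // prints [index.html, report.pdf]
    }
}
```

The sketch shows only the division of labor: the application decides where text comes from and how to extract it, and the index deals with nothing but terms and document identifiers. A real IR library adds analysis, scoring, and persistent storage on top of this basic structure.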
Similarly, with Lucene's help you can index data stored in your databases, giving your users full-text search capabilities that many databases don't provide. Once you integrate Lucene, users of your applications can make searches such as +George +Rice -eat -pudding; Apple pie +Tiger; animal:monkey AND food:banana; and so on. With Lucene, you can index and search email messages, mailing-list archives, instant messenger chats, your Wiki pages ... the list goes on.
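In the sample searches above, a leading + marks a term that must appear in a matching document, a leading - marks a term that must not appear, and unmarked terms are optional. The following sketch illustrates that behavior in plain Java. It is a simplification under stated assumptions and is not Lucene's query parser: the class ToyQuery and its matches method are hypothetical, it ignores field prefixes such as animal: and operators such as AND, and it reports only whether a text matches, with no relevance scoring.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Toy matcher for +required, -prohibited, and optional terms (illustrative only, not Lucene). */
final class ToyQuery {

    /** Returns true if the text satisfies the query's required, prohibited, and optional terms. */
    static boolean matches(String query, String text) {
        Set<String> words = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")));
        words.remove(""); // split may yield an empty leading token

        boolean hasRequired = false;
        boolean optionalHit = false;
        for (String clause : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            if (clause.startsWith("+")) {
                hasRequired = true;
                if (!words.contains(clause.substring(1))) {
                    return false; // a required term is missing
                }
            } else if (clause.startsWith("-")) {
                if (words.contains(clause.substring(1))) {
                    return false; // a prohibited term is present
                }
            } else if (words.contains(clause)) {
                optionalHit = true;
            }
        }
        // With no required terms, at least one optional term must match.
        return hasRequired || optionalHit;
    }

    public static void main(String[] args) {
        System.out.println(matches("+George +Rice -eat -pudding", "George Rice plays cricket")); // prints true
        System.out.println(matches("+George +Rice -eat -pudding", "George and Rice eat lunch")); // prints false
        System.out.println(matches("Apple pie +Tiger", "Tiger cub"));                            // prints true
    }
}
```

In this sketch, optional terms matter only when the query has no required terms. In a full IR library they would typically also influence how highly a matching document is ranked.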
1.2.3 History of Lucene
Lucene was originally written by Doug Cutting; it was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation's Jakarta family of high-quality open source Java products in September 2001. With each release since then, the project has enjoyed increased visibility, attracting more users and developers. Lucene version 1.4 was released in July 2004, followed by a bug-fix release, 1.4.2, in early October. Table 1.1 shows Lucene's release history.
Table 1.1 Lucene's release history
|Version||Release date||Milestones|
|0.01||March 2000||First open source release (SourceForge)|
|1.01b||July 2001||Last SourceForge release|
|1.2||June 2002||First Apache Jakarta release|
|1.3||December 2003||Compound index format, QueryParser enhancements, remote searching, token positioning, extensible scoring API|
|1.4||July 2004||Sorting, span queries, term vectors|
|1.4.1||August 2004||Bug fix for sorting performance|
|1.4.2||October 2004||IndexSearcher optimization and misc. fixes|
|1.4.3||Winter 2004||Misc. fixes|
Note: Lucene's creator, Doug Cutting, has significant theoretical and practical experience in the field of IR. He's published a number of research papers on various IR topics and has worked for companies such as Excite, Apple, and Grand Central. Most recently, worried about the decreasing number of web search engines and a potential monopoly in that realm, he created Nutch, the first open-source World-Wide Web search engine (http://www.nutch.org); it's designed to handle crawling, indexing, and searching of several billion frequently updated web pages. Not surprisingly, Lucene is at the core of Nutch; section 10.1 includes a case study of how Nutch leverages Lucene.
Doug Cutting remains the main force behind Lucene, but more bright minds have joined the project since Lucene's move under the Apache Jakarta umbrella. At the time of this writing, Lucene's core team includes about half a dozen active developers, two of whom are authors of this book. In addition to the official project developers, Lucene has a fairly large and active technical user community that frequently contributes patches, bug fixes, and new features.