September 16, 2014
Hot Topics:
RSS RSS feed Download our iPhone app

Meet Lucene

  • March 17, 2005
  • By Otis Gospodnetic and Erik Hatcher
  • Send Email »
  • More Articles »

1.2.1 What Lucene is

Lucene is a high performance, scalable Information Retrieval (IR) library. It lets you add indexing and searching capabilities to your applications. Lucene is a mature, free, open-source project implemented in Java; it's a member of the popular Apache Jakarta family of projects, licensed under the liberal Apache Software License. As such, Lucene is currently, and has been for a few years, the most popular free Java IR library.

Note: Throughout the book, we'll use the term Information Retrieval (IR) to describe search tools like Lucene. People often refer to IR libraries as search engines, but you shouldn't confuse IR libraries with web search engines.

As you'll soon discover, Lucene provides a simple yet powerful core API that requires minimal understanding of full-text indexing and searching. You need to learn about only a handful of its classes in order to start integrating Lucene into an application. Because Lucene is a Java library, it doesn't make assumptions about what it indexes and searches, which gives it an advantage over a number of other search applications.

People new to Lucene often mistake it for a ready-to-use application like a file-search program, a web crawler, or a web site search engine. That isn't what Lucene is: Lucene is a software library, a toolkit if you will, not a full-featured search application. It concerns itself with text indexing and searching, and it does those things very well. Lucene lets your application deal with business rules specific to its problem domain while hiding the complexity of indexing and searching implementation behind a simple-to-use API. You can think of Lucene as a layer that applications sit on top of, as depicted in figure 1.5.

A number of full-featured search applications have been built on top of Lucene. If you're looking for something prebuilt or a framework for crawling, document handling, and searching, consult the Lucene Wiki "powered by" page (http://wiki.apache.org/jakarta-lucene/PoweredBy) for many options: Zilverline, SearchBlox, Nutch, LARM, and jSearch, to name a few. Case studies of both Nutch and SearchBlox are included in chapter 10.

1.2.2 What Lucene can do for you

Lucene allows you to add indexing and searching capabilities to your applications (these functions are described in section 1.3). Lucene can index and make searchable any data that can be converted to a textual format. As you can see in figure 1.5, Lucene doesn't care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information.



Click here for a larger image.

Figure 1.5 A typical application integration with Lucene.

Similarly, with Lucene's help you can index data stored in your databases, giving your users full-text search capabilities that many databases don't provide. Once you integrate Lucene, users of your applications can make searches such as +George +Rice -eat -pudding, Apple pie +Tiger, animal:monkey AND food:banana, and so on. With Lucene, you can index and search email messages, mailing-list archives, instant messenger chats, your Wiki pages ... the list goes on.

1.2.3 History of Lucene

Lucene was originally written by Doug Cutting;2 it was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation's Jakarta family of high-quality open source Java products in September 2001. With each release since then, the project has enjoyed increased visibility, attracting more users and developers. As of July 2004, Lucene version 1.4 has been released, with a bug fix 1.4.2 release in early October. Table 1.1 shows Lucene's release history.

Table 1.1 Lucene's release history

VersionRelease dateMilestones
0.01March 2000First open source release (SourceForge)
1.0October 2000 
1.01bJuly 2001Last SourceForge release
1.2June 2002First Apache Jakarta release
1.3December 2003Compound index format, QueryParser enhancements, remote searching, token positioning, extensible scoring API
1.4July 2004Sorting, span queries, term vectors
1.4.1August 2004Bug fix for sorting performance
1.4.2October 2004IndexSearcher optimization and misc. fixes
1.4.3Winter 2004Misc. fixes
Note: Lucene's creator, Doug Cutting, has significant theoretical and practical experience in the field of IR. He's published a number of research papers on various IR topics and has worked for companies such as Excite, Apple, and Grand Central. Most recently, worried about the decreasing number of web search engines and a potential monopoly in that realm, he created Nutch, the first open-source World-Wide Web search engine (http://www.nutch.org); it's designed to handle crawling, indexing, and searching of several billion frequently updated web pages. Not surprisingly, Lucene is at the core of Nutch; section 10.1 includes a case study of how Nutch leverages Lucene.

Doug Cutting remains the main force behind Lucene, but more bright minds have joined the project since Lucene's move under the Apache Jakarta umbrella. At the time of this writing, Lucene's core team includes about half a dozen active developers, two of whom are authors of this book. In addition to the official project developers, Lucene has a fairly large and active technical user community that frequently contributes patches, bug fixes, and new features.





Page 2 of 5



Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 


Sitemap | Contact Us

Rocket Fuel