Meet Lucene, Page 3
1.2.4 Who uses Lucene
Who doesn't? In addition to those organizations mentioned on the Powered by Lucene page on Lucene's Wiki, a number of other large, well-known, multinational organizations are using Lucene. It provides searching capabilities for the Eclipse IDE, the Encyclopedia Britannica CD-ROM/DVD, FedEx, the Mayo Clinic, Hewlett-Packard, New Scientist magazine, Epiphany, MIT's OpenCourseware and DSpace, Akamai's EdgeComputing platform, and so on. Your name will be on this list soon, too.
1.2.5 Lucene ports: Perl, Python, C++, .NET, Ruby
One way to judge the success of open source software is by the number of times it's been ported to other programming languages. Using this metric, Lucene is quite a success! Although the original Lucene is written in Java, as of this writing Lucene has been ported to Perl, Python, C++, and .NET, and some groundwork has been done to port it to Ruby. This is excellent news for developers who need to access Lucene indices from applications written in different languages. You can learn more about some of these ports in chapter 9.
1.3 Indexing and searching
At the heart of all search engines is the concept of indexing: processing the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching. Let's take a quick high-level look at both the indexing and searching processes.
1.3.1 What is indexing, and why is it important?
Suppose you needed to search a large number of files, and you wanted to be able to find files that contained a certain word or a phrase. How would you go about writing a program to do this? A naïve approach would be to sequentially scan each file for the given word or phrase. This approach has a number of flaws, the most obvious of which is that it doesn't scale to larger file sets or cases where files are very large. This is where indexing comes in: To search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index.
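To make the cost of the naïve approach concrete, here is a minimal sketch of a sequential scan in plain Java. It uses no Lucene classes; the class name NaiveScan and the method filesContaining are hypothetical names chosen for illustration. Every query rereads every file in full, which is why the running time grows with the total amount of text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the naive approach: no index, just a full scan.
final class NaiveScan {

    /** Returns the files whose text contains the phrase, reading every file in full. */
    static List<Path> filesContaining(List<Path> files, String phrase) throws IOException {
        List<Path> hits = new ArrayList<>();
        for (Path file : files) {
            // The whole file is loaded and searched on every single query.
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            if (text.contains(phrase)) {
                hits.add(file);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Two small throwaway files stand in for a real file set.
        Path dir = Files.createTempDirectory("naive-scan");
        Path a = Files.write(dir.resolve("a.txt"),
                "lucene is a search library".getBytes(StandardCharsets.UTF_8));
        Path b = Files.write(dir.resolve("b.txt"),
                "nothing to see here".getBytes(StandardCharsets.UTF_8));
        System.out.println(filesContaining(Arrays.asList(a, b), "search library"));
    }
}
```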
You can think of an index as a data structure that allows fast random access to words stored inside it. The concept behind it is analogous to an index at the end of a book, which lets you quickly locate pages that discuss certain topics. In the case of Lucene, an index is a specially designed data structure, typically stored on the file system as a set of index files. We cover the structure of index files in detail in appendix B, but for now just think of a Lucene index as a tool that allows quick word lookup.
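The idea of quick word lookup can be sketched as a map from each word to the documents that contain it. This is only an in-memory illustration under simplifying assumptions (lowercasing, splitting on non-word characters); it is not Lucene's index format, and the class name TinyInvertedIndex and its methods are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: word -> names of the documents that contain it.
final class TinyInvertedIndex {

    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexing: split the text into lowercase words and record where each occurs. */
    void add(String docName, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    /** Searching: a single map lookup, with no rescanning of the original text. */
    Set<String> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("README.txt", "Lucene is a search library");
        index.add("todo.txt", "Improve search speed");
        System.out.println(index.lookup("search")); // prints [README.txt, todo.txt]
        System.out.println(index.lookup("lucene")); // prints [README.txt]
    }
}
```

The work of reading and splitting the text is paid once, at indexing time; each later lookup touches only the map.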
1.3.2 What is searching?
Searching is the process of looking up words in an index to find documents where they appear. The quality of a search is typically described using precision and recall metrics. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out the irrelevant documents. However, you must consider a number of other factors when thinking about searching. We already mentioned speed and the ability to quickly search large quantities of text. Support for single and multiterm queries, phrase queries, wildcards, result ranking, and sorting are also important, as is a friendly syntax for entering those queries. Lucene's powerful software library offers a number of search features, bells, and whistles—so many that we had to spread our search coverage over three chapters (chapters 3, 5, and 6).
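Precision and recall are commonly computed as ratios over sets of documents: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The sketch below assumes those standard definitions; the class name SearchQuality, the document names, and the choice to return 0.0 for an empty set are illustrative assumptions, not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for the standard set-based precision and recall ratios.
final class SearchQuality {

    /** Fraction of the retrieved documents that are relevant. */
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) overlap(retrieved, relevant) / retrieved.size();
    }

    /** Fraction of the relevant documents that were retrieved. */
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) overlap(retrieved, relevant) / relevant.size();
    }

    private static int overlap(Set<String> retrieved, Set<String> relevant) {
        Set<String> both = new HashSet<>(retrieved);
        both.retainAll(relevant);
        return both.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = new HashSet<>(Arrays.asList("a", "b", "c", "d"));
        Set<String> relevant = new HashSet<>(Arrays.asList("a", "b", "e", "f", "g"));
        System.out.println(precision(retrieved, relevant)); // 2 of 4 retrieved are relevant: prints 0.5
        System.out.println(recall(retrieved, relevant));    // 2 of 5 relevant were found: prints 0.4
    }
}
```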
1.4 Lucene in action: a sample application
Let's see Lucene in action. To do that, recall the problem of indexing and searching files, which we described in section 1.3.1. Furthermore, suppose you need to index and search files stored in a directory tree, not just in a single directory. To show you Lucene's indexing and searching capabilities, we'll use a pair of command-line applications: Indexer and Searcher. First we'll index a directory tree containing text files; then we'll search the created index.
These example applications will familiarize you with Lucene's API, its ease of use, and its power. The code listings are complete, ready-to-use command-line programs. If file indexing/searching is the problem you need to solve, then you can copy the code listings and tweak them to suit your needs. In the chapters that follow, we'll describe each aspect of Lucene's use in much greater detail.
Before we can search with Lucene, we need to build an index, so we start with our Indexer application.
1.4.1 Creating an index
In this section you'll see a single class called Indexer and its four static methods; together, they recursively traverse file system directories and index all files with a .txt extension. When Indexer completes execution it leaves behind a Lucene index for its sibling, Searcher (presented in section 1.4.2).
We don't expect you to be familiar with the few Lucene classes and methods used in this example—we'll explain them shortly. After the annotated code listing, we show you how to use Indexer; if it helps you to learn how Indexer is used before you see how it's coded, go directly to the usage discussion that follows the code.
Using Indexer to index text files
Listing 1.1 shows the Indexer command-line program. It takes two arguments:
- A path to a directory where we store the Lucene index
- A path to a directory that contains the files we want to index
Listing 1.1 Indexer: traverses a file system and indexes .txt files
Interestingly, the bulk of the code performs recursive directory traversal (2). Only the creation and closing of the IndexWriter (1) and four lines in the indexFile method (3, 4, 5) of Indexer involve the Lucene API: effectively six lines of code.
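The traversal half of such a program can be sketched without any Lucene calls. The following is a minimal sketch and not the book's Listing 1.1: the class name TxtFileWalker and its methods are hypothetical, and the spot where a real indexer would hand each file to Lucene is marked only by a comment. The timing in main is an assumption about how a combined traversal-and-indexing time could be measured.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of recursive traversal that collects .txt files only.
final class TxtFileWalker {

    /** Returns every file under dir (at any depth) whose name ends in .txt. */
    static List<File> findTxtFiles(File dir) {
        List<File> found = new ArrayList<>();
        collect(dir, found);
        return found;
    }

    private static void collect(File dir, List<File> found) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or it could not be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                collect(entry, found); // recurse into subdirectories
            } else if (entry.getName().endsWith(".txt")) {
                // A real indexer would pass this file to the Lucene API here.
                found.add(entry);
            }
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        long start = System.currentTimeMillis();
        List<File> files = findTxtFiles(root);
        for (File file : files) {
            System.out.println("Indexing " + file.getPath());
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("Indexing " + files.size() + " files took " + elapsed + " milliseconds");
    }
}
```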
This example intentionally focuses on text files with .txt extensions to keep things simple while demonstrating Lucene's usage and power. In chapter 7, we'll show you how to handle nontext files, and we'll develop a small ready-to-use framework capable of parsing and indexing documents in several common formats.
From the command line, we ran Indexer against a local working directory including Lucene's own source code. We instructed Indexer to index files under the /lucene directory and store the Lucene index in the build/index directory:
% java lia.meetlucene.Indexer build/index /lucene
Indexing /lucene/build/test/TestDoc/test.txt
Indexing /lucene/build/test/TestDoc/test2.txt
Indexing /lucene/BUILD.txt
Indexing /lucene/CHANGES.txt
Indexing /lucene/LICENSE.txt
Indexing /lucene/README.txt
Indexing /lucene/src/jsp/README.txt
Indexing /lucene/src/test/org/apache/lucene/analysis/ru/stemsUnicode.txt
Indexing /lucene/src/test/org/apache/lucene/analysis/ru/test1251.txt
Indexing /lucene/src/test/org/apache/lucene/analysis/ru/testKOI8.txt
Indexing /lucene/src/test/org/apache/lucene/analysis/ru/testUnicode.txt
Indexing /lucene/src/test/org/apache/lucene/analysis/ru/wordsUnicode.txt
Indexing /lucene/todo.txt
Indexing 13 files took 2205 milliseconds
Indexer prints out the names of files it indexes, so you can see that it indexes only files with the .txt extension.
Note: If you're running this application in a Windows command shell, you need to adjust the command line's directory and path separators. The Windows command line is java lia.meetlucene.Indexer build\index c:\lucene.
When it completes indexing, Indexer prints out the number of files it indexed and the time it took to do so. Because the reported time includes both file-directory traversal and indexing, you shouldn't consider it an official performance measure. In our example, each of the indexed files was small, but roughly two seconds (2205 milliseconds) to index 13 text files is reasonably impressive.
Indexing speed is a concern, and we cover it in chapter 2. But generally, searching is of even greater importance.