Meet Lucene Part 2
1.6 Understanding the core searching classes
The basic search interface that Lucene provides is as straightforward as the one for indexing. Only a few classes are needed to perform the basic search operation: IndexSearcher, Term, Query, TermQuery, and Hits.
The following sections provide a brief introduction to these classes. We'll expand on these explanations in the chapters that follow, before we dive into more advanced topics.
    IndexSearcher is = new IndexSearcher(
        FSDirectory.getDirectory("/tmp/index", false));
    Query q = new TermQuery(new Term("contents", "lucene"));
    Hits hits = is.search(q);
We cover the details of IndexSearcher in chapter 3, along with more advanced information in chapters 5 and 6.
A Term is the basic unit for searching. Similar to the Field object, it consists of a pair of string elements: the name of the field and the value of that field. Note that Term objects are also involved in the indexing process. However, they're created by Lucene's internals, so you typically don't need to think about them while indexing. During searching, you may construct Term objects and use them together with TermQuery:
    Query q = new TermQuery(new Term("contents", "lucene"));
    Hits hits = is.search(q);
This code instructs Lucene to find all documents that contain the word lucene in a field named contents. Because the TermQuery object is derived from the abstract parent class Query, you can use the Query type on the left side of the statement.
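To make the idea of a term lookup concrete, here is a minimal, self-contained sketch of what "find all documents that contain the word lucene in a field named contents" amounts to. It is not Lucene code and does not reflect Lucene's internals: the class ToyTermIndex, its methods, and the whitespace-and-lowercase tokenization are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy class, not part of Lucene: maps (field, text) pairs to document ids.
class ToyTermIndex {
    // Key is "field:text"; value is the sorted set of ids of documents containing that term.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Lowercases the field value, splits it on whitespace, and records docId under each token.
    void add(int docId, String field, String value) {
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by leading whitespace
            }
            postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of documents whose given field contains exactly this text, in ascending order.
    List<Integer> search(String field, String text) {
        TreeSet<Integer> ids = postings.get(field + ":" + text);
        if (ids == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        index.add(0, "contents", "Lucene in Action");
        index.add(1, "contents", "Managing Gigabytes");
        index.add(2, "contents", "Meet Lucene");
        System.out.println(index.search("contents", "lucene")); // prints [0, 2]
        System.out.println(index.search("title", "lucene"));    // prints []
    }
}
```

Both halves of the pair matter: the same word stored under a different field name is a different term, which is why the second lookup comes back empty.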
Lucene comes with a number of concrete Query subclasses. So far in this chapter we've mentioned only the most basic Lucene Query: TermQuery. Other Query types are BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery. All of these are covered in chapter 3. Query is the common, abstract parent class. It contains several utility methods, the most interesting of which is setBoost(float), described in section 3.5.9.
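The relationship between the abstract parent and its concrete subclasses can be sketched with a few toy classes. Everything here is hypothetical and only mirrors the shape of the hierarchy described above: ToyQuery, ToyTermQuery, and ToyPrefixQuery are invented names, and the scoring rule (a match scores exactly the boost, a miss scores zero) is a deliberate simplification, not Lucene's scoring formula.

```java
import java.util.Locale;

// Demo entry point; all classes in this sketch are hypothetical, not Lucene's.
class ToyQueryDemo {
    public static void main(String[] args) {
        ToyQuery exact = new ToyTermQuery("lucene");
        ToyQuery prefix = new ToyPrefixQuery("luc");
        prefix.setBoost(2.0f); // boost lives on the parent, so every subclass inherits it
        System.out.println(exact.score("Meet Lucene"));  // prints 1.0
        System.out.println(prefix.score("Meet Lucene")); // prints 2.0
        System.out.println(exact.score("Meet Xapian"));  // prints 0.0
    }
}

// Common abstract parent: holds the boost and leaves matching to subclasses.
abstract class ToyQuery {
    private float boost = 1.0f; // neutral weight by default

    void setBoost(float boost) { this.boost = boost; }

    float getBoost() { return boost; }

    // Each concrete query type decides what counts as a match.
    abstract boolean matches(String fieldValue);

    // Assumption for illustration: a match scores exactly the boost, a miss scores 0.
    float score(String fieldValue) { return matches(fieldValue) ? boost : 0.0f; }

    // Shared helper: lowercase the value and split it on whitespace.
    static String[] tokens(String fieldValue) {
        return fieldValue.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }
}

// Matches when some token equals the query text exactly.
class ToyTermQuery extends ToyQuery {
    private final String text;

    ToyTermQuery(String text) { this.text = text; }

    @Override
    boolean matches(String fieldValue) {
        for (String token : tokens(fieldValue)) {
            if (token.equals(text)) {
                return true;
            }
        }
        return false;
    }
}

// Matches when some token starts with the query prefix.
class ToyPrefixQuery extends ToyQuery {
    private final String prefix;

    ToyPrefixQuery(String prefix) { this.prefix = prefix; }

    @Override
    boolean matches(String fieldValue) {
        for (String token : tokens(fieldValue)) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Because callers hold only the parent type, search code written against it works unchanged with any query subclass.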
TermQuery is the most basic type of query supported by Lucene, and it's one of the primitive query types. It's used for matching documents that contain fields with specific values, as you've seen in the last few paragraphs.
The Hits class is a simple container of pointers to ranked search results, that is, documents that match a given query. For performance reasons, a Hits instance doesn't load every matching document from the index at once; it loads only a small portion of them at a time. Chapter 3 describes this in more detail.
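The load-a-portion-at-a-time behavior can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not Lucene's Hits implementation: LazyHits, its batchSize parameter, and the in-memory map that plays the role of the index are all hypothetical, and the text above does not say how large a portion Lucene actually loads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy class, not Lucene's Hits: keeps ranked document ids and
// fetches the documents themselves in small batches, only when they are requested.
class LazyHits {
    private final List<Integer> rankedIds;    // ids of all matches, best first
    private final Map<Integer, String> store; // stands in for the on-disk index
    private final int batchSize;              // assumed batch size, chosen by the caller
    private final List<String> loaded = new ArrayList<>(); // documents fetched so far

    LazyHits(List<Integer> rankedIds, Map<Integer, String> store, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.rankedIds = rankedIds;
        this.store = store;
        this.batchSize = batchSize;
    }

    // Total number of matches, known without loading any document.
    int length() { return rankedIds.size(); }

    // Number of documents actually fetched so far.
    int loadedCount() { return loaded.size(); }

    // Returns the n-th ranked document, fetching further batches only if needed.
    String doc(int n) {
        if (n < 0 || n >= rankedIds.size()) {
            throw new IndexOutOfBoundsException("no hit at position " + n);
        }
        while (loaded.size() <= n) {
            int end = Math.min(loaded.size() + batchSize, rankedIds.size());
            for (int i = loaded.size(); i < end; i++) {
                loaded.add(store.get(rankedIds.get(i)));
            }
        }
        return loaded.get(n);
    }

    public static void main(String[] args) {
        Map<Integer, String> store = new HashMap<>();
        for (int i = 0; i < 5; i++) {
            store.put(i, "doc" + i);
        }
        LazyHits hits = new LazyHits(Arrays.asList(4, 2, 0, 1, 3), store, 2);
        System.out.println(hits.length());      // prints 5
        System.out.println(hits.loadedCount()); // prints 0
        System.out.println(hits.doc(0));        // prints doc4
        System.out.println(hits.loadedCount()); // prints 2
    }
}
```

A user who only looks at the first page of results never pays for loading the rest, which is the point of the design.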
1.7 Review of alternate search products
Before you select Lucene as your IR library of choice, you may want to review other solutions in the same domain. We did some research into alternate products that you may want to consider and evaluate; this section summarizes our findings. We group these products in two major categories:
- Information Retrieval libraries
- Indexing and searching applications
The first group is smaller; it consists of full-text indexing and searching libraries similar to Lucene. Products in this group let you embed them in your application, as shown earlier in figure 1.5.
The second, larger group is made up of ready-to-use indexing and searching software. This software is typically designed to index and search a particular type of data, such as web pages, and is less flexible than software in the former group. However, some of these products also expose their lower-level API, so you can sometimes use them as IR libraries as well.
1.7.1 IR libraries
In our research for this chapter, we found two IR libraries—Egothor and Xapian—that offer a comparable set of features and are aimed at roughly the same audience: developers. We also found MG4J, which isn't an IR library but is rather a set of tools useful for building an IR library; we think developers working with IR ought to know about it. Here are our reviews of all three products.
A full-text indexing and searching Java library, Egothor uses core algorithms that are very similar to those used by Lucene. It has been in existence for several years and has a small but active developer and user community. Its lead developer is Leo Galambos, a Czech PhD student with a solid academic background in the field of IR. He sometimes participates in Lucene's user and developer mailing list discussions.
Egothor supports an extended Boolean model, which allows it to function as both the pure Boolean model and the Vector model. You can tune which model to use via a simple query-time parameter. This software features a number of different query types, supports similar search syntax, and allows multithreaded querying, which can come in handy if you're working on a multi-CPU computer or searching remote indices.
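How a single parameter can move a model between Boolean and vector-like behavior is easiest to see in the classical p-norm formulation of the extended Boolean model (Salton, Fox, and Wu). The sketch below implements that textbook formulation only. Whether Egothor uses exactly these formulas, and whether its query-time parameter corresponds to p, are assumptions; the class and method names are hypothetical.

```java
// Hypothetical sketch of the textbook p-norm extended Boolean model.
// Term weights are expected to lie in [0, 1]; p must be at least 1.
class PNormSketch {
    // OR-combination: grows toward the largest weight as p increases.
    static double or(double p, double... weights) {
        double sum = 0.0;
        for (double w : weights) {
            sum += Math.pow(w, p);
        }
        return Math.pow(sum / weights.length, 1.0 / p);
    }

    // AND-combination: shrinks toward the smallest weight as p increases.
    static double and(double p, double... weights) {
        double sum = 0.0;
        for (double w : weights) {
            sum += Math.pow(1.0 - w, p);
        }
        return 1.0 - Math.pow(sum / weights.length, 1.0 / p);
    }

    public static void main(String[] args) {
        // A document that fully matches one query term and misses the other.
        double present = 1.0;
        double absent = 0.0;

        // p = 1: AND and OR coincide and reduce to a plain average (vector-like behavior).
        System.out.println(or(1.0, present, absent));  // prints 0.5
        System.out.println(and(1.0, present, absent)); // prints 0.5

        // Large p: OR approaches 1 and AND approaches 0 (strict Boolean behavior).
        System.out.println(or(100.0, present, absent) > 0.99);  // prints true
        System.out.println(and(100.0, present, absent) < 0.01); // prints true
    }
}
```

With p = 1 the operators no longer distinguish AND from OR and documents are ranked by their average term weight; as p grows, the same query behaves more and more like a strict Boolean filter.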
The Egothor distribution comes with several ready-to-use applications, such as a web crawler called Capek, a file indexer with a Swing GUI, and more. It also provides parsers for several rich-text document formats, such as PDF and Microsoft Word documents. As such, Egothor and Capek are comparable to the Lucene/Nutch combination, and Egothor's file indexer and document parsers are similar to the small document parsing and indexing framework presented in chapter 7 of this book.
Free, open source, and released under a BSD-like license, the Egothor project is comparable to Lucene in most aspects. If you have yet to choose a full-text indexing and searching library, you may want to evaluate Egothor in addition to Lucene. Egothor's home page is at http://www.egothor.org/; as of this writing, it features a demo of its web crawler and search functionality.
Xapian is a Probabilistic Information Retrieval library written in C++ and released under GPL. This project (or, rather, its predecessors) has an interesting history: The company that developed and owned it went through more than half a dozen acquisitions, name changes, shifts in focus, and such.
Xapian is actively developed software. It's currently at version 0.8.3, but it has a long history behind it and is based on decades of experience in the IR field. Its web site, http://www.xapian.org/, shows that it has a rich set of features, much like Lucene. It supports a wide range of queries and has a query parser that supports human-friendly search syntax; stemmers based on Dr. Martin Porter's Snowball project; parsers for several rich-document types; bindings for Perl, Python, PHP, and (soon) Java; remote index searching; and so on.
In addition to providing an IR library, Xapian comes with a web site search application called Omega, which you can download separately.
Although MG4J (Managing Gigabytes for Java) isn't an IR library like Lucene, Egothor, and Xapian, we believe that every software engineer reading this book should be aware of it because it provides low-level support for building Java IR libraries. MG4J is named after a popular IR book, Managing Gigabytes: Compressing and Indexing Documents and Images, written by Ian H. Witten, Alistair Moffat, and Timothy C. Bell. After collecting large amounts of web data with their distributed, fault-tolerant web crawler called UbiCrawler, MG4J's authors needed software capable of analyzing the collected data; out of that need, MG4J was born.
The library provides optimized classes for manipulating I/O, inverted index compression, and more. The project home page is at http://mg4j.dsi.unimi.it/; the library is free, open source, released under LGPL, and currently at version 0.8.2.