Over the years I’ve written a number of articles about SQL joins, including most recently the article Choosing the Right SQL Join. But what if your data wasn’t stored in a database at all, and was instead spread across HTML, PDF, or Microsoft Word files, or worse, a combination of multiple file formats? Or consider a project to create a specialized search engine that combs the most prominent corporate and research websites associated with the pharmaceutical or aeronautical industry. Indexing and searching data of this nature is no small feat, yet it remains a commonplace computing task.
Thankfully, as is often the case with common yet complex computing problems, chances are an open source project exists that lowers at least some of the implementation barriers. As it happens, a fantastic open source indexing and search library named Lucene offers developers a powerful tool for sifting through extraordinarily large data repositories while managing to be relatively simple to implement (particularly in a small office environment). In this article, I’ll introduce you to Lucene, explaining how to install and configure it so you can begin searching data in entirely new ways.
Installing Lucene
Lucene is written in Java (although quite a few ports to other languages exist), so it is supported on all of the major operating systems, Windows included. To install Lucene, download the latest stable binary version and extract the download to a convenient location. You’ll also want to install Apache Ant, which we’ll use to build the demo bundled with the Lucene download. You also have the option of downloading the Lucene source and building it using Ant; if you choose this approach, be sure to read the BUILD.txt file bundled with the source download.
With Lucene downloaded and Ant installed, you’ll next need to add two JAR files to your CLASSPATH: lucene-core-3.2.0.jar, which resides in the Lucene download’s root directory, and lucene-demo-3.2.0.jar, which resides in the Lucene download’s contrib/demo directory. Exactly how you go about modifying the CLASSPATH variable is operating system-specific, so be sure to consult the Java documentation if you are unsure how to do so.
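If you’d like to confirm that both JAR files were picked up before running the demo, a small check along the following lines can help. This is only a sketch: the class name ClasspathCheck and the helper isOnClasspath are my own, and I’m assuming that org.apache.lucene.index.IndexWriter is a reasonable stand-in for the core JAR, while org.apache.lucene.demo.IndexFiles is the demo class used later in this article.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // Look the class up without running its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: one representative class per JAR is enough for a quick sanity check.
        String[] required = {
            "org.apache.lucene.index.IndexWriter", // expected in lucene-core-3.2.0.jar
            "org.apache.lucene.demo.IndexFiles"    // expected in lucene-demo-3.2.0.jar
        };
        for (String className : required) {
            String status = isOnClasspath(className) ? "found" : "MISSING";
            System.out.println(status + ": " + className);
        }
    }
}
```

If either class is reported as missing, revisit your CLASSPATH setting before moving on.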
Running the Lucene Indexing and Searching Demo
With the CLASSPATH modified, execute the following command, pointing the -docs argument to the directory you’d like to index.
$ java org.apache.lucene.demo.IndexFiles -docs /PATH/TO/FILE/REPOSITORY
Indexing to directory 'index'...
adding /home/wjgilmore/Downloads/zf1112/INSTALL.txt
adding /home/wjgilmore/Downloads/zf1112/LICENSE.txt
adding /home/wjgilmore/Downloads/zf1112/README.txt
...
adding /home/wjgilmore/Downloads/zf1112/library/Zend/Text/Table/Column.php
adding /home/wjgilmore/Downloads/zf1112/bin/zf
adding /home/wjgilmore/Downloads/zf1112/bin/zf.php
adding /home/wjgilmore/Downloads/zf1112/bin/zf.bat
12365 total milliseconds
Lucene will immediately begin the indexing process, recursively indexing every file in the directory and saving the index to a directory named index, which is located within the current directory (you can change the directory location using the -index argument). Once indexing is complete, you can search the index by running the org.apache.lucene.demo.SearchFiles class. In the following example I’m searching the Zend Framework source code for the term WindowsAzure:
$ java org.apache.lucene.demo.SearchFiles
Enter query:
WindowsAzure
Searching for: windowsazure
45 total matching documents
1. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationDataSources.php
2. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Storage/PageRegionInstance.php
3. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/PerformanceCounterSubscription.php
4. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/RetryPolicy/RetryPolicyAbstract.php
5. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationWindowsEventLog.php
6. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationPerformanceCounters.php
7. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationLogs.php
8. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationDiagnosticInfrastructureLogs.php
9. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationDirectories.php
10. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Credentials/Exception.php
Press (n)ext page, (q)uit or enter number to jump to a page.
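To get a feel for what the two demo programs are doing conceptually, here is a deliberately tiny sketch written with nothing but the Java standard library. It is not Lucene’s implementation and uses none of Lucene’s API; the class TinyIndex and its methods are hypothetical names of my own. It walks a directory recursively, records which files contain which lowercased terms, and lowercases the query before lookup, much as the demo output above reports searching for windowsazure.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TinyIndex {
    // Maps each lowercased term to the names of the documents that contain it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Recursively indexes every regular file found under the given directory. */
    public void addDirectory(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path file : files) {
            // Decoding as UTF-8 replaces malformed bytes instead of failing.
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            addDocument(file.toString(), text);
        }
    }

    /** Splits the text on anything that is not a letter or digit and records each term. */
    public void addDocument(String name, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(name);
            }
        }
    }

    /** Returns the names of all documents containing the term, ignoring case. */
    public Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Usage: java TinyIndex <directory> <term>");
            return;
        }
        TinyIndex index = new TinyIndex();
        index.addDirectory(Path.of(args[0]));
        for (String match : index.search(args[1])) {
            System.out.println(match);
        }
    }
}
```

A real Lucene index adds a great deal on top of this idea, including relevance ranking, richer query syntax, and on-disk storage, which is exactly why a library is worth using.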
Searching Indexed Data with Solr
Newcomers to Lucene are often confused about its role in the search implementation process. Running the Lucene demo should help dispel that confusion, because the demo is an application created using the Lucene library. Lucene is not a search application; rather, you’ll use its API to write your own search application. You can peruse the demo’s index and search source code by downloading the Lucene source code and then navigating to /contrib/demo/src/java/org/apache/lucene/demo.
Presuming your search needs aren’t so exotic that you have to write a custom index and search implementation, consider using Solr, a Lucene-based search application, which bundles a great number of useful features including a Web-based interface, support for indexing and searching multiple document types, and even database integration.
After downloading Solr, unzip the archive and, from a terminal, navigate to the example directory. Then execute the following command:
$ java -jar start.jar
This will start the Jetty application server bundled with Solr, allowing you to access the Solr administration interface via the URL http://localhost:8983/solr/admin/. After confirming you are able to access this interface, build the Solr search index using the sample XML files included in the Solr download by executing the following command:
$ java -jar exampledocs/post.jar exampledocs/*.xml
SimplePostTool: version 1.3
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file gb18030-example.xml
SimplePostTool: POSTing file hd.xml
SimplePostTool: POSTing file ipod_other.xml
SimplePostTool: POSTing file ipod_video.xml
SimplePostTool: POSTing file mem.xml
SimplePostTool: POSTing file monitor2.xml
SimplePostTool: POSTing file monitor.xml
SimplePostTool: POSTing file mp500.xml
SimplePostTool: POSTing file sd500.xml
SimplePostTool: POSTing file solr.xml
SimplePostTool: POSTing file utf8-example.xml
SimplePostTool: POSTing file vidcard.xml
SimplePostTool: COMMITting Solr index changes..
Now, return to the browser and within the Query String textarea form field search for a bit of text found in the XML documents. For instance, try searching for Dell, which is found in the document monitor.xml.
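The same search can also be issued from your own code over HTTP rather than through the browser form. The following is a rough sketch, not an official client: the class SolrQuery and its methods are names I made up, I’m assuming Java 11 or later for the built-in HTTP client, and I’m assuming the example server answers queries at the /select path beneath the base URL used above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SolrQuery {
    // Base URL matches the example server above; the /select path below is an assumption.
    static final String SOLR_BASE = "http://localhost:8983/solr";

    /** Builds the query URL, URL-encoding the search text. */
    static String buildQueryUrl(String queryText) {
        return SOLR_BASE + "/select?q=" + URLEncoder.encode(queryText, StandardCharsets.UTF_8);
    }

    /** Sends a GET request and returns the raw response body. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws InterruptedException {
        String url = buildQueryUrl(args.length > 0 ? args[0] : "Dell");
        System.out.println("Requesting " + url);
        try {
            System.out.println(fetch(url));
        } catch (IOException e) {
            // Most likely the example server is not running.
            System.out.println("Could not reach Solr: " + e);
        }
    }
}
```

Run it with the example server started and it should print the raw response for the Dell query; pass a different term as the first argument to search for something else.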
The post.jar utility demonstrates just one of several approaches to indexing data using Solr. See the Solr wiki for more information about alternative approaches. Further, if you continue experimenting with the post.jar file, you’ll notice that it is restricted to indexing solely XML files. Of course, you’ll probably want to index multiple file types. If so, check out the Apache Tika project, which can be used in conjunction with Solr to index more than 30 different content types, including MP3, PST (Outlook files), and PowerPoint!
Where to From Here?
Lucene, Solr, and related projects such as Nutch offer developers an incredibly powerful solution for searching data contained within a wide variety of document types. Of course, the examples provided in this article barely scratch the surface of what’s possible. Be sure to consult the following websites for more information:
- The official Lucene project website: a powerful text search engine library
- The official Solr project website: a search implementation built atop the Lucene library
- The official Nutch project website: an open source Web crawler and search engine built atop the Lucene library
About the Author
Jason Gilmore — Contributing Editor, PHP — is the founder of EasyPHPWebsites.com, and author of the popular book, “Easy PHP Websites with the Zend Framework”. Jason is a cofounder and speaker chair of CodeMash, a nonprofit organization tasked with hosting an annual namesake developer’s conference, and was a member of the 2008 MySQL Conference speaker selection board.