
Implement Data Indexing and Search with Lucene and Solr

  • July 5, 2011
  • By Jason Gilmore

Over the years I've written a number of articles about SQL joins, most recently Choosing the Right SQL Join. But what if your data isn't stored in a database at all, and is instead spread across HTML, PDF, or Microsoft Word files, or worse, a combination of multiple file formats? Or what if your project is to create a specialized search engine that combs the most prominent corporate and research websites associated with the pharmaceutical or aeronautical industries? Indexing and searching data of this nature is no small feat, yet it remains a commonplace computing task.

Thankfully, as is often the case with common yet complex computing problems, chances are an open source project exists that lowers at least some of the implementation barriers. As it happens, a fantastic open source indexing and search library named Lucene offers developers a powerful tool for sifting through extraordinarily large data repositories while remaining relatively simple to implement (particularly in a small office environment). In this article, I'll introduce you to Lucene, explaining how to install and configure it so you can begin searching data in entirely new ways.

Installing Lucene

Lucene is written in Java (although quite a few ports to other languages exist), so it is supported on all of the major operating systems, Windows included. To install Lucene, download the latest stable binary release and extract it to a convenient location. You'll also want to install Apache Ant, which we'll use to build the demo bundled with the Lucene download. You also have the option of downloading the Lucene source and building it using Ant; if you choose this approach, be sure to read the BUILD.txt file bundled with the source download.

With Lucene downloaded and Ant installed, you'll next need to add two JAR files to your CLASSPATH: lucene-core-3.2.0.jar, which resides in the Lucene download's root directory, and lucene-demo-3.2.0.jar, which resides in the Lucene download's contrib/demo directory. Exactly how you modify the CLASSPATH variable is operating system-specific, so consult the Java documentation if you are unsure how to do so.
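If you'd like to confirm that both JAR files are actually visible to Java before running the demo, a small check along the following lines may help. This is only a sketch: the ClasspathCheck class and its isOnClasspath method are names I've made up for illustration, and the assumption that org.apache.lucene.index.IndexWriter lives in the core JAR while the two demo classes live in the demo JAR is based on the Lucene 3.2.0 layout described above.

```java
/** Hypothetical helper: reports whether the classes the demo needs can be found. */
final class ClasspathCheck {

    /** Returns true if the named class can be located without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.lucene.index.IndexWriter", // assumed to ship in lucene-core-3.2.0.jar
            "org.apache.lucene.demo.IndexFiles",   // assumed to ship in lucene-demo-3.2.0.jar
            "org.apache.lucene.demo.SearchFiles"   // assumed to ship in lucene-demo-3.2.0.jar
        };
        for (String name : required) {
            String status = isOnClasspath(name) ? "found   " : "MISSING ";
            System.out.println(status + name);
        }
    }
}
```

Save this as ClasspathCheck.java, compile it with javac, and run it with java ClasspathCheck from the same shell in which you set the CLASSPATH. Any line marked MISSING suggests the corresponding JAR has not been added correctly.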

Running the Lucene Indexing and Searching Demo

With the CLASSPATH modified, execute the following command, pointing the -docs argument at the directory you'd like to index:

$ java org.apache.lucene.demo.IndexFiles -docs /PATH/TO/FILE/REPOSITORY
Indexing to directory 'index'...
adding /home/wjgilmore/Downloads/zf1112/INSTALL.txt
adding /home/wjgilmore/Downloads/zf1112/LICENSE.txt
adding /home/wjgilmore/Downloads/zf1112/README.txt
...
adding /home/wjgilmore/Downloads/zf1112/library/Zend/Text/Table/Column.php
adding /home/wjgilmore/Downloads/zf1112/bin/zf
adding /home/wjgilmore/Downloads/zf1112/bin/zf.php
adding /home/wjgilmore/Downloads/zf1112/bin/zf.bat
12365 total milliseconds

Lucene will immediately begin the indexing process, recursively indexing every file in the directory and saving the index to a directory named index within the current directory (you can change this location using the -index argument). Once indexing is complete, you can search the index by running the org.apache.lucene.demo.SearchFiles class. In the following example I'm searching the Zend Framework source code for the term WindowsAzure:

$ java org.apache.lucene.demo.SearchFiles
Enter query: 
WindowsAzure
Searching for: windowsazure
45 total matching documents
1. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationDataSources.php
2. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Storage/PageRegionInstance.php
3. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/PerformanceCounterSubscription.php
4. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/RetryPolicy/RetryPolicyAbstract.php
5. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationWindowsEventLog.php
6. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationPerformanceCounters.php
7. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationLogs.php
8. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationDiagnosticInfrastructureLogs.php
9. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Diagnostics/ConfigurationDirectories.php
10. /home/wjgilmore/Downloads/zf1112/library/Zend/Service/WindowsAzure/Credentials/Exception.php
Press (n)ext page, (q)uit or enter number to jump to a page.
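To get a feel for what the two demo programs are doing conceptually, here is a toy inverted index written in plain Java. It is emphatically not Lucene's implementation or API; the ToyIndex class, its method names, and its simplistic tokenizer are all my own assumptions for illustration. It does mirror two behaviors visible in the output above: the directory is walked recursively, and terms are lowercased so that a query for WindowsAzure is matched as windowsazure.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy inverted index (hypothetical): maps each lowercased term to the documents containing it. */
final class ToyIndex {
    private final Map<String, TreeSet<String>> postings = new HashMap<>();

    /** Splits text on runs of non-alphanumeric characters and lowercases each token. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{Alnum}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** Records every term of one document under the given document name. */
    void add(String docName, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<String>()).add(docName);
        }
    }

    /** Recursively indexes every regular file below root, reading each file as UTF-8 text. */
    void addDirectory(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path file : files) {
            add(file.toString(), new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        }
    }

    /** Returns the sorted names of documents containing the lowercased query term. */
    List<String> search(String query) {
        TreeSet<String> hits = postings.get(query.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<String>() : new ArrayList<String>(hits);
    }

    public static void main(String[] args) throws IOException {
        ToyIndex index = new ToyIndex();
        index.addDirectory(Paths.get(args.length > 0 ? args[0] : "."));
        for (String hit : index.search("WindowsAzure")) {
            System.out.println(hit);
        }
    }
}
```

Unlike this sketch, Lucene persists its index to disk, ranks results by relevance, and supports a much richer query syntax, which is exactly why you'd reach for it rather than rolling your own.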

Tags: search, indexing

Originally published on http://www.developer.com.

Page 1 of 2


