
Full-Text Search with Apache Lucene 5.3.x/5.4.x

Before we delve into Apache Lucene, the following are the most important terms you need to be familiar with; they will also clarify the basic vocabulary of search and information retrieval.

We'll work with Apache Lucene 5.3.x/5.4.x. The most important aspects of Lucene are covered under each heading below.

Apache Lucene introduction

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

It’s an open-source project available for free download: a cross-platform solution that offers scalable, high-performance indexing and powerful, accurate, and efficient search algorithms.

Lucene terms and concepts

  • Inverted Index is used to traverse from a string or search term to the document IDs or locations of that term. Visualized as an index, it is inverted: we use the term as a handle to retrieve IDs or locations – the reverse of the usual use of an index.
  • Index is a handle (information) that can be used to get related information from a file, database, or any other source of data. Usually, an Index is also accompanied by compression, a checksum, a hash, or the location of the remaining data. An Index contains multiple Documents.
  • Document is a collection of Fields and the Values against each of the Fields. It is more like saying that “Employee Name” – “Sumith Puri” | “Employee Designation” – “Software Architect” | “Employee Age” – “33” | “Employee ID” – “067X” forms a document. The Lucene indexing process adds multiple documents to an Index. The entire set of Documents is called the Corpus.
  • Fields contain Terms and are simply sets of tokens of information. The Lucene indexing process identifies (or processes) fields and indexes them. A Field always belongs to a Document.
  • Terms are a Token or String of Information. This Term is the smallest piece of Information that will be Indexed to form the Inverted Index. The set of distinct Terms is called the Vocabulary.
  • String is simply a Token or an English language string.
  • Segment is a fragmented or chunked part of the entire Index, for better storage and faster retrieval.
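The Document/Field/Term relationship above can be sketched in code. A minimal sketch (field names mirror the employee example above; index-writing boilerplate is omitted):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class EmployeeDocument {
    public static Document build() {
        // A Document is a collection of Fields; each Field holds one or more Terms.
        Document doc = new Document();
        doc.add(new TextField("employeeName", "Sumith Puri", Field.Store.YES));             // analyzed into Terms
        doc.add(new TextField("employeeDesignation", "Software Architect", Field.Store.YES));
        doc.add(new StringField("employeeAge", "33", Field.Store.YES));                     // one exact Term
        doc.add(new StringField("employeeId", "067X", Field.Store.YES));                    // not analyzed: exact-match key
        return doc;
    }
}
```

The indexing process adds many such Documents to an Index; together they form the Corpus.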

Lucene Segment (indexing)

Each segment index maintains the following:

  • Field Names: This contains the set of field names used in the index.
  • Stored Field Values: This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, URL, or an identifier used to access a database. The set of stored fields is what is returned for each hit when searching. This is keyed by document number.
  • Term Dictionary: A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term’s frequency and proximity data.
  • Term Frequency Data: For each term in the dictionary, the IDs of all the documents that contain that term, and the frequency of the term in each such document, unless frequencies are omitted (IndexOptions.DOCS in 5.x).
  • Term Proximity Data: For each term in the dictionary, the positions that the term occurs in each document. Note that this will not exist if all fields in all documents omit position data.
  • Normalization Factors: For each field in each document, a value is stored that is multiplied into the score for hits on that field.
  • Term Vectors: For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors.
  • Deleted Documents: An optional file indicating which documents are deleted.
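Which of these per-segment structures actually get written is controlled per field. A hedged sketch against the 5.x API (field name and flag values are illustrative) showing how IndexOptions and the term-vector flag on a FieldType map to the frequency, proximity, and term-vector data above:

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public class BodyFieldFactory {
    public static Field create(String text) {
        FieldType ft = new FieldType();
        ft.setTokenized(true);                                          // run the value through an Analyzer
        ft.setStored(true);                                             // keep the raw value in Stored Field Values
        ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);  // write frequency and proximity data
        ft.setStoreTermVectors(true);                                   // also write per-document Term Vectors
        ft.freeze();                                                    // make the type immutable before use
        return new Field("body", text, ft);
    }
}
```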

Lucene Internals (architecture)

Here are the Lucene architectural layers and segment search:

[Figure: Lucene architectural layers and segment search]

And here is the typical data flow in a Lucene real-world application:

[Figure: Typical data flow in a real-world Lucene application]

Lucene Analysis (analysis/process)

Pre-Tokenization: Stripping HTML markup, transforming or removing text matching arbitrary patterns or sets of fixed strings.

Post-Tokenization:

  • Stemming – Replacing words with their stems. For instance, with English stemming, “bikes” is replaced with “bike”; now the query “bike” can find both documents containing “bike” and those containing “bikes”.
  • Stop Words Filtering – Common words like “the”, “and” and “a” rarely add any value to a search. Removing them shrinks the index size and increases performance. It may also reduce some “noise” and actually improve search quality.
  • Text Normalization – Stripping accents and other character markings can make for better searching.
  • Synonym Expansion – Adding in synonyms at the same token position as the current word can mean better matching when users search with words in the synonym set.
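The post-tokenization steps above are chained as token filters inside an Analyzer. A sketch against the 5.x API (the filter selection is illustrative, not the only option): lowercase each token, drop English stop words, then Porter-stem what remains:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class StemmedEnglishAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);                  // text normalization
        result = new StopFilter(result, StandardAnalyzer.STOP_WORDS_SET);  // stop-words filtering
        result = new PorterStemFilter(result);                             // stemming: "bikes" -> "bike"
        return new TokenStreamComponents(source, result);
    }
}
```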

Below is an example Lucene analysis of a text/sentence:
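The original figure for this example is not reproduced here, so the following is a stand-in sketch: running the stock StandardAnalyzer over a sentence (the sentence is my own) and collecting the tokens it emits.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {
    public static List<String> analyze(String text) throws Exception {
        List<String> tokens = new ArrayList<>();
        Analyzer analyzer = new StandardAnalyzer();   // lowercases and removes English stop words
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        // "is" and "a" are stop words; "Full-Text" is split on the hyphen and lowercased
        System.out.println(analyze("Lucene is a Full-Text Search Library"));
    }
}
```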

Lucene on Maven (Build)

Here are typical Lucene Maven dependencies (without Hibernate Search):
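The dependency list itself did not survive in this copy, so here is a hedged reconstruction: the three artifacts almost every plain-Lucene project needs. The 5.4.1 version number is only an example; substitute your 5.3.x/5.4.x release.

```xml
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>5.4.1</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>5.4.1</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>5.4.1</version>
</dependency>
```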

Lucene sample (indexing)

Some important points before you start indexing and searching using Apache Lucene:
  • It’s integrated/merged with many existing frameworks like Hibernate to form Hibernate Search
  • Multiple types of data can be indexed such as files or databases
  • Multiple ways to analyze data suiting your application, including default StandardAnalyzer
  • Index to multiple storage modes such as directly to filesystem or to memory
  • Multiple types of queries suiting every need such as TermQuery and PrefixQuery
  • Use it as standalone or inside an application or web server
  • Ported to multiple programming languages and platforms, including Java, JEE and .NET

Here is an indexing sample (without Hibernate Search):
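The original sample is not reproduced in this copy, so the following is a minimal sketch against the 5.x API. The field names and the in-memory RAMDirectory are my choices; use FSDirectory.open(...) to index to the filesystem instead.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class IndexingSample {
    public static Directory buildIndex() throws Exception {
        Directory dir = new RAMDirectory();   // in-memory; FSDirectory.open(Paths.get("index")) for disk
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            Document doc = new Document();
            doc.add(new StringField("employeeId", "067X", Field.Store.YES));        // exact-match key
            doc.add(new TextField("employeeName", "Sumith Puri", Field.Store.YES)); // analyzed full text
            writer.addDocument(doc);
            writer.commit();                  // make the new segment visible to readers
        }
        return dir;
    }
}
```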


Lucene sample (searching)

Here’s a search sample (without Hibernate Search):
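Again, the original sample is missing from this copy; here is a hedged sketch against the 5.x API that searches an index like the one built above (field names are the same assumptions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

public class SearchSample {
    public static void search(Directory dir, String queryText) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Parse the user's query against the "employeeName" field,
            // using the same analyzer that was used at indexing time.
            Query query = new QueryParser("employeeName", new StandardAnalyzer()).parse(queryText);
            TopDocs hits = searcher.search(query, 10);   // top 10 hits
            for (ScoreDoc sd : hits.scoreDocs) {
                Document doc = searcher.doc(sd.doc);     // fetch stored fields for the hit
                System.out.println(doc.get("employeeId") + " : " + doc.get("employeeName"));
            }
        }
    }
}
```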



Lucene synchronization (reindexing)

I will soon share the best way to reindex without causing any application downtime; this matters for applications with SLAs of 99.99% or close to it.

Here are some of the most important Lucene query types:

  • Term Query
  • Prefix Query
  • Phrase Query
  • Wildcard Query
  • Fuzzy Query
  • Boolean Query
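Hedged constructor sketches for each of the types above (5.x API; field and term values are illustrative only):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.WildcardQuery;

public class QuerySamples {
    public static Query[] samples() {
        Query term     = new TermQuery(new Term("name", "sumith"));     // exact term
        Query prefix   = new PrefixQuery(new Term("name", "sum"));      // terms starting with "sum"
        Query phrase   = new PhraseQuery("name", "sumith", "puri");     // adjacent terms, in order
        Query wildcard = new WildcardQuery(new Term("name", "su*th"));  // ? and * wildcards
        Query fuzzy    = new FuzzyQuery(new Term("name", "sumit"));     // edit-distance matching
        BooleanQuery.Builder bool = new BooleanQuery.Builder();         // combine sub-queries
        bool.add(term, BooleanClause.Occur.MUST);
        bool.add(prefix, BooleanClause.Occur.SHOULD);
        return new Query[] { term, prefix, phrase, wildcard, fuzzy, bool.build() };
    }
}
```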

Best practices when using Apache Lucene (5.3.x/5.4.x) for indexing and searching also include the following. These are taken directly from the Apache Lucene wiki, with some modifications.
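Much of the wiki's indexing-speed advice (its ImproveIndexingSpeed page) comes down to IndexWriter configuration: reuse one writer for the whole batch, commit once at the end, and flush by RAM usage rather than document count. A hedged sketch; the buffer size is an illustrative value, not a recommendation:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

public class IndexingConfig {
    public static IndexWriterConfig tuned() {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // Flush by RAM usage: larger buffers mean fewer, larger segments and fewer merges.
        config.setRAMBufferSizeMB(64.0);
        // Use one IndexWriter for the entire batch and call commit() once at the end,
        // rather than opening/closing a writer per document.
        return config;
    }
}
```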


Best practices for searching using Apache Lucene 5.3.x/5.4.x:
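A central one (from the wiki's ImproveSearchingSpeed page) is to reuse a single IndexSearcher across queries instead of reopening one per search; SearcherManager handles the refresh and sharing lifecycle for you. A hedged sketch:

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.Directory;

public class SearcherHolder {
    private final SearcherManager manager;

    public SearcherHolder(Directory dir) throws Exception {
        // Share one searcher across threads; null means the default SearcherFactory.
        this.manager = new SearcherManager(dir, null);
    }

    public int countDocs() throws Exception {
        manager.maybeRefresh();                       // cheaply pick up index changes, if any
        IndexSearcher searcher = manager.acquire();   // borrow the shared searcher
        try {
            return searcher.getIndexReader().numDocs();
        } finally {
            manager.release(searcher);                // always release what you acquire
        }
    }
}
```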


You can download the entire sample project (Eclipse) with its source code here. This simple-file-search is also called “brahmashira” (the weapon from Lord Brahma, four times stronger than the brahmastra). Include it directly in your projects and start indexing and searching.

 

You can download and deploy it on Apache Tomcat 8.0.x, run the search engine at http://localhost:8080/simple-file-search, and then use search terms from the static data displayed there. Modify the content file and try reindexing and searching again.

Sumith is a Principal Java/Java EE Architect and a hard-core Java/Jakarta EE developer with 16 (and counting) years of experience. He is a Senior Member of ACM and IEEE; a DZone Core member; a member of CSI*; a DZone MVB; and a Java Code Geek. He holds a Bachelor of Engineering (Information Science & Engineering) from Sri Revana Siddeshwara Institute of Technology, completed the Executive Program in Data Mining & Analytics at the Indian Institute of Technology, and the Executive Certificate Program in Entrepreneurship at the Indian Institute of Management. His certifications include SCJP 1.4, SCJP 5.0, SCBCD 1.3, SCBCD 5.0, BB Spring 2.x*, BB Hibernate 3.x*, BB Java EE 6.x*, Quest C, Quest C++, and Quest Data Structures.
