December 4, 2016
Elasticsearch: Analyzing Text with the Analyze API

This article is an excerpt from Elasticsearch in Action, published by Manning Publications.

By Radu Gheorghe, Matthew Lee Hinman, and Roy Russo.

Using the analyze API to test an analysis process can be extremely helpful when tracking down how information is being stored in your Elasticsearch indices. This API allows you to send any text to Elasticsearch, specifying what analyzer, tokenizer, or token filters to use, and get back the analyzed tokens. The following listing shows an example of what the analyze API looks like, using the standard analyzer to analyze the text "I love Bears and Fish."

Listing 1: Example of using the analyze API

% curl -XPOST 'localhost:9200/_analyze?analyzer=standard' \
   -d 'I love Bears and Fish.'
{
   "tokens": [
      {
         "end_offset": 1,
         "position": 1,
         "start_offset": 0,
         "token": "i",          #A
         "type": "<ALPHANUM>"
      },
      {
         "end_offset": 6,
         "position": 2,
         "start_offset": 2,
         "token": "love",       #A
         "type": "<ALPHANUM>"
      },
      {
         "end_offset": 12,
         "position": 3,
         "start_offset": 7,
         "token": "bears",      #A
         "type": "<ALPHANUM>"
      },
      {
         "end_offset": 16,
         "position": 4,
         "start_offset": 13,
         "token": "and",        #A
         "type": "<ALPHANUM>"
      },
      {
         "end_offset": 21,
         "position": 5,
         "start_offset": 17,
         "token": "fish",       #A
         "type": "<ALPHANUM>"
      }
   ]
}

#A The analyzed tokens: "i", "love", "bears", "and", and "fish"

The most important part of the analyze API's output is the token key in each entry. The output is a list of maps, one per token, showing what the processed tokens (the ones that will actually be written to the index) look like. For example, with the text "I love Bears and Fish." you get back five tokens: i, love, bears, and, fish. Notice that in this case, with the standard analyzer, each token was lowercased and the punctuation at the end of the sentence was removed. This is a great way to test documents to see how Elasticsearch will analyze them, and the API offers several ways to customize the analysis that's performed on the text.
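
To make the shape of that output concrete, here is a minimal Python sketch that mimics what the standard analyzer did to this one sentence. It is only an approximation for illustration: the function name is hypothetical, and a simple word regex stands in for the real tokenizer, which handles far more cases. Positions start at 1 to match the listing above.

```python
import re

def approximate_standard_analyzer(text):
    """Rough stand-in for the standard analyzer: split on word
    characters, lowercase each token, and record offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        tokens.append({
            "token": match.group().lower(),   # lowercased term
            "start_offset": match.start(),    # index of first character
            "end_offset": match.end(),        # index just past last character
            "position": position,             # 1-based, as in the listing
        })
    return tokens

for entry in approximate_standard_analyzer("I love Bears and Fish."):
    print(entry)
```

For this sentence the sketch yields the same terms and offsets as Listing 1; it should not be expected to agree with Elasticsearch on arbitrary text.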

Selecting an analyzer

If you already have an analyzer in mind and want to see how it handles some text, you can set the analyzer parameter to the name of the analyzer. We'll go over the different built-in analyzers in the next section, so keep this in mind if you want to try out any of them!

If you configured an analyzer in your elasticsearch.yml file, you can also reference it by name in the analyzer parameter. Additionally, if you've created an index with a custom analyzer similar to the example in listing 5.2, you can still refer to this analyzer by name, but instead of using the top-level HTTP endpoint of /_analyze, you'll need to specify the index first. An example using the index named veggies and an analyzer called myanalyzer is shown here:

% curl -XPOST 'localhost:9200/veggies/_analyze?analyzer=myanalyzer' \
   -d '...'
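
The difference between the global and the index-scoped endpoint can be sketched with a small URL builder. This is a hypothetical helper, not part of any Elasticsearch client; it assumes the query-string style of request used in this article and the default host localhost:9200.

```python
from urllib.parse import urlencode

def analyze_url(host="localhost:9200", index=None, **params):
    """Build an analyze API URL (hypothetical helper).

    Without an index the request goes to /_analyze; with one it goes
    to /<index>/_analyze so index-level analyzers can be resolved.
    """
    path = f"/{index}/_analyze" if index else "/_analyze"
    # Keep commas unescaped so comma-separated values stay readable.
    query = urlencode(params, safe=",")
    return f"http://{host}{path}" + (f"?{query}" if query else "")

print(analyze_url(analyzer="standard"))
print(analyze_url(index="veggies", analyzer="myanalyzer"))
```

The first call targets a built-in analyzer, so no index is needed; the second names the veggies index because myanalyzer is defined only there.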

Combining parts to create an impromptu analyzer

Sometimes you may not want to use a built-in analyzer but instead try out a combination of tokenizers and token filters, for instance, to see how a particular tokenizer breaks up a sentence without any other analysis. With the analyze API you can specify a tokenizer and a list of token filters to be used for analyzing the text. For example, if you wanted to use the whitespace tokenizer (to split the text on spaces) and then apply the lowercase and reverse token filters, you could do so as follows:

% curl -XPOST 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase,reverse' \
   -d 'I love Bears and Fish.'

You'd get back the following tokens:

{
   "tokens" : [ {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 1
   }, {
      "token" : "evol",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
   }, {
      "token" : "sraeb",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
   }, {
      "token" : "dna",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "word",
      "position" : 4
   }, {
      "token" : ".hsif",
      "start_offset" : 17,
      "end_offset" : 22,
      "type" : "word",
      "position" : 5
   } ]
}

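This request first tokenized the sentence "I love Bears and Fish." on whitespace into the tokens I, love, Bears, and, Fish.; next, the lowercase filter turned them into i, love, bears, and, fish.; and finally, the reverse filter reversed each token to give i, evol, sraeb, dna, .hsif. Because the whitespace tokenizer splits only on whitespace, the trailing period stays attached to the last token.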
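
The tokenizer-then-filters pipeline described above can be sketched in a few lines of Python. This is a local simulation for illustration only: the function names are hypothetical, and it models just the three components used in this example, not Elasticsearch's actual implementation.

```python
import re

def whitespace_tokenizer(text):
    """Split on runs of whitespace, keeping character offsets."""
    return [
        {
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "type": "word",
            "position": position,  # 1-based, as in the output above
        }
        for position, match in enumerate(re.finditer(r"\S+", text), start=1)
    ]

def lowercase(term):
    return term.lower()

def reverse(term):
    return term[::-1]

def analyze(text, tokenizer, filters):
    """Run the tokenizer once, then apply each filter in order.

    Filters change only the term text; the offsets still point at the
    original input, which is why they are unchanged by reversal."""
    tokens = tokenizer(text)
    for token_filter in filters:
        for entry in tokens:
            entry["token"] = token_filter(entry["token"])
    return tokens

result = analyze("I love Bears and Fish.", whitespace_tokenizer, [lowercase, reverse])
print([entry["token"] for entry in result])
```

Changing the order of the filters list changes the order in which they run, which is an easy way to see that filters are applied in sequence after tokenization.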

Analyzing based on a field's mapping

Once you start creating mappings for an index, the analyze API offers one more helpful feature: Elasticsearch lets you analyze text with the analyzer of a field whose mapping has already been created. If you create a mapping with a field description that looks like this snippet

... other mappings ...
"description": {
   "type": "string",
   "analyzer": "italian"
}

you can then use the analyzer associated with the field by specifying the field parameter with the request:

% curl -XPOST 'localhost:9200/veggies/_analyze?field=description' \
   -d 'Era deliziosa'

The Italian analyzer will automatically be used, because it's the analyzer associated with the description field. Keep in mind that in order to use this, you'll need to specify an index, because Elasticsearch needs to be able to get the mappings for a particular field from an index.
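
Conceptually, the field parameter turns the request into a lookup: find the field in the index's mappings and use whatever analyzer is configured there. The Python sketch below illustrates that idea with a plain dictionary. The function name and the name field are hypothetical, and falling back to the standard analyzer when a field sets none is an assumption of this sketch.

```python
def analyzer_for_field(mappings, field, default="standard"):
    """Return the analyzer configured for a field in a mappings dict.

    A missing field is an error, because there is no mapping to
    consult; a field without an explicit analyzer gets the default
    (an assumption made for this sketch)."""
    field_mapping = mappings.get(field)
    if field_mapping is None:
        raise KeyError(f"no mapping for field {field!r}")
    return field_mapping.get("analyzer", default)

# Mappings for the veggies index; "name" is a hypothetical extra field.
veggies_mappings = {
    "description": {"type": "string", "analyzer": "italian"},
    "name": {"type": "string"},
}

print(analyzer_for_field(veggies_mappings, "description"))
print(analyzer_for_field(veggies_mappings, "name"))
```

This also shows why the index must be part of the request: without it there is no set of mappings in which to look the field up.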



 

