Analyzers constitute an important part of indexing. To understand what analyzers do, let's consider three documents:
{ This, is, easy }
{ This, is, fast }
{ This, is, easy, and, fast }

Here, terms such as This, is, as well as and are not relevant keywords. The chances of someone searching for such words are slim, as these words don't contribute to the facts or context of the document. Hence, it's safe to avoid these words while indexing; or rather, you should avoid making these words searchable.
So, the tokenization would be as follows:
{ easy }
{ fast }
{ easy, fast }

Words such as the, or, as well as and are referred to as stop words. In most cases, these provide grammatical support, and the chances that someone will search based on them are slim. Also, the analysis and removal of stop words is very much language dependent. The process of selecting/transforming the searchable tokens from a document while indexing is called analyzing, and the module that facilitates this is called an analyzer. The analyzer we just discussed is a stop word analyzer. By applying the right analyzer, you can minimize the number of searchable tokens and hence get better search performance.
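You can see this behavior for yourself with the _analyze API. Here is a quick sketch, assuming Elasticsearch is running locally on port 9200; the built-in stop analyzer lowercases the text and drops common English stop words:

curl -XGET 'http://localhost:9200/_analyze?analyzer=stop&pretty' -d 'This is easy and fast'

This should return only the tokens easy and fast.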
There are three stages through which you can perform an analysis:
- Character filtering: transforms the character stream before it is tokenized; for example, a character filter can strip out HTML markup before the tokenizer gets to work.
- Tokenizing: splits the character stream into individual tokens.
- Token filtering: adds, removes, or modifies tokens; for example, the length filter removes words which are too long or too short for the stream.

Here is a flowchart that depicts this process:

(Flowchart: character filters -> tokenizer -> token filters)
It should be noted that any number of such components can be incorporated at each stage. A combination of these components is called an analyzer. To create an analyzer out of the existing components, all we need to do is add its configuration to our Elasticsearch configuration file.
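Before committing a configuration, you can experiment by assembling these components on the fly with the _analyze API. The following is a minimal sketch, assuming a local node on port 9200 and an Elasticsearch version whose _analyze API accepts the char_filters parameter:

curl -XGET 'http://localhost:9200/_analyze?char_filters=html_strip&tokenizer=standard&filters=lowercase,stop&pretty' -d '<b>This</b> is easy'

The character filter strips the <b> tags, the tokenizer splits the words, and the token filters lowercase them and drop the stop words, leaving only the token easy.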
The following are the different types of character filters:

- html_strip: strips HTML markup out of the character stream
- mapping: replaces occurrences of one string with another, based on a list of mappings (see the sketch after this list), for example:
  "mappings" : ["ph=>f", "qu=>q"]
- pattern_replace: replaces characters that match a regular expression with the specified replacement
The following are different types of tokenizers:

- standard: splits the text on word boundaries and discards most punctuation
- whitespace: splits the text on whitespace alone
- letter: splits the text on anything that is not a letter
- keyword: emits the entire input as a single token
- pattern: splits the text using a regular expression

The following are the different types of token filters:

- lowercase: normalizes all tokens to lowercase
- stop: removes stop words from the token stream
- length: removes words that are too long or too short for the stream
- shingle: builds word n-grams from adjacent tokens, as shown below; for example, with two-word shingles:
  "Latin America is a great place to go in summer" => { "Latin America", "America is", "is a", "a great", "great place", "place to", "to go", "go in", "in summer" }
Now, let's create our own analyzer and apply it to an index. I want to make an analyzer that strips out HTML tags before indexing. Also, there should be no differentiation between lowercase and uppercase while searching; in short, the search is case insensitive. We are not interested in searching for words such as "is" and "the", which are stop words. Also, we are not interested in words that are more than 900 characters long. The following are the settings that you need to paste into the config/elasticsearch.yml file to create this analyzer:
index:
  analysis:
    analyzer:
      myCustomAnalyzer:
        tokenizer: smallLetter
        filter: [lowercase, stopWord]
        char_filter: [html_strip]
    tokenizer:
      smallLetter:
        type: standard
        max_token_length: 900
    filter:
      stopWord:
        type: stop
        stopwords: ["are", "the", "is"]
Here, I named my analyzer myCustomAnalyzer. By adding the character filter html_strip, all HTML tags are removed from the stream. A filter called stopWord is created, where we define our stop words; if we don't mention any, they are taken from the default set. The smallLetter tokenizer is a standard tokenizer whose max_token_length setting caps tokens at 900 characters.
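Once the node is restarted with these settings, you can verify the analyzer through the _analyze API. A small sanity check, assuming the configuration above was picked up:

curl -XGET 'http://localhost:9200/_analyze?analyzer=myCustomAnalyzer&pretty' -d 'The <b>weather</b> IS nice'

The HTML tags are stripped, everything is lowercased, and the stop words are removed, so only the tokens weather and nice should survive.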
A combination of character filters, token filters, and tokenizers is called an analyzer. You can make your own analyzer using these building blocks, but there are also ready-made analyzers that work well in most use cases. A snowball analyzer is an analyzer of type snowball that uses the standard tokenizer with the standard filter, lowercase filter, stop filter, and snowball filter, which is a stemming filter.
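To get a feel for stemming, you can run some text through the snowball analyzer, assuming a local node:

curl -XGET 'http://localhost:9200/_analyze?analyzer=snowball&pretty' -d 'The dogs are running'

The stop words are dropped and the remaining tokens are stemmed to their root forms, dog and run.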
Here is how you can pass the analyzer setting to Elasticsearch:
curl -X PUT "http://localhost:9200/wiki" -d '{
"index" : {
"number_of_shards" : 4,
"number_of_replicas" : 1 ,
"analysis":{
"analyzer":{
"content" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["lowercase" , "stop" , "kstem"],
"char_filter" : ["html_strip"]
}
}
}
}
}'Having understood how we can create an index and define field mapping with the analyzers, we shall go ahead and index some Wikipedia documents. For quick demonstration purpose, I have created a simple Python script to make some JSON documents. I am trying to create corresponding JSON files for the wiki pages for the following countries:
Here is the script written in Python if you want to use it. This takes as input two command-line arguments: the first one is the title of the page and the second is the link:
import urllib2
import json
import sys

# The page title and the link are passed as command-line arguments
link = sys.argv[2]
htmlObj = { "link" : link,
            "Author" : "anonymous",
            "timestamp" : "09-02-2014 14:16:00",
            "Title" : sys.argv[1]
          }
# Download the page and store its raw HTML in the document
response = urllib2.urlopen(link)
htmlObj['html'] = response.read()
# Emit the document as pretty-printed JSON on stdout
print json.dumps(htmlObj, indent=4)

Let's assume the name of the Python file is json_generator.py. The following is how we execute it:
python json_generator.py France https://en.wikipedia.org/wiki/France > France.json
Now, we have a JSON file called France.json that has the sample data we are looking for.
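To generate JSON files for several countries in one go, a small shell loop such as the following can help, assuming each title matches its Wikipedia URL suffix:

for country in France India; do
  python json_generator.py "$country" "https://en.wikipedia.org/wiki/$country" > "$country.json"
done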
I assume that you generated JSON files for each country that we mentioned. As seen earlier, indexing a document once it is created is simple. I created the index using the curl command shown earlier and then defined the mappings as follows:
curl -X PUT "http://localhost:9200/wiki" -d '{
"index" : {
"number_of_shards" : 4,
"number_of_replicas" : 1 ,
"analysis":{
"analyzer":{
"content" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["lowercase" , "stop" , "kstem"],
"char_filter" : ["html_strip"]
}
}
}
}
}'
curl -X PUT "http://localhost:9200/wiki/articles/_mapping" -d '{
"articles" :{
"_all" : {"enabled" : true },
"properties" :{
"Title" : { "type" : "string" , "Analyzer":"content" , "include_in_all" : true},
"link" : { "type" : "string" , "include_in_all" : false , "index" : "no" },
"Author" : { "type" : "string" , "include_in_all" : false },
"timestamp" : { "type" : "date", "format" : "dd-MM-yyyy HH:mm:ss" , "include_in_all" : false },
"html" : { "type" : "string" ,"Analyzer":"content" , "include_in_all" : true }
}
}
}'Once this is done, documents can be indexed like this. I assume that you have the file India.json. You can index it as:
curl -XPOST 'http://localhost:9200/wiki/articles/' -d @India.json
Index all the documents likewise.
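To confirm that the documents went in and that the content analyzer makes the search case insensitive, a quick query against the index should bring the articles back:

curl -XGET 'http://localhost:9200/wiki/articles/_search?q=france&pretty'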