ElasticSearch Blueprints

Communicating with the Elasticsearch server


cURL will be our tool of choice for communicating with Elasticsearch. Elasticsearch follows a REST-like protocol for its exposed web API. The HTTP methods it uses are as follows:

  • PUT: The HTTP method PUT is used to send configurations to Elasticsearch.

  • POST: The HTTP method POST is used to create new documents or to perform a search operation. When a document is successfully indexed using POST, Elasticsearch returns a unique ID that points to the indexed document.

  • GET: The HTTP method GET is used to retrieve an already indexed document. Each document has a unique ID called a doc ID (short form for document's ID). When we index a document using POST, Elasticsearch returns a document ID, which can be used to retrieve the original document.

  • DELETE: The HTTP method DELETE is used to delete documents from the Elasticsearch index. Deletion can be performed based on a search query or directly using the document ID.
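The methods above can also be sketched in Python using only the standard library. This is a minimal illustration rather than part of the cURL workflow: the helper name `build_request`, the base URL `http://localhost:9200`, and the document ID `1` are assumptions made for the example, and nothing is sent to a server because the requests are only built, not opened.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumed local Elasticsearch address


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the Elasticsearch REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(BASE_URL + path, data=data, method=method)
    if data is not None:
        # Recent Elasticsearch versions expect an explicit JSON content type.
        request.add_header("Content-Type", "application/json")
    return request


# POST creates a document; Elasticsearch assigns the document ID.
create = build_request("POST", "/news/public/", {"time": "12-10-2010"})
# GET retrieves a document by ID ("1" is a hypothetical ID).
fetch = build_request("GET", "/news/public/1")
# DELETE removes the document with that ID.
remove = build_request("DELETE", "/news/public/1")

for request in (create, fetch, remove):
    print(request.get_method(), request.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would send it, which requires a running Elasticsearch server.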

To specify the HTTP method in cURL, you can use the -X option, for example, curl -X POST http://localhost/. JSON is the data format used to communicate with Elasticsearch. In cURL, the data can be supplied in the following forms:

  • A command line: You can use the -d option to specify the JSON to be sent in the command line itself, for example:

    curl -X POST 'http://localhost:9200/news/public/' -d '{ "time" : "12-10-2010" }'
    
  • A file: If the JSON is too long or inconvenient to type in a command line, you can put it in a file and ask cURL to pick the JSON up from that file. You need to use the same -d option with an @ symbol just before the filename, for example:

    curl -X POST 'http://localhost:9200/news/public/' -d @file
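The file-based form can be mirrored in Python as well. The sketch below is only an illustration under stated assumptions: the function name `post_request_from_file` and the file name `doc.json` are hypothetical, the file is created in a temporary directory purely for the example, and the request is built without being sent.

```python
import json
import tempfile
import urllib.request
from pathlib import Path


def post_request_from_file(url, json_path):
    """Build a POST request whose body is the content of a JSON file."""
    payload = Path(json_path).read_bytes()
    json.loads(payload)  # fail early if the file is not valid JSON
    return urllib.request.Request(url, data=payload, method="POST")


# Write a small sample document to a temporary file, then build the request.
with tempfile.TemporaryDirectory() as tmp_dir:
    doc_path = Path(tmp_dir) / "doc.json"
    doc_path.write_text('{ "time" : "12-10-2010" }', encoding="utf-8")
    request = post_request_from_file("http://localhost:9200/news/public/", doc_path)

print(request.get_method(), len(request.data), "bytes")
```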
    

Shards and replicas

The concept of sharding is introduced in Elasticsearch to provide horizontal scaling. Scaling, as you know, means increasing the capacity of the search engine, both in index size and in query rate (queries per second). Let's say an application can store up to 1,000 feeds and gives reasonable performance. Now, we need to increase the capacity of this application to 2,000 feeds. This is where we look for scaling solutions. There are two types of scaling solutions:

  • Vertical scaling: Here, we add hardware resources, such as more main memory, more CPU cores, or RAID disks to increase the capacity of the application.

  • Horizontal scaling: Here, we add more machines to the system. In our example, we bring in one more machine and give each of the two machines 1,000 feeds. The result is computed by merging the results from both machines. As both machines work in parallel, they won't eat up more time or bandwidth.

Guess what! Elasticsearch can be scaled both horizontally and vertically. You can increase its main memory to improve its performance, and you can simply add a new machine to increase its capacity. Horizontal scaling is implemented using the concept of sharding in Elasticsearch. Since Elasticsearch is a distributed system, we also need to address data safety and availability, which we achieve using replicas. When one replica (size 1) is defined for a cluster with more than one machine, two copies of the entire feed become available in the distributed system. This means that even if a single machine goes down, we won't lose data, and at the same time, the load is picked up by the remaining machines. One important point to mention here is that the default number of shards and replicas is generally sufficient, and we can also change the number of replicas later on.
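To make horizontal scaling more concrete, here is a toy sketch of how documents might be spread across shards and how a search merges the per-shard results. It is not Elasticsearch's implementation: Elasticsearch picks a shard from a hash of the routing value (by default the document ID) modulo the number of primary shards, and `zlib.crc32` merely stands in for its internal hash function here. The names `shard_for`, `index_document`, and `search` are hypothetical.

```python
import zlib

NUMBER_OF_SHARDS = 2  # two shards, like the two machines in the example


def shard_for(doc_id, number_of_shards=NUMBER_OF_SHARDS):
    """Pick a shard from a stable hash of the document ID."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


def index_document(shards, doc_id, text):
    """Store the document on the shard chosen by its ID."""
    shards[shard_for(doc_id, len(shards))][doc_id] = text


def search(shards, term):
    """Query every shard independently, then merge the partial results."""
    hits = []
    for shard in shards:
        hits.extend(doc_id for doc_id, text in shard.items() if term in text.lower())
    return sorted(hits)


shards = [{} for _ in range(NUMBER_OF_SHARDS)]
for n in range(1, 7):
    index_document(shards, f"feed-{n}", f"News item {n} about Elasticsearch")

print([len(shard) for shard in shards])  # documents held by each shard
print(search(shards, "elasticsearch"))   # merged hits from all shards
```

Because the shard is derived from the document ID alone, the same document always lands on the same shard, which is one reason the number of primary shards is fixed when the index is created.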

This is how we create an index and pass the number of shards and replicas:

curl -X PUT "localhost:9200/news" -d '{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}'

A few things to be noted here are:

  • Adding more primary shards will increase the write throughput of the index

  • Adding more replicas will increase the durability of the index and the read throughput, at the cost of disk space
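As a quick check on the disk-space cost, the total number of shard copies in an index is the number of primary shards multiplied by one plus the number of replicas. The sketch below applies this to the settings used when creating the news index; the function name `total_shard_copies` is hypothetical.

```python
def total_shard_copies(number_of_shards, number_of_replicas):
    """Each primary shard exists once, plus once per configured replica."""
    return number_of_shards * (1 + number_of_replicas)


settings = {"index": {"number_of_shards": 2, "number_of_replicas": 1}}
index_settings = settings["index"]
copies = total_shard_copies(
    index_settings["number_of_shards"], index_settings["number_of_replicas"]
)
print(copies)  # 2 primaries + 2 replica shards = 4
```

Raising number_of_replicas to 2 would bring the total to 6 shard copies, with disk usage growing accordingly.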

Index-type mapping

An index is a logical grouping in which feeds of the same type are encapsulated together. A type is a logical subgrouping under an index. To create a type under an index, you need to decide on a type name. In our case, we take the index name as news and the type name as public. We created the index in the previous step, and now we need to define the data types of the fields that our data holds in the type mapping section.

Check out the sample given next. Here, no format is specified for the date field, so the date data type takes the time format to be yyyy/MM/dd HH:mm:ss by default:

curl -X PUT "localhost:9200/news/public/_mapping" -d '{
  "public": {
    "properties": {
      "Title": { "type": "string" },
      "Content": { "type": "string" },
      "DOP": { "type": "date" }
    }
  }
}'

Once we apply a mapping, certain aspects of it, such as new field definitions, can be updated. However, we can't update certain other aspects, such as the type of a field or the analyzer assigned to it.

So, we now know how to create an index and add the necessary mappings to the index we created. There is another important thing that you must take care of while indexing your data, that is, the analysis of your data. I guess you already know the importance of analysis. In simple terms, analysis is the breaking down of text into elementary units called tokens. This tokenization is a must and has to be given serious consideration. Elasticsearch has many built-in analyzers that do this job for you. At the same time, you are free to deploy your own custom analyzers if the built-in analyzers do not serve your purpose. Let's see analysis in detail and how we can define analyzers for fields.
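As a rough illustration of what tokenization means, the sketch below lowercases a piece of text and splits it wherever a non-letter character appears, which is approximately how Elasticsearch's built-in simple analyzer behaves. It is only an approximation in plain Python, and the function name `simple_analyze` is hypothetical; real analyzers run inside Elasticsearch.

```python
import re


def simple_analyze(text):
    """Lowercase the text and return the runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


tokens = simple_analyze("Elasticsearch Blueprints: 15 chapters, well-indexed!")
print(tokens)
# ['elasticsearch', 'blueprints', 'chapters', 'well', 'indexed']
```

Note that the number 15 and the punctuation are dropped, and the hyphenated word becomes two tokens. Choices like these are exactly what differ from one analyzer to another.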