Book Image

Elasticsearch Essentials

Book Image

Elasticsearch Essentials

Overview of this book

With constantly evolving and growing datasets, organizations have the need to find actionable insights for their business. ElasticSearch, which is the world's most advanced search and analytics engine, brings the ability to make massive amounts of data usable in a matter of milliseconds. It not only gives you the power to build blazing fast search solutions over a massive amount of data, but can also serve as a NoSQL data store. This guide will take you on a tour to become a competent developer quickly with a solid knowledge level and understanding of the ElasticSearch core concepts. Starting from the beginning, this book will cover these core concepts, setting up ElasticSearch and various plugins, working with analyzers, and creating mappings. This book provides complete coverage of working with ElasticSearch using Python and performing CRUD operations and aggregation-based analytics, handling document relationships in the NoSQL world, working with geospatial data, and taking data backups. Finally, we’ll show you how to set up and scale ElasticSearch clusters in production environments as well as providing some best practices.
Table of Contents (18 chapters)
Elasticsearch Essentials
Credits
About the Author
Acknowledgments
About the Reviewer
www.PacktPub.com
Preface
Index

Introducing Elasticsearch


Elasticsearch is a distributed, full text search and analytic engine that is build on top of Lucene, a search engine library written in Java, and is also a base for Solr. After its first release in 2010, Elasticsearch has been widely adopted by large as well as small organizations, including NASA, Wikipedia, and GitHub, for different use cases. The latest releases of Elasticsearch are focusing more on resiliency, which builds confidence in users being able to use Elasticsearch as a data storeage tool, apart from using it as a full text search engine. Elasticsearch ships with sensible default configurations and settings, and also hides all the complexities from beginners, which lets everyone become productive very quickly by just learning the basics.

The primary features of Elasticsearch

Lucene is a blazing fast search library but it is tough to use directly and has very limited features to scale beyond a single machine. Elasticsearch comes to the rescue to overcome all the limitations of Lucene. Apart from providing a simple HTTP/JSON API, which enables language interoperability in comparison to Lucene's bare Java API, it has the following main features:

  • Distributed: Elasticsearch is distributed in nature from day one, and has been designed for scaling horizontally and not vertically. You can start with a single-node Elasticsearch cluster on your laptop and can scale that cluster to hundreds or thousands of nodes without worrying about the internal complexities that come with distributed computing, distributed document storage, and searches.

  • High Availability: Data replication means having multiple copies of data in your cluster. This feature enables users to create highly available clusters by keeping more than one copy of data. You just need to issue a simple command, and it automatically creates redundant copies of the data to provide higher availabilities and avoid data loss in the case of machine failure.

  • REST-based: Elasticsearch is based on REST architecture and provides API endpoints to not only perform CRUD operations over HTTP API calls, but also to enable users to perform cluster monitoring tasks using REST APIs. REST endpoints also enable users to make changes to clusters and indices settings dynamically, rather than manually pushing configuration updates to all the nodes in a cluster by editing the elasticsearch.yml file and restarting the node. This is possible because each resource (index, document, node, and so on) in Elasticsearch is accessible via a simple URI.

  • Powerful Query DSL: Query DSL (domain-specific language) is a JSON interface provided by Elasticsearch to expose the power of Lucene to write and read queries in a very easy way. Thanks to the Query DSL, developers who are not aware of Lucene query syntaxes can also start writing complex queries in Elasticsearch.

  • Schemaless: Being schemaless means that you do not have to create a schema with field names and data types before indexing the data in Elasticsearch. Though it is one of the most misunderstood concepts, this is one of the biggest advantages we have seen in many organizations, especially in e-commerce sectors where it's difficult to define the schema in advance in some cases. When you send your first document to Elasticsearch, it tries its best to parse every field in the document and creates a schema itself. Next time, if you send another document with a different data type for the same field, it will discard the document. So, Elasticsearch is not completely schemaless but its dynamic behavior of creating a schema is very useful.

    Note

    There are many more features available in Elasticsearch, such as multitenancy and percolation, which will be discussed in detail in the next chapters.

Understanding REST and JSON

Elasticsearch is based on a REST design pattern and all the operations, for example, document insertion, deletion, updating, searching, and various monitoring and management tasks, can be performed using the REST endpoints provided by Elasticsearch.

What is REST?

In a REST-based web API, data and services are exposed as resources with URLs. All the requests are routed to a resource that is represented by a path. Each resource has a resource identifier, which is called as URI. All the potential actions on this resource can be done using simple request types provided by the HTTP protocol. The following are examples that describe how CRUD operations are done with REST API:

  • To create the user, use the following:

    POST /user
    fname=Bharvi&lname=Dixit&age=28&id=123
    
  • The following command is used for retrieval:

    GET /user/123
    
  • Use the following to update the user information:

    PUT /user/123
    fname=Lavleen
    
  • To delete the user, use this:

    DELETE /user/123
    

    Note

    Many Elasticsearch users get confused between the POST and PUT request types. The difference is simple. POST is used to create a new resource, while PUT is used to update an existing resource. The PUT request is used during resource creation in some cases but it must have the complete URI available for this.

What is JSON?

All the real-world data comes in object form. Every entity (object) has some properties. These properties can be in the form of simple key value pairs or they can be in the form of complex data structures. One property can have properties nested into it, and so on.

Elasticsearch is a document-oriented data store where objects, which are called as documents, are stored and retrieved in the form of JSON. These objects are not only stored, but also the content of these documents gets indexed to make them searchable.

JavaScript Object Notation (JSON) is a lightweight data interchange format and, in the NoSQL world, it has become a standard data serialization format. The primary reason behind using it as a standard format is the language independency and complex nested data structure that it supports. JSON has the following data type support:

Array, Boolean, Null, Number, Object, and String

The following is an example of a JSON object, which is self-explanatory about how these data types are stored in key value pairs:

{
  "int_array": [1, 2,3],
  "string_array": ["Lucene" ,"Elasticsearch","NoSQL"],
  "boolean": true,
  "null": null,
  "number": 123,
  "object": {
    "a": "b",
    "c": "d",
    "e": "f"
  },
  "string": "Learning Elasticsearch"
}

Elasticsearch common terms

The following are the most common terms that are very important to know when starting with Elasticsearch:

  • Node: A single instance of Elasticsearch running on a machine.

  • Cluster: A cluster is the single name under which one or more nodes/instances of Elasticsearch are connected to each other.

  • Document: A document is a JSON object that contains the actual data in key value pairs.

  • Index: A logical namespace under which Elasticsearch stores data, and may be built with more than one Lucene index using shards and replicas.

  • Doc types: A doc type in Elasticsearch represents a class of similar documents. A type consists of a name, such as a user or a blog post, and a mapping, including data types and the Lucene configurations for each field. (An index can contain more than one type.)

  • Shard: Shards are containers that can be stored on a single node or multiple nodes and are composed of Lucene segments. An index is divided into one or more shards to make the data distributable.

    Note

    A shard can be either primary or secondary. A primary shard is the one where all the operations that change the index are directed. A secondary shard is the one that contains duplicate data of the primary shard and helps in quickly searching the data as well as for high availability; in a case where the machine that holds the primary shard goes down, then the secondary shard becomes the primary automatically.

  • Replica: A duplicate copy of the data living in a shard for high availability.

Understanding Elasticsearch structure with respect to relational databases

Elasticsearch is a search engine in the first place but, because of its rich functionality offerings, organizations have started using it as a NoSQL data store as well. However, it has not been made for maintaining the complex relationships that are offered by traditional relational databases.

If you want to understand Elasticsearch in relational database terms then, as shown in the following image, an index in Elasticsearch is similar to a database that consists of multiple types. A single row is represented as a document, and columns are similar to fields.

Elasticsearch does not have the concept of referential integrity constraints such as foreign keys. But, despite being a search engine and NoSQL data store, it does allow us to maintain some relationships among different documents, which will be discussed in the upcoming chapters.

With these theoretical concepts, we are good to go with learning the practical steps with Elasticsearch.

First of all, you need to be aware of the basic requirements to install and run Elasticsearch, which are listed as follows:

  • Java (Oracle Java 1.7u55 and above)

  • RAM: Minimum 2 GB

  • Root permission to install and configure program libraries

    Note

    Please go through the following URL to check the JVM and OS dependencies of Elasticsearch: https://www.elastic.co/subscriptions/matrix.

The most common error that comes up if you are using an incompatible Java version with Elasticsearch, is the following:

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/elasticsearch/bootstrap/Elasticsearch : Unsupported major.minor version 51.0
  at java.lang.ClassLoader.defineClass1(Native Method)
  at java.lang.ClassLoader.defineClassCond(ClassLoader.java:637)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
  at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

If you see the preceding error while installing/working with Elasticsearch, it is most probably because you have an incompatible version of JAVA set as the JAVA_HOME variable or not set at all. Many users install the latest version of JAVA but forget to set the JAVA_HOME variable to the latest installation. If this variable is not set, then Elasticsearch looks into the following listed directories to find the JAVA and the first existing directory is used:

/usr/lib/jvm/jdk-7-oracle-x64,  /usr/lib/jvm/java-7-oracle,  /usr/lib/jvm/java-7-openjdk,  /usr/lib/jvm/java-7-openjdk-amd64/,  /usr/lib/jvm/java-7-openjdk-armhf,  /usr/lib/jvm/java-7-openjdk-i386/, /usr/lib/jvm/default-java