ElasticSearch Cookbook

By : Alberto Paro

Overview of this book

ElasticSearch is one of the most promising NoSQL technologies available and is built to provide a scalable search solution with built-in support for near real-time search and multi-tenancy. This practical guide is a complete reference for using ElasticSearch and covers the whole ElasticSearch ecosystem. We start by showing you how to choose the correct transport layer, communicate with the server, and create custom internal actions tailored to your needs. Starting with the basics of the ElasticSearch architecture and how to efficiently index, search, and execute analytics on it, you will learn how to extend ElasticSearch by scripting and how to monitor its behaviour. Step by step, this book will help you improve your ability to manage indexed data with more tailored mappings, and to search and execute analytics with facets. The book also covers how to integrate ElasticSearch with Python and Java applications. This comprehensive guide will allow you to master storing, searching, and analyzing data with ElasticSearch.

Understanding cluster, replication, and sharding

Closely related to shard management are two key concepts: replication and cluster status.

Getting ready

You need one or more running nodes to have a cluster. To test a real cluster, you need at least two nodes (they can run on the same machine).

How it works...

An index can have one or more replicas. Shards are called primary if they are part of the master index and secondary if they are part of a replica.

To maintain consistency in write operations the following workflow is executed:

The write is first executed in the primary shard.
If the primary write succeeds, it is propagated to all the secondary shards in parallel.
If a primary shard dies, a secondary one is elected as primary (if available) and the flow is re-executed.
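
The write workflow above can be sketched with a toy in-memory model. This is only an illustration of the three steps, not how ElasticSearch is implemented: the `Shard` and `ShardGroup` classes are hypothetical names, and the secondaries are updated one after another here rather than in parallel.

```python
class Shard:
    """A toy in-memory shard (hypothetical): a dict of document id -> document."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.alive = True

    def write(self, doc_id, doc):
        if not self.alive:
            raise RuntimeError(f"shard {self.name} is down")
        self.docs[doc_id] = doc


class ShardGroup:
    """One primary shard plus its secondary (replica) copies."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = list(secondaries)

    def index(self, doc_id, doc):
        try:
            # Step 1: the write is executed on the primary shard first.
            self.primary.write(doc_id, doc)
        except RuntimeError:
            # Step 3: the primary died, so promote a live secondary
            # (if one is available) and re-execute the whole flow.
            live = [s for s in self.secondaries if s.alive]
            if not live:
                raise
            self.primary = live[0]
            self.secondaries.remove(live[0])
            return self.index(doc_id, doc)
        # Step 2: only after the primary write succeeds is it propagated
        # to the secondaries (sequentially in this sketch).
        for shard in self.secondaries:
            if shard.alive:
                shard.write(doc_id, doc)
        return self.primary.name


group = ShardGroup(Shard("p0"), [Shard("r0a"), Shard("r0b")])
group.index("1", {"title": "first"})
group.primary.alive = False           # simulate a primary failure
group.index("2", {"title": "second"})  # "r0a" is promoted and the write is retried
```

After the simulated failure, the promoted shard already holds the first document because it received it as a secondary, which is why the index stays complete.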

During search operations, a valid set of shards is chosen at random from the primary and secondary shards to improve performance.
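
As a rough sketch of what "a valid set of shards" means, the following picks one copy (primary or secondary) of every shard at random. The function name and the layout dictionary are assumptions made for illustration. The real selection logic inside ElasticSearch is more involved.

```python
import random


def choose_search_copies(copies_by_shard, rng=random):
    """Pick one copy of each shard at random.

    copies_by_shard maps a shard number to the names of its available
    copies (primary and secondaries). The result covers every shard
    exactly once, so the search sees the whole index.
    """
    return {
        shard_id: rng.choice(copies)
        for shard_id, copies in copies_by_shard.items()
    }


# Hypothetical layout: two shards, each with a primary and one secondary.
layout = {0: ["primary-0", "secondary-0"], 1: ["primary-1", "secondary-1"]}
chosen = choose_search_copies(layout)
```

Because any copy can serve a search, adding replicas spreads the read load across more nodes.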

The following figure shows an example of a possible shard configuration:

Best practice

In order to prevent data loss and to have High Availability, it's good to have at least one replica so that your system can survive a node failure without downtime and without loss of data.

There's more…

Related to the concept of replication is the health indicator of your cluster.

It can be in one of three states:

Green: Everything is OK.
Yellow: Some secondary shards are not allocated, but the cluster is still usable.
Red: "Houston, we have a problem". Some primary shards are missing.
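
The three states can be expressed as a small decision function. This is a minimal sketch over a made-up data structure (the dictionary keys are assumptions), meant only to show the order in which the conditions are checked.

```python
def cluster_status(shards):
    """Classify cluster health from per-shard allocation information.

    Each item in shards is a dict with the hypothetical keys
    'primary_allocated' (bool), 'replicas_allocated' (int) and
    'replicas_expected' (int).
    """
    # Red: at least one primary shard is missing.
    if any(not s["primary_allocated"] for s in shards):
        return "red"
    # Yellow: all primaries are there, but some replicas are not allocated.
    if any(s["replicas_allocated"] < s["replicas_expected"] for s in shards):
        return "yellow"
    # Green: every primary and every replica is allocated.
    return "green"


example = [
    {"primary_allocated": True, "replicas_allocated": 1, "replicas_expected": 1},
    {"primary_allocated": True, "replicas_allocated": 0, "replicas_expected": 1},
]
status = cluster_status(example)  # one replica is unallocated
```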

How to solve the yellow status

Yellow status is mainly caused by shards that are not allocated. If your cluster is recovering, just wait, provided that the nodes have enough space for your shards.
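
One way to tell whether a yellow cluster is still recovering is to read the cluster health document over HTTP. The sketch below assumes a node listening on `http://localhost:9200` and treats a non-zero `initializing_shards` count as "recovery in progress". That rule of thumb is an assumption of this example, not an official definition.

```python
import json
from urllib.request import urlopen


def fetch_cluster_health(base_url="http://localhost:9200"):
    """Fetch the cluster health document (the node address is an assumption)."""
    with urlopen(base_url + "/_cluster/health") as response:
        return json.loads(response.read().decode("utf-8"))


def yellow_advice(health):
    """Suggest what to do for a cluster health document in yellow status."""
    if health["status"] != "yellow":
        return "not yellow"
    # Heuristic: shards that are still initializing mean recovery is under way.
    if health.get("initializing_shards", 0) > 0:
        return "recovery in progress: wait"
    return "unassigned shards remain: check node count and replicas"


# Sample health document, trimmed to the fields used above.
sample = {"status": "yellow", "initializing_shards": 2, "unassigned_shards": 3}
advice = yellow_advice(sample)
```

Against a live node, the same check would be `yellow_advice(fetch_cluster_health())`.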

If your cluster is still in yellow status even after recovery, you don't have enough nodes to contain your replicas, so you can either reduce the number of replicas or add the required number of nodes.
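
Reducing the number of replicas is done through the index settings. The following sketch builds such a request. The index name `test-index` and the node address are hypothetical, and the `Content-Type` header is included because recent ElasticSearch versions require it.

```python
import json
from urllib.request import Request


def replica_settings_request(index, replicas, base_url="http://localhost:9200"):
    """Build (but do not send) a request that sets the replica count of an index."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical index name; pass the request to urllib.request.urlopen to send it.
request = replica_settings_request("test-index", 0)
```

Setting the value to 0 removes all replicas, which clears the yellow status on a single-node cluster at the cost of losing redundancy.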

Best practice

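The total number of nodes must be greater than the maximum number of replicas, because a replica is never allocated on the same node as its primary shard. For example, one replica requires at least two nodes.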

How to solve the red status

When you have lost data (that is, one or more shards are missing), you need to try restoring the node(s) that are missing. If your nodes restart and the system goes back to yellow or green status, you are safe. Otherwise, you have lost data and your cluster is not usable. In this case, delete the affected index/indices and restore them from a backup (if you have one) or from other sources.

Best practice

To prevent data loss, I suggest always having at least two nodes and the number of replicas set to 1.

Tip

Having one or more replicas on different nodes on different machines gives you a live, always up-to-date backup of your data.
