Elasticsearch 7.0 Cookbook - Fourth Edition

By : Alberto Paro

Overview of this book

Elasticsearch is a Lucene-based distributed search server that allows users to index and search unstructured content with petabytes of data. With this book, you'll be guided through comprehensive recipes on what's new in Elasticsearch 7, and see how to create and run complex queries and analytics. Packed with recipes on performing index mapping, aggregation, and scripting using Elasticsearch, this fourth edition of Elasticsearch Cookbook will get you acquainted with numerous solutions and quick techniques for performing both everyday and uncommon tasks, such as deploying Elasticsearch nodes, integrating other tools with Elasticsearch, and creating different visualizations. You will install Kibana to monitor a cluster and also extend it using a variety of plugins. Finally, you will integrate your Java, Scala, Python, and big data applications, such as Apache Spark and Pig, with Elasticsearch, and create efficient data applications powered by enhanced functionalities and custom plugins. By the end of this book, you will have gained in-depth knowledge of implementing Elasticsearch architecture, and you'll be able to manage, search, and store data efficiently and effectively using Elasticsearch.

Setting up an ingestion node

The main goals of Elasticsearch are indexing, searching, and analytics, but it's often required to modify or enhance the documents before storing them in Elasticsearch.

The following are the most common scenarios:

  • Preprocessing the log string to extract meaningful data
  • Enriching the content of textual fields with Natural Language Processing (NLP) tools
  • Enriching the content using machine learning (ML) computed fields
  • Adding data modification or transformation during ingestion, such as the following:
    • Converting an IP address into geolocation data
    • Adding datetime fields at ingestion time
    • Building custom fields (via scripting) at ingestion time
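
The transformations listed above map naturally onto ingest pipeline processors (covered in depth in Chapter 12, Using the Ingest Module). As a sketch, a hypothetical pipeline combining a GeoIP lookup, an ingestion timestamp, and a scripted field might look like this (the pipeline name and field names are illustrative, not from the recipe):

```json
PUT _ingest/pipeline/my-enrich-pipeline
{
  "description": "Sketch: geo lookup, ingest timestamp, and a scripted field",
  "processors": [
    { "geoip": { "field": "client_ip" } },
    { "set": { "field": "ingested_at", "value": "{{_ingest.timestamp}}" } },
    { "script": { "source": "ctx.full_name = ctx.first_name + ' ' + ctx.last_name" } }
  ]
}
```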

Getting ready

You need a working Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe, as well as a simple text editor to change configuration files.

How to do it…

To set up an ingest node, you need to edit the config/elasticsearch.yml file and set the node.ingest property to true, as follows:

node.ingest: true
Every time you change your elasticsearch.yml file, a node restart is required.
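
After the restart, you can check which roles each node exposes. In Elasticsearch 7.x, the node.role column of the _cat/nodes API shows one letter per role (for example, d for data, i for ingest, and m for master-eligible):

```
GET _cat/nodes?v&h=name,node.role
```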

How it works…

The default configuration for Elasticsearch is to set the node as an ingest node (refer to Chapter 12, Using the Ingest module, for more information on the ingestion pipeline).

As with the coordinator node, using an ingest node is a way to add functionality to Elasticsearch without compromising cluster safety.

If you want to prevent a node from being used for ingestion, you need to disable it with node.ingest: false. It's a best practice to disable ingestion on master and data nodes to prevent ingestion errors and to protect the cluster. The coordinator node is the best candidate to act as an ingest node.

If you are using NLP, attachment extraction (via the attachment ingest plugin), or log ingestion, the best practice is to have a pool of coordinator nodes (no master, no data) with ingestion active.
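
A node in such a pool would be configured as a coordinating-only node with ingestion enabled. A minimal elasticsearch.yml sketch for this setup (using the pre-7.9 role settings that this book covers) is as follows:

```yaml
# Dedicated coordinating + ingest node:
# not master-eligible and holds no data
node.master: false
node.data: false
node.ingest: true
```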

The attachment and NLP plugins in previous versions of Elasticsearch ran on standard data or master nodes. This caused a lot of problems for Elasticsearch due to the following reasons:

  • High CPU usage by NLP algorithms, which saturated all the CPUs on the data node, resulting in poor indexing and searching performance
  • Instability due to badly formatted attachments and/or bugs in Apache Tika (the library used for managing document extraction)
  • NLP or ML algorithms requiring a lot of CPU or stressing the Java garbage collector, thereby decreasing the performance of the node

The best practice is to have a pool of coordinator nodes with ingestion enabled to provide the best safety for the cluster and ingestion pipeline.
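
Before routing production traffic through such an ingestion pool, a pipeline can be tried against sample documents with the _simulate endpoint; the processor and field names in this sketch are illustrative:

```json
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "set": { "field": "ingested_at", "value": "{{_ingest.timestamp}}" } }
    ]
  },
  "docs": [
    { "_source": { "message": "test document" } }
  ]
}
```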

There's more…

Now that you know about the four kinds of Elasticsearch nodes, you can easily see that a robust architecture designed to work with Elasticsearch should be similar to this one: