Elasticsearch for Hadoop

We have explored the Elasticsearch architecture and how Elasticsearch scales in a distributed environment. Hadoop is also a distributed system. In this section, we will explore how ES-Hadoop builds on both systems to combine their capabilities.

Dynamic parallelism

We are already familiar with the shard as the unit of parallelism in Elasticsearch. The more shards we have, the more parallelism we get, provided that the shards don't compete for the same resources. Similarly, you may already know that a split is the unit of parallelism in Hadoop. An InputSplit represents the input data for one mapper. When we run a Hadoop job, the InputFormat divides the input into several InputSplits, and each split is passed to an individual mapper for further processing.
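To make the idea of a split concrete, here is a minimal sketch in plain Python. It is only a simplified model of the concept, not Hadoop's actual API: the names `compute_splits` and `run_mapper`, the in-memory `records` list, and the split size of 4 are all assumptions made for illustration. The input is divided into splits, and each split is handed to its own "mapper", which here simply sums the records it receives.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_splits(total_records, split_size):
    """Divide the range [0, total_records) into (offset, length) splits.

    Hypothetical helper: it mimics the role an InputFormat plays when it
    divides the input, but it is not Hadoop code.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    return [
        (offset, min(split_size, total_records - offset))
        for offset in range(0, total_records, split_size)
    ]


def run_mapper(records, split):
    """Process exactly one split, the way one mapper handles one InputSplit."""
    offset, length = split
    return sum(records[offset:offset + length])


# Ten records, at most four records per split (both values are assumptions).
records = list(range(1, 11))
splits = compute_splits(len(records), 4)

# One worker per split: the number of splits bounds the parallelism.
with ThreadPoolExecutor(max_workers=len(splits)) as pool:
    partials = list(pool.map(lambda s: run_mapper(records, s), splits))

total = sum(partials)
```

In this model, the number of splits sets the upper bound on how many mappers can work at the same time, in the same way that the number of shards bounds parallelism on the Elasticsearch side.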

The following image shows how ES-Hadoop makes the clusters of Hadoop and Elasticsearch talk to each other:

Here, we can see the...


By: Vishal Shukla
