We explored the Elasticsearch architecture and how Elasticsearch achieves scalability in a distributed environment. Hadoop also works in a distributed environment. In this section, we will explore how ES-Hadoop combines the capabilities of these two distributed systems.
We are already familiar with the shard as the unit of parallelism in Elasticsearch. The more shards we have, the more parallelism we get, provided that different shards don't compete for the same resources. Similarly, you may already be aware that a split represents the unit of parallelization in Hadoop. An InputSplit represents the data input for one mapper. When we run a Hadoop job, the InputFormat divides the input into several InputSplits. Each split is then passed on to an individual mapper for further processing.
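To make the idea of splits concrete, here is a minimal sketch in Python of how an input might be divided into splits, with one mapper invoked per split. It is a toy model, not Hadoop's actual implementation: the InputSplit class, the get_splits and run_mapper functions, the file path, and the sizes are all hypothetical names and values chosen for illustration, and a real InputFormat also takes details such as data locality into account.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InputSplit:
    """Toy stand-in for a Hadoop input split: a byte range of one file."""
    path: str
    start: int
    length: int


def get_splits(path: str, file_size: int, split_size: int) -> List[InputSplit]:
    """Divide file_size bytes into consecutive splits of at most split_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        # The last split may be shorter than split_size.
        length = min(split_size, file_size - offset)
        splits.append(InputSplit(path, offset, length))
        offset += length
    return splits


def run_mapper(split: InputSplit) -> str:
    """Hypothetical mapper: reports which byte range it was handed."""
    end = split.start + split.length
    return f"mapper processed {split.path}[{split.start}:{end}]"


# Assumed example values: a 300-byte input and a 128-byte split size.
splits = get_splits("/data/logs.txt", file_size=300, split_size=128)
results = [run_mapper(s) for s in splits]
for line in results:
    print(line)
```

With these assumed sizes, the input is cut into three splits, so three mappers can work independently, which is the sense in which a split is the unit of parallelization.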
The following image shows how ES-Hadoop makes the clusters of Hadoop and Elasticsearch talk to each other:
Here, we can see the...