Elasticsearch for Hadoop

By Vishal Shukla


Giving Spark to Elasticsearch


Spark is a distributed computing system that offers a significant performance improvement over Hadoop's MapReduce. It is built on the abstraction of Resilient Distributed Datasets (RDDs), which can be created from any data residing in Hadoop. Unsurprisingly, ES-Hadoop integrates easily with Spark by enabling the creation of RDDs from data stored in Elasticsearch.
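For example, with the elasticsearch-spark connector jar on the classpath, an RDD can be created directly from an Elasticsearch index. The following Scala snippet is a minimal sketch rather than the book's own listing; the node address and the crime/events index name are assumptions made purely for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._  // adds esRDD() and saveToEs() to SparkContext/RDDs

    object EsReadExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("es-read-example")
          .set("es.nodes", "localhost:9200")  // assumed local Elasticsearch node

        val sc = new SparkContext(conf)

        // Each element is a (documentId, Map[fieldName -> value]) pair
        val esRdd = sc.esRDD("crime/events")  // hypothetical index/type
        println(s"Documents read from Elasticsearch: ${esRdd.count()}")

        sc.stop()
      }
    }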

Spark's growing support for data sources such as HDFS, Parquet, Avro, S3, Cassandra, relational databases, and streaming data makes it particularly well suited to data integration. This means that, using ES-Hadoop along with Spark, you can easily bring data from all of these sources into Elasticsearch.
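To illustrate that direction of the flow, the following Scala sketch indexes an RDD built from an in-memory collection into Elasticsearch; in a real job, the RDD could just as well be loaded from HDFS, Cassandra, or S3. The sensor/readings index and the field names are purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.elasticsearch.spark._  // adds saveToEs() to RDDs

    object EsWriteExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("es-write-example")
          .set("es.nodes", "localhost:9200")  // assumed local Elasticsearch node
        val sc = new SparkContext(conf)

        // In a real job, this RDD could come from any of the sources mentioned above
        val readings = sc.makeRDD(Seq(
          Map("sensor" -> "s1", "temperature" -> 21.4),
          Map("sensor" -> "s2", "temperature" -> 19.8)
        ))

        // Each Map becomes one JSON document in the target index/type
        readings.saveToEs("sensor/readings")  // hypothetical index/type

        sc.stop()
      }
    }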

Setting up Spark

To set up Apache Spark to execute a job, perform the following steps:

  1. Download the Apache Spark distribution with the following command:

    $ sudo wget -O /usr/local/spark.tgz http://www.apache.org/dyn/closer.cgi/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz...