In Spark you can read data from many sources, but with NoSQL datastores such as HBase, Accumulo, and Cassandra you have only a limited query subset, and you often need to scan all the data to extract only the part you require. Using Elasticsearch, you can retrieve just the subset of documents that matches your Elasticsearch query.
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
You also need a working installation of Apache Spark and the data indexed in the previous recipe.
To read data in Elasticsearch via Apache Spark, we will perform the following steps:
We need to start the Spark Shell:
./bin/spark-shell
We import the required classes:
import org.elasticsearch.spark._
Now we can create an RDD by reading data from Elasticsearch:
val rdd=sc.esRDD("spark/persons")
We can watch...
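The steps above can be combined into a single sketch. This assumes the elasticsearch-spark connector is on the spark-shell classpath and that the spark/persons index was populated in the previous recipe; the query string and the age field are illustrative assumptions, not part of the original example. The esRDD method also accepts a second argument, an Elasticsearch query, so filtering is pushed down to the cluster instead of scanning every document:

```scala
// A minimal sketch, assuming the elasticsearch-spark connector is available
// and a "spark/persons" index exists (the "age" field is a hypothetical example)
import org.elasticsearch.spark._

// Push the query down to Elasticsearch: only matching documents are read
val filtered = sc.esRDD("spark/persons", "?q=age:[30 TO *]")

// Each element is a (documentId, Map[String, AnyRef]) pair
filtered.take(10).foreach { case (id, doc) =>
  println(s"$id -> $doc")
}
```

Because the query is evaluated by Elasticsearch itself, only the matching subset travels over the network to Spark, which is the advantage over full-scan NoSQL reads described at the start of this recipe.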