Apache Spark for Data Science Cookbook
This recipe shows how to initialize the SparkContext object, which is part of any Spark application. SparkContext is the object that allows us to create the base RDDs; every Spark application must create one to interact with Spark. It is also used to initialize StreamingContext, SQLContext, and HiveContext.
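As a minimal sketch of that last point (assuming the Spark 1.x APIs used in this recipe and that the spark-sql, spark-hive, and spark-streaming dependencies are available; the master URL and batch interval are illustrative), the other contexts are constructed from an existing SparkContext:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Contexts Example").setMaster("local[2]")
val sc = new SparkContext(conf)

// SQLContext and HiveContext are built on top of an existing SparkContext
val sqlContext = new SQLContext(sc)
val hiveContext = new HiveContext(sc)

// StreamingContext wraps the SparkContext together with a batch interval
val ssc = new StreamingContext(sc, Seconds(10))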
To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also install Hadoop (optional), Scala, and Java. Please download the data from the following location:
https://github.com/ChitturiPadma/datasets/blob/master/stocks.txt
Let's see how to initialize SparkContext. Launch the Spark shell (Scala), the PySpark shell (Python), or the SparkR shell (R); each of them creates a SparkContext automatically and exposes it as sc:
$SPARK_HOME/bin/spark-shell --master <master type>
Spark context available as sc.
$SPARK_HOME/bin/pyspark --master <master type>
SparkContext available as sc
$SPARK_HOME/bin/sparkR --master <master type>
Spark context is available as sc
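The sc object created by the shell can be used immediately. As a quick sketch in the Scala shell (assuming stocks.txt has already been copied to HDFS at the path used in the standalone applications below):
scala> val data = sc.textFile("hdfs://namenode:9000/stocks.txt")
scala> data.count()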
Next, initialize SparkContext in standalone applications written in Scala, Java, and Python.
Scala:
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample {
  def main(args: Array[String]) {
    val stocksPath = "hdfs://namenode:9000/stocks.txt"
    // Configure the application name and the master URL
    val conf = new SparkConf().setAppName("Counting Lines").setMaster("spark://master:7077")
    // SparkContext is the entry point to Spark functionality
    val sc = new SparkContext(conf)
    // Create an RDD from the HDFS file with two partitions and count its lines
    val data = sc.textFile(stocksPath, 2)
    val totalLines = data.count()
    println("Total number of Lines: %s".format(totalLines))
  }
}
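To run the Scala application on the cluster, package it as a JAR and submit it with spark-submit; the JAR name below is illustrative:
$SPARK_HOME/bin/spark-submit --class SparkContextExample --master spark://master:7077 spark-context-example.jar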
Java:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkContextExample {
  public static void main(String[] args) {
    String stocks = "hdfs://namenode:9000/stocks.txt";
    // Configure the application name and the master URL
    SparkConf conf = new SparkConf().setAppName("Counting Lines").setMaster("spark://master:7077");
    // JavaSparkContext is the Java wrapper around SparkContext
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Create an RDD from the HDFS file and count its lines
    JavaRDD<String> logData = sc.textFile(stocks);
    long totalLines = logData.count();
    System.out.println("Total number of Lines " + totalLines);
  }
}
Python:
from pyspark import SparkContext

stocks = "hdfs://namenode:9000/stocks.txt"
# SparkContext takes the master URL and the application name
sc = SparkContext("<master URI>", "ApplicationName")
# Create an RDD from the HDFS file and count its lines
data = sc.textFile(stocks)
totalLines = data.count()
print("Total Lines are: %i" % (totalLines))
In the preceding code snippets, new SparkContext(conf), new JavaSparkContext(conf), and SparkContext("<master URI>", "ApplicationName") initialize SparkContext in three different languages: Scala, Java, and Python. SparkContext is the starting point for Spark functionality. It represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
SparkContext is created in the driver program and holds the connection to the cluster; the base RDDs are created through it. Because SparkContext is not serializable, it cannot be shipped to the workers, and only one SparkContext can be active per application. For Streaming applications and the Spark SQL module, StreamingContext and SQLContext are created on top of SparkContext.
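As a short sketch of those other constructs (the stock symbol used here is purely illustrative), an accumulator and a broadcast variable are created through the same sc:
// Broadcast a lookup value to the workers and count matching lines with an accumulator
val symbol = sc.broadcast("AAPL")
val matches = sc.accumulator(0)
val data = sc.textFile("hdfs://namenode:9000/stocks.txt")
data.foreach(line => if (line.contains(symbol.value)) matches += 1)
println("Lines mentioning the symbol: " + matches.value)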
To understand more about the SparkContext object and its methods, please refer to this documentation page: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.SparkContext.