Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrame (known as SchemaRDD in earlier versions of Spark) and also acts as a distributed SQL query engine.
A DataFrame is a distributed collection of Row objects, each representing a record, organized under a schema. DataFrames can be created from external data sources, from the results of queries, or from regular RDDs. A DataFrame can be registered as a temporary table and then queried with SQLContext.sql or HiveContext.sql. This recipe shows how to work with DataFrames.
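As a quick orientation, here is a minimal sketch of the three creation paths just described, assuming an existing SparkContext sc and SQLContext sqlContext; the input path and the example data are hypothetical. The complete, runnable versions of these steps follow in the recipe.

import sqlContext.implicits._

// 1. From an external data source (hypothetical path)
val fromJson = sqlContext.read.json("/tmp/people.json")

// 2. From a regular RDD of tuples
val fromRdd = sc.parallelize(Seq(("Ann", 25), ("Bob", 31))).toDF("name", "age")

// 3. From the result of a query against a registered temporary table
fromRdd.registerTempTable("people_tmp")
val fromQuery = sqlContext.sql("SELECT name FROM people_tmp WHERE age > 30")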
To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos.
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object JSONDataFrame {
  def main(args: Array[String]) {
    val conf = new SparkConf
    conf.setMaster("spark://master:7077")
    conf.setAppName("sql_Sample")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Read the JSON file into a DataFrame
    val df = sqlContext.read.json("/home/padma/Sparkdev/spark-1.6.0/examples/src/main/resources/people.json")

    df.show()
    df.printSchema()
    df.select("name").show()
    df.select("name", "age").show()
    df.select(df("name"), df("age") + 4).show()
    df.groupBy("age").count().show()
    // describe() takes column names as separate arguments and computes
    // summary statistics for numeric columns, so only age is passed here
    df.describe("age").show()
  }
}
object DataFrames {
  case class Person(name: String, age: Int)

  def main(args: Array[String]) {
    val conf = new SparkConf
    conf.setMaster("spark://master:7077")
    conf.setAppName("DataFramesApp")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Build an RDD of Person objects from the text file and convert it to a DataFrame
    val peopleDf = sc.textFile("/home/padma/Sparkdev/spark-1.6.0/examples/src/main/resources/people.txt")
      .map(line => line.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()

    // Register the DataFrame as a temporary table so it can be queried with SQL
    peopleDf.registerTempTable("people")

    val teenagers = sqlContext.sql(
      "SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
  }
}
// Reusing the SparkConf created earlier
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Write two DataFrames into two partition directories (key=1 and key=2)
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("/home/padma/Sparkdev/SparkApp/test_table/key=1")

val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("/home/padma/Sparkdev/SparkApp/test_table/key=2")

// Read the whole table back; mergeSchema combines the columns of both partitions
val df3 = sqlContext.read.option("mergeSchema", "true")
  .parquet("/home/padma/Sparkdev/SparkApp/test_table")
df3.show()
Initially, the JSON file is read into a DataFrame, and APIs such as show(), printSchema(), select(), and groupBy() can be invoked on it. In the second code snippet, an RDD is created from the text file, its fields are mapped to the case class Person, and the RDD is converted to a DataFrame using toDF. This DataFrame, peopleDf, is registered as a temporary table named people using registerTempTable(). The people table can then be queried using SQLContext.sql.
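For illustration only, the same teenager filter can also be expressed with the DataFrame DSL instead of a SQL string, assuming the peopleDf from the previous snippet:

// Equivalent filter using the DataFrame API rather than SQL
val teenagersDsl = peopleDf.filter(peopleDf("age") >= 13 && peopleDf("age") <= 19)
teenagersDsl.select("name", "age").show()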
The final code snippet shows how to write a DataFrame as a Parquet file using df1.write.parquet() and how to read the Parquet files back using sqlContext.read.parquet().
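As a side note, rather than writing into key=1 and key=2 directories by hand, DataFrameWriter.partitionBy can lay out the partition directories automatically. A small sketch, with a hypothetical output path and example data:

// partitionBy writes one sub-directory per distinct value of the column
val people = sc.parallelize(Seq(("Ann", 25, "IN"), ("Bob", 31, "US")))
  .toDF("name", "age", "country")
people.write.partitionBy("country").parquet("/tmp/people_by_country")
// Produces /tmp/people_by_country/country=IN and .../country=US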
In addition, Spark SQL provides HiveContext, which gives access to Hive tables, UDFs, SerDes, and HiveQL. DataFrames can be created either by converting an RDD or programmatically by specifying a schema. Different data sources, such as JSON, Parquet, and Avro, can be handled, and SQL queries can be run directly on files. Data from other databases can also be read using JDBC. Spark 1.6.0 introduced a new abstraction called Dataset, which provides the benefits of Spark SQL's optimized execution engine over RDDs.
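The following sketch illustrates two of these features with Spark 1.6 APIs; the schema, example rows, and variable names are ours for illustration, and sc, sqlContext, and the Person case class from the earlier snippet are assumed to be in scope:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Creating a DataFrame programmatically: build the schema at runtime
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))
val rowRdd = sc.parallelize(Seq(Row("Ann", 25), Row("Bob", 31)))
val dynamicDf = sqlContext.createDataFrame(rowRdd, schema)
dynamicDf.show()

// Dataset (experimental in 1.6): a typed collection that still runs on
// Spark SQL's optimized execution engine
import sqlContext.implicits._
val ds = Seq(Person("Ann", 25), Person("Bob", 31)).toDS()
ds.filter(_.age > 28).show()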
For more information on Spark SQL, please visit: http://spark.apache.org/docs/latest/sql-programming-guide.html. The earlier Working with the Spark programming model, Working with Spark's Python and Scala shells, and Working with pair RDDs recipes covered the initial steps in Spark and the basics of RDDs.