Spark Cookbook

By: Rishi Yadav

Introduction


Spark can process data from various data sources such as HDFS, Cassandra, HBase, and relational databases. Big data frameworks (unlike relational database systems) do not enforce a schema at write time. HDFS is a perfect example: any arbitrary file is welcome during the write phase. Reading data is a different story, however. You need to give some structure to even completely unstructured data to make sense of it. Once the data has this structure, SQL comes in very handy for analysis.
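The idea of imposing structure only at read time can be illustrated with a minimal sketch. It does not use Spark or HDFS; as an assumption made purely for illustration, it relies on Python's standard library, with an in-memory string standing in for a schema-less file and `sqlite3` standing in for the SQL engine. The sample data, the `(first_name, last_name, age)` structure, and the function names `read_with_schema` and `people_older_than` are all hypothetical.

```python
import csv
import io
import sqlite3

# Raw text as it might sit in a file store. Nothing validated these lines
# when they were written, so some of them do not fit any structure.
RAW_LINES = """alice,smith,34
bob,jones,52
this line has no structure at all
carol,lee,47
dave,kim,unknown
"""


def read_with_schema(raw_text):
    """Apply an assumed (first_name, last_name, age) structure while reading."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        if len(fields) != 3:
            continue  # the line does not fit the assumed structure
        first, last, age = (field.strip() for field in fields)
        if not age.isdigit():
            continue  # age must be a non-negative integer
        rows.append((first, last, int(age)))
    return rows


def people_older_than(rows, min_age):
    """Load the structured rows into an in-memory table and query it with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE person (first_name TEXT, last_name TEXT, age INTEGER)"
        )
        conn.executemany("INSERT INTO person VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT first_name, age FROM person "
            "WHERE age > ? ORDER BY age, first_name",
            (min_age,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    structured = read_with_schema(RAW_LINES)
    print(people_older_than(structured, 40))
```

The write side accepts anything, while the read side decides which lines fit the structure and drops the rest. Only after that step does a SQL query become possible.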

Spark SQL is a relatively new component in the Spark ecosystem, first introduced in Spark 1.0. It incorporates a project named Shark, which was an attempt to make Hive run on Spark.

Hive is essentially a relational abstraction that converts SQL queries into MapReduce jobs.

Shark replaced the MapReduce part with Spark while retaining most of the code base.

Initially, this worked fine, but Spark developers soon hit roadblocks and could not optimize it any further...