Fast Data Processing with Spark - Second Edition

By: Krishna Sankar, Holden Karau
Overview of this book

Spark is a framework for writing fast, distributed programs. It solves problems similar to those addressed by Hadoop MapReduce, but with a fast in-memory approach and a clean functional-style API. With its ability to integrate with Hadoop, and with built-in tools for interactive query analysis (Spark SQL), large-scale graph processing and analysis (GraphX), and real-time analysis (Spark Streaming), Spark can be used interactively to quickly process and query big datasets.

Fast Data Processing with Spark - Second Edition covers how to write distributed programs with Spark. The book guides you through every step required to write effective distributed programs, from setting up your cluster and interactively exploring the API to developing analytics applications and tuning them for your purposes.

Loading a simple text file


While running a Spark shell connected to an existing cluster, you should see something specifying the app ID, such as "Connected to Spark cluster with app ID app-20130330015119-0001". The app ID will match the application entry shown in the master's web UI under running applications (the master's web UI is served on port 8080 by default, while each running application serves its own UI on port 4040). If you have not yet pointed the shell at a cluster, see the sketch below.
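
As a minimal sketch, assuming a standalone cluster, you can launch the shell against the cluster as follows; spark://masterhost:7077 is a placeholder for your own master URL:

./bin/spark-shell --master spark://masterhost:7077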

Start by downloading a dataset to use for some experimentation. There are a number of datasets put together for The Elements of Statistical Learning, and they come in a very convenient form to use. Grab the spam dataset using the following command:

wget http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data

Alternatively, you can find the spam dataset in the GitHub repository at https://github.com/xsankar/fdps-vii.
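
If you prefer, you can clone the entire repository instead (the exact location of spam.data within the repository may vary):

git clone https://github.com/xsankar/fdps-vii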

Now, load it as a text file into Spark with the following command inside your Spark shell:

scala> val inFile = sc.textFile("./spam.data")

This loads the spam.data file into Spark, with each line being a separate entry in the resulting RDD[String].
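
To quickly check what you have loaded, you can inspect the RDD from the shell. This is a small sketch using standard RDD calls: first() returns the first line of the file, and count() returns the number of lines (count() is an action, so it forces Spark to actually read the data, since textFile itself is lazy):

scala> inFile.first()
scala> inFile.count()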