Fast Data Processing with Spark - Second Edition

By: Krishna Sankar, Holden Karau
Overview of this book

Spark is a framework for writing fast, distributed programs. It solves problems similar to those addressed by Hadoop MapReduce, but with a fast in-memory approach and a clean functional-style API. With its ability to integrate with Hadoop, and with built-in tools for interactive query analysis (Spark SQL), large-scale graph processing and analysis (GraphX), and real-time analysis (Spark Streaming), Spark can be used interactively to quickly process and query big datasets.

Fast Data Processing with Spark - Second Edition covers how to write distributed programs with Spark. The book guides you through every step required to write effective distributed programs, from setting up your cluster and interactively exploring the API to developing analytics applications and tuning them for your purposes.

Loading a simple text file


While running a Spark shell connected to an existing cluster, you should see something specifying the app ID, such as "Connected to Spark cluster with app ID app-20130330015119-0001". The app ID will match the application entry shown in the master's web UI under running applications (the master's web UI is served on port 8080 by default, while each running application serves its own UI on port 4040). If you have not yet pointed the shell at a cluster, see the sketch below.
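
As a minimal sketch, assuming a standalone cluster, you can launch the shell against the cluster as follows; spark://masterhost:7077 is a placeholder for your own master URL:

./bin/spark-shell --master spark://masterhost:7077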

Start by downloading a dataset to use for some experimentation. There are a number of datasets put together for The Elements of Statistical Learning, and they come in a very convenient form to use. Grab the spam dataset using the following command:

wget http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data

Alternatively, you can find the spam dataset in the GitHub repository at https://github.com/xsankar/fdps-vii.
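
If you prefer, you can clone the entire repository instead (the exact location of spam.data within the repository may vary):

git clone https://github.com/xsankar/fdps-vii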

Now, load it as a text file into Spark with the following command inside your Spark shell:

scala> val inFile = sc.textFile("./spam.data")

This loads the spam.data file into Spark, with each line being a separate entry in the resulting RDD[String].
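
To quickly check what you have loaded, you can inspect the RDD from the shell. This is a small sketch using standard RDD calls: first() returns the first line of the file, and count() returns the number of lines (count() is an action, so it forces Spark to actually read the data, since textFile itself is lazy):

scala> inFile.first()
scala> inFile.count()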