A lot of important data lies in relational databases that Spark applications need to query. JdbcRDD is a Spark feature that allows relational tables to be loaded as RDDs. This recipe explains how to use JdbcRDD.
Spark SQL, which is introduced in the next chapter, includes a data source for JDBC. It should be preferred over the approach in this recipe because it returns results as DataFrames (also introduced in the next chapter), which can be easily processed by Spark SQL and joined with other data sources.
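For orientation, the Spark SQL alternative mentioned above looks roughly like the following sketch. The connection URL, database name, and credentials are placeholder assumptions, not values from this recipe:

```scala
// Sketch: loading a MySQL table as a DataFrame via the Spark SQL JDBC data source.
// Assumes a SQLContext named sqlContext and a reachable MySQL instance;
// the URL, database, user, and password below are hypothetical placeholders.
val personDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")  // placeholder URL
  .option("dbtable", "person")
  .option("user", "dbuser")          // placeholder credentials
  .option("password", "dbpassword")
  .load()

personDF.show()
```

Because the result is a DataFrame, it can be registered as a temporary table and joined against data loaded from any other source.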
Please make sure that the JDBC driver JAR is visible on the client node and on all the worker nodes on which the executors will run.
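One common way to make the driver JAR visible everywhere is to pass it at submission time. This is a sketch; the JAR path and version shown are assumptions, so substitute the actual location of your MySQL connector:

```shell
# Ship the JDBC driver to the driver and all executors at submit time.
# The path and connector version here are placeholders.
spark-submit --jars /path/to/mysql-connector-java-5.1.34.jar my-app.jar
```

Alternatively, the JAR can be placed on the classpath of every node, but the `--jars` flag avoids having to copy it to each worker by hand.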
Perform the following steps to load data from relational databases:
Create a table named person in MySQL using the following DDL (note that MySQL identifiers are quoted with backticks, not single quotes):

CREATE TABLE `person` (
  `person_id` int(11) NOT NULL AUTO_INCREMENT,
  `first_name` varchar(30) DEFAULT NULL,
  `last_name` varchar(30) DEFAULT NULL,
  `gender` char(1) DEFAULT NULL,
  PRIMARY...
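Once the table exists and is populated, it can be loaded as an RDD with JdbcRDD. The sketch below shows the general shape of the call; the connection URL, credentials, and ID bounds are placeholder assumptions. Note that the query must contain exactly two `?` placeholders, which Spark binds to the lower and upper bound of each partition:

```scala
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Sketch: loading the person table as an RDD.
// Assumes a SparkContext named sc; the URL, credentials,
// and bounds (1 to 1000, 3 partitions) are hypothetical.
val personRDD = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/mydb", "dbuser", "dbpassword"),
  "SELECT first_name, last_name FROM person WHERE person_id >= ? AND person_id <= ?",
  1,     // lower bound of person_id
  1000,  // upper bound of person_id
  3,     // number of partitions
  (rs: ResultSet) => (rs.getString("first_name"), rs.getString("last_name"))
)

personRDD.collect().foreach(println)
```

The bounds and partition count control how Spark splits the query: each partition issues the SQL with its own sub-range bound to the two `?` placeholders, so rows are fetched in parallel.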