Index
A
- accumulate
- Alternating Least Squares (ALS) algorithm
- about / Recommendation
- reference link / Recommendation
- Amazon Machine Images (AMI) / Running Spark on EC2 with the scripts
- architecture, Spark SQL
- about / The Spark SQL architecture
B
- basic statistics, Spark MLlib examples
- about / Basic statistics
- broadcast
C
- Chef
- about / Deploying Spark with Chef (Opscode)
- URL / Deploying Spark with Chef (Opscode)
- Spark, deploying with / Deploying Spark with Chef (Opscode)
- URL, for cookbook / Deploying Spark with Chef (Opscode)
- classification, Spark MLlib examples
- about / Classification
- clustering, Spark MLlib examples
- about / Clustering
- code testable
- making / Making your code testable
- commands, quick start
- community developed packages
- about / Community developed packages
- concurrency, limitations
- about / Concurrency limitations
- memory usage, and garbage collection / Memory usage and garbage collection
- serialization / Serialization
- IDE integration / IDE integration
- custom serializers
- references / Serialization
D
- data
- loading, from S3 / Interactively loading data from S3
- loading, into RDD / Loading data into an RDD
- saving / Saving your data
- data files, GitHub
- reference link / SQL access to a simple data table
- directory
- organization / Directory organization and convention
- convention / Directory organization and convention
- references / Directory organization and convention
- doctest / Testing in Python
- double RDD functions
- about / Double RDD functions
- sampleStdev / Double RDD functions
- stats / Double RDD functions
- stdev / Double RDD functions
- sum / Double RDD functions
- variance / Double RDD functions
E
- EC2
- Spark, running on / Running Spark on EC2, Running Spark on EC2 with the scripts
- EC2 command line tools
- references / Running Spark on EC2 with the scripts
- EC2 scripts, Amazon
- Elastic MapReduce (EMR)
- Spark, deploying on / Deploying Spark on Elastic MapReduce
- ENSIME (ENhanced Scala Interaction Mode for Emacs)
- about / IDE integration
- URL / IDE integration
F
- files
- saving, to Parquet / Saving files to the Parquet format
- loading, to Parquet / Loading Parquet files
- flatMap function
- functions, for joining Pair RDDs
- about / Functions for joining PairRDDs
- coGroup / Functions for joining PairRDDs
- join / Functions for joining PairRDDs
- subtractByKey / Functions for joining PairRDDs
- functions, on JavaPairRDDs
- about / Functions on JavaPairRDDs
- cogroup / Functions on JavaPairRDDs
- combineByKey / Functions on JavaPairRDDs
- collectAsMap / Functions on JavaPairRDDs
- countByKey / Functions on JavaPairRDDs
- flatMapValues / Functions on JavaPairRDDs
- join / Functions on JavaPairRDDs
- keys / Functions on JavaPairRDDs
- lookup / Functions on JavaPairRDDs
- reduceByKey / Functions on JavaPairRDDs
- sortByKey / Functions on JavaPairRDDs
- values / Functions on JavaPairRDDs
G
- general RDD functions
- about / General RDD functions
- aggregate / General RDD functions
- cache / General RDD functions
- collect / General RDD functions
- count / General RDD functions
- countByValue / General RDD functions
- distinct / General RDD functions
- filter / General RDD functions
- filterWith / General RDD functions
- first / General RDD functions
- flatMap / General RDD functions
- fold / General RDD functions
- foreach / General RDD functions
- groupBy / General RDD functions
- keyBy / General RDD functions
- map / General RDD functions
- mapPartitions / General RDD functions
- mapPartitionsWithIndex / General RDD functions
- mapWith / General RDD functions
- persist / General RDD functions
- pipe / General RDD functions
- sample / General RDD functions
- takeSample / General RDD functions
- toDebugString / General RDD functions
- union / General RDD functions
- unpersist / General RDD functions
- zip / General RDD functions
- GitHub repository
- reference link, for data files / Spark MLlib examples
H
- HBase
- about / HBase
- data, loading / Loading from HBase
- data, saving / Saving to HBase
- metadata, obtaining / Other HBase operations
I
- Impala
- Parquet files, querying / Querying Parquet files with Impala
- interactions
- testing, with SparkContext / Testing interactions with SparkContext
J
- Java
- SparkContext object, creating in / Java
- RDD, manipulating in / Manipulating your RDD in Scala and Java
- using, as testing library / Testing in Java and Scala
- Java RDD functions
- about / Java RDD functions, Common Java RDD functions
- Spark Java function classes / Spark Java function classes
- common Java RDD functions / Common Java RDD functions
- cache / Common Java RDD functions
- coalesce / Common Java RDD functions
- collect / Common Java RDD functions
- count / Common Java RDD functions
- countByValue / Common Java RDD functions
- distinct / Common Java RDD functions
- filter / Common Java RDD functions
- first / Common Java RDD functions
- flatMap / Common Java RDD functions
- fold / Common Java RDD functions
- foreach / Common Java RDD functions
- groupBy / Common Java RDD functions
- map / Common Java RDD functions
- mapPartitions / Common Java RDD functions
- reduce / Common Java RDD functions
- sample / Common Java RDD functions
L
- lambda
- latest development source, Spark
- references / Downloading the source
- linear regression, Spark MLlib examples
- about / Linear regression
- logistic regression
- running, Spark shell used / Using the Spark shell to run logistic regression
- logs
- finding / Where to find logs
M
- mailing lists
- about / Mailing lists
- references / Mailing lists
- map
- map function
- massively parallel processing (MPP)
- Maven
- Spark job, building with / Building your Spark job with Maven
- Maven installation instructions
- references / Compiling the source with Maven
- Mesos
- about / Deploying Spark on Mesos
- Spark, deploying on / Deploying Spark on Mesos
- URL / Deploying Spark on Mesos
- metadata, SparkContext object
- about / SparkContext – metadata
- appName / SparkContext – metadata
- getConf / SparkContext – metadata
- getExecutorMemoryStatus / SparkContext – metadata
- master / SparkContext – metadata
- version / SparkContext – metadata
- methods, for combining JavaRDDs
- about / Methods for combining JavaRDDs
- subtract / Methods for combining JavaRDDs
- union / Methods for combining JavaRDDs
- zip / Methods for combining JavaRDDs
- multiple tables
- handling, with Spark SQL / Handling multiple tables with Spark SQL
N
- non-data-driven methods, SparkContext object
- addJar(path) / Shared Java and Scala APIs
- addFile(path) / Shared Java and Scala APIs
- stop() / Shared Java and Scala APIs
- clearFiles() / Shared Java and Scala APIs
- clearJars() / Shared Java and Scala APIs
P
- package index site
- reference link / Community developed packages
- pair RDD functions
- about / PairRDD functions
- collectAsMap / PairRDD functions
- reduceByKey / PairRDD functions
- countByKey / PairRDD functions
- join / PairRDD functions
- rightOuterJoin / PairRDD functions
- leftOuterJoin / PairRDD functions
- combineByKey / PairRDD functions
- zip / PairRDD functions
- groupByKey / PairRDD functions
- cogroup / PairRDD functions
- PairRDD functions
- about / Other PairRDD functions
- lookup / Other PairRDD functions
- mapValues / Other PairRDD functions
- collectAsMap / Other PairRDD functions
- countByKey / Other PairRDD functions
- partitionBy / Other PairRDD functions
- flatMapValues / Other PairRDD functions
- Parquet
- about / Parquet – an efficient and interoperable big data format
- files, saving / Saving files to the Parquet format
- files, loading / Loading Parquet files
- processed RDD, saving / Saving processed RDD in the Parquet format
- Parquet files
- querying, with Impala / Querying Parquet files with Impala
- Personal Package Archive (PPA) / Building your Spark project with sbt
- prebuilt distribution
- installing / Installing prebuilt distribution
- processed RDD
- saving, in Parquet / Saving processed RDD in the Parquet format
- PySpark / Testing in Python
- Python
- Spark shell, running in / Running Spark shell in Python
- SparkContext object, creating in / Python
- RDD, manipulating in / Manipulating your RDD in Python
- Python testing, of Spark / Testing in Python
Q
- QuickStart VM
R
- RDD
- about / RDDs
- data, loading into / Loading data into an RDD
- manipulating, in Scala / Manipulating your RDD in Scala and Java
- manipulating, in Java / Manipulating your RDD in Scala and Java
- manipulating, in Python / Manipulating your RDD in Python
- references / PairRDD functions
- recommendation, Spark MLlib examples
- about / Recommendation
- reference link / Recommendation
- reduce
- Resilient Distributed Dataset (RDD) / Spark topology, Loading a simple text file
- Run-Length Encoding (RLE)
S
- S3
- data, loading from / Interactively loading data from S3
- sbt
- Spark project, building with / Building your Spark project with sbt
- Scala
- SparkContext object, creating in / Scala
- RDD, manipulating in / Manipulating your RDD in Scala and Java
- Scala APIs / Shared Java and Scala APIs
- Scala RDD functions
- about / Scala RDD functions
- foldByKey / Scala RDD functions
- reduceByKey / Scala RDD functions
- groupByKey / Scala RDD functions
- ScalaTest
- using, as testing library / Testing in Java and Scala
- security
- about / A quick note on security
- shared Java APIs / Shared Java and Scala APIs
- simple text file
- loading / Loading a simple text file
- single machine
- about / A single machine
- source
- Spark, building from / Building Spark from source
- spam dataset, GitHub link
- Spark
- URL, for downloading / Installing prebuilt distribution
- building, from source / Building Spark from source
- URL, for building from source / Building Spark from source
- URL, for downloading latest source / Downloading the source
- installation, testing / Testing the installation
- running, on EC2 / Running Spark on EC2
- reference link, for running Spark on EC2 / Running Spark on EC2
- running on EC2, with scripts / Running Spark on EC2 with the scripts
- deploying, on Elastic MapReduce (EMR) / Deploying Spark on Elastic MapReduce
- deploying, with Chef / Deploying Spark with Chef (Opscode)
- deploying, on Mesos / Deploying Spark on Mesos
- URL, for configuration details on YARN / Spark on YARN
- standalone mode / Spark Standalone mode
- references / Building your Spark job with something else
- using, with other languages / Using Spark with other languages
- Spark, building from source
- about / Building Spark from source
- source, downloading / Downloading the source
- source, compiling with Maven / Compiling the source with Maven
- compilation switches / Compilation switches
- Spark, on YARN
- about / Spark on YARN
- SparkContext
- references / Python
- interactions, testing with / Testing interactions with SparkContext
- SparkContext object
- creating, in Scala / Scala
- creating, in Java / Java
- metadata / SparkContext – metadata
- creating, in Python / Python
- Spark documentation
- URL, for configuration / Memory usage and garbage collection
- URL, for RDDs / Memory usage and garbage collection
- Spark Java function classes
- about / Spark Java function classes
- Function<T,R> / Spark Java function classes
- DoubleFunction<T> / Spark Java function classes
- PairFunction<T, K, V> / Spark Java function classes
- FlatMapFunction<T, R> / Spark Java function classes
- PairFlatMapFunction<T, K, V> / Spark Java function classes
- DoubleFlatMapFunction<T> / Spark Java function classes
- Function2<T1, T2, R> / Spark Java function classes
- Spark job
- building, with Maven / Building your Spark job with Maven
- building / Building your Spark job with something else
- Spark machine learning algorithm table
- Spark MLlib examples
- about / Spark MLlib examples
- basic statistics / Basic statistics
- linear regression / Linear regression
- classification / Classification
- clustering / Clustering
- recommendation / Recommendation
- Spark project
- building, with sbt / Building your Spark project with sbt
- Spark shell
- used, for running logistic regression / Using the Spark shell to run logistic regression
- running, in Python / Running Spark shell in Python
- Spark SQL
- architecture / The Spark SQL architecture
- overview / Spark SQL how-to in a nutshell
- multiple tables, handling with / Handling multiple tables with Spark SQL
- references / Aftermath
- Spark SQL programming
- about / Spark SQL programming
- SQL access, to simple data table / SQL access to a simple data table
- Spark SQL programming guide
- reference link / Spark SQL programming
- Spark topology
- about / Spark topology
- SQL scripts, Northwind database
- reference link / Spark SQL programming
- standalone mode, Spark
- reference link / Spark Standalone mode
- standard RDD functions
- about / Standard RDD functions
- flatMap / Standard RDD functions
- mapPartitions / Standard RDD functions
- filter / Standard RDD functions
- distinct / Standard RDD functions
- union / Standard RDD functions
- cartesian / Standard RDD functions
- groupBy / Standard RDD functions
- pipe / Standard RDD functions
- foreach / Standard RDD functions
- reduce / Standard RDD functions
- fold / Standard RDD functions
- countByValue / Standard RDD functions
- take / Standard RDD functions
- partitionBy / Standard RDD functions
T
- testing
- references / Testing in Python
- type inference / Manipulating your RDD in Scala and Java
Y
- YARN
- about / Spark on YARN