Index
A
- accumulators
- about / The Spark computing framework
- ACT
- URL / The use case
- Alternating Least Squares (ALS) algorithm / Collaborative filtering
- Apache Spark
- URL / Spark computing
- Apache Spark Notebooks
- about / Apache Spark notebooks
- attrition prediction
- about / Spark for attrition prediction
- use case / The use case
- Spark computing / Spark computing
- attrition prediction, methods
- about / Methods of attrition prediction
- regression models / Regression models
- decision trees / Decision trees
- automation
- about / Repeatability and automation
- datasets preprocessing, workflows / Dataset preprocessing workflows
- autoregressive-moving average (ARMA) / About time series
- autoregressive integrated moving average (ARIMA) / About time series
B
- Berkeley Data Analytics Stack (BDAS)
- about / Data cleaning in Spark
- broadcast variables
- about / The Spark computing framework
C
- churn prediction
- with Spark / Spark for churn prediction
- use case / The use case
- parallel computing / Spark computing
- feature preparation / Feature preparation
- model estimation / Model estimation
- Spark implementation, with MLlib / Spark implementation with MLlib
- model evaluation / Model evaluation
- results, explaining / Results explanation
- impact of interventions, calculating / Calculating the impact of interventions
- deployment / Deployment
- scoring / Scoring
- intervention recommendations / Intervention recommendations
- churn prediction, feature preparation
- feature extraction / Feature extraction
- feature selection / Feature selection
- churn prediction, methods
- about / Methods for churn prediction
- regression models / Regression models
- decision trees / Decision trees and Random forest
- Random Forest / Decision trees and Random forest
- cluster analysis
- reference link / Cluster analysis
- confusion matrix
- about / Confusion matrix and false positive ratios
- and error ratios / The confusion matrix and error ratios
- Cross Industry Standard Process for Data Mining (CRISP-DM)
- about / ML as a step-by-step workflow
D
- data
- preparing / Data and feature preparation
- merging / Data merging
- data and feature preparation
- about / Data and feature preparation
- OpenRefine, using / OpenRefine
- Databricks notebook
- about / Spark notebooks
- URL / Spark notebooks
- Databricks Workspace
- data cleaning
- about / Data cleaning
- data incompleteness, dealing with / Dealing with data incompleteness
- in Spark / Data cleaning in Spark
- with SampleClean / Data cleaning made easy
- DataFrame
- dataframe API
- for R / Dataframes API for R
- URL / Dataframes API for R
- Data Scientist Workbench
- about / Apache Spark notebooks
- URL / Data cleaning
- dataset reorganization
- about / Dataset reorganizing
- tasks / Dataset reorganizing tasks
- with Spark SQL / Dataset reorganizing with Spark SQL
- with R / Dataset reorganizing with R on Spark
- datasets
- loading / Accessing and loading datasets, Loading datasets into Spark
- accessing / Accessing publicly available datasets
- references / Accessing publicly available datasets
- exploring / Exploring and visualizing datasets
- visualizing / Exploring and visualizing datasets
- joining / Dataset joining
- joining, with Spark SQL / Dataset joining and its tool – the Spark SQL
- joining, in Spark / Dataset joining in Spark, Dataset joining with the R data table package
- datasets preprocessing
- workflows / Dataset preprocessing workflows
- with Spark pipeline / Spark pipelines for dataset preprocessing
- automation / Dataset preprocessing automation
- data treatment, with SPSS
- about / Data treatment with SPSS
- data nodes, missing on SPSS modeler / Missing data nodes on SPSS modeler
- decision trees
- about / Decision trees, Decision trees
- for churn prediction / Decision trees and Random forest
- URL / Decision trees and Random forest
- code, preparing for / Preparing for coding
- deployment
- about / Deployment
- rules / Rules
- deployment, holistic view
- about / Deployment
- dashboard / Dashboard
- rules / Rules
- deployment, open data
- about / Deployment
- deployment, risk scoring
- about / Deployment
- scoring / Scoring
- Directed Acyclic Graph (DAG)
- about / Spark advantages, ML workflow examples
- distributed computing
- about / Distributed computing
E
- entity resolution
- about / Entity resolution
- short string comparison / Short string comparison
- long string comparison / Long string comparison
- record deduplication / Record deduplication
F
- False Negative (Type II Error) / Model evaluation
- False Positive (FP) error rate / ROC
- False Positive (Type I Error) / Model evaluation
- false positive ratios
- feature
- preparing / Data and feature preparation
- selecting / Feature selection
- feature development, Telco Data
- about / Data and feature development
- data, reorganizing / Data reorganizing
- feature selection / Feature development and selection
- feature extraction
- about / Feature extraction
- challenges / Feature development challenges
- with Spark MLlib / Feature development with Spark MLlib
- with R / Feature development with R
- preparation / Feature preparation
- from LogFile / Feature extraction from LogFile
- data, merging / Data merging
- feature preparation
- about / Feature preparation
- feature development / Feature development
- feature selection / Feature selection
- feature preparation, holistic view
- about / Feature preparation
- PCA / PCA
- grouping by category / Grouping by category to use subject knowledge
- feature selection / Feature selection
- feature preparation, open data
- about / Data and feature preparation
- data, cleaning / Data cleaning
- data, merging / Data merging
- feature development / Feature development
- feature selection / Feature selection
- FORECAST R package
- reference link / RMSE calculation with R
- fraud detection
- about / Spark for fraud detection
- use case / The use case
- distributed computing / Distributed computing
- methods / Methods for fraud detection
- Random forest / Random forest
- decision trees / Decision trees
- deploying / Deploying fraud detection
- rules / Rules
- scoring / Scoring
G
- GraphX
- about / Spark overview
H
- holistic view, Spark
- about / Spark for a holistic view
- use case / The use case
- fast and easy computing / Fast and easy computing
- methods / Methods for a holistic view
I
- IBM Data Scientist Workbench
- reference / Apache Spark notebooks
- URL / Spark computing
- IBM Predictive Extensions
- installing / SPSS on Spark
- IBM SystemML
- URL / Other ML libraries
- identity matching
- about / Identity matching
- identity issues / Identity issues
- on Spark / Identity matching on Spark
- entity resolution / Entity resolution
- with SampleClean / Identity matching made better
- crowdsourced deduplication / Crowdsourced deduplication
- crowd, configuring / Configuring the crowd
- crowd, using / Using the crowd
J
- Jupyter notebook
- reference / Apache Spark notebooks
K
- Knitr package
- installing / Step 2: Installing the Knitr package
- Kolmogorov-Smirnov (KS) / Kolmogorov-Smirnov
L
- Last Observation Carried Forward (LOCF)
- linear regression
- about / Regression models
- LogFile
- feature extraction / Feature extraction from LogFile
- logistic regression
- about / Regression models, About regression
M
- machine learning
- Spark, computing / Spark computing for machine learning
- machine learning (ML)
- notebook approach / Notebook approach for ML
- machine learning algorithms
- about / Machine learning algorithms
- machine learning methods, Telco Data
- about / Methods for learning from Telco Data
- descriptive statistics / Descriptive statistics and visualization
- visualization / Descriptive statistics and visualization
- linear regression model / Linear and logistic regression models
- logistic regression model / Linear and logistic regression models
- random forest / Decision tree and random forest
- decision tree / Decision tree and random forest
- methods, for holistic view
- about / Methods for a holistic view
- regression modeling / Regression modeling
- SEM approach / The SEM approach
- decision trees / Decision trees
- methods, for recommendation
- about / Methods for recommendation
- collaborative filtering / Collaborative filtering
- coding, preparing / Preparing coding
- methods, for risk scoring
- logistic regression / Logistic regression
- coding, preparing in R / Preparing coding in R
- Random Forest / Random forest and decision trees
- decision trees / Random forest and decision trees
- coding, preparing / Preparing coding
- ML frameworks
- MLlib
- about / MLlib, Feature development
- URL / MLlib, Principal components analysis
- SystemML / Other ML libraries
- implementing, for model estimation / MLlib implementation
- URL, for feature selection / Feature selection
- used, for RMSE calculation / RMSE calculation with MLlib
- MLlib
- URL / PCA
- MLlib, parameters
- numBlocks / Collaborative filtering
- rank / Collaborative filtering
- iterations / Collaborative filtering
- lambda / Collaborative filtering
- implicitPrefs / Collaborative filtering
- alpha / Collaborative filtering
- MLlib - PMML model export
- URL / Deployment
- MLlib feature extraction
- URL / Feature extraction
- MLlib guide
- reference / Collaborative filtering
- ML workflows
- about / ML workflows and Spark pipelines, ML as a step-by-step workflow
- examples / ML workflow examples
- model deployment, Telco Data
- about / Model deployment
- alerts, sending / Rules to send out alerts
- scores, producing / Scores subscribers for churn and for Call Center calls
- purchase propensity, predicting / Scores subscribers for purchase propensity
- model estimation
- about / Model estimation, Model estimation, Model estimation
- MLlib, implementing / MLlib implementation
- R notebooks, implementing / R notebooks implementation
- Spark implementation, with Zeppelin notebook / Spark implementation with the Zeppelin notebook
- Spark implementation, with R notebook / Spark implementation with the R notebook
- model estimation, holistic view
- about / Model estimation
- MLlib implementation / MLlib implementation
- R notebooks implementation / The R notebooks' implementation
- model estimation, open data
- about / Model estimation
- SPSS Analytics Server / SPSS on Spark – SPSS Analytics Server
- model evaluation / Model evaluation
- RMSE, calculating with MLlib / RMSE calculations with MLlib
- RMSE, calculating with R / RMSE calculations with R
- model estimation, recommendation
- about / Model estimation
- SPSS on Spark / SPSS on Spark – the SPSS Analytics server
- model estimation, risk scoring
- about / Model estimation
- DataScientistWorkbench for R Notebooks / The DataScientistWorkbench for R notebooks
- R Notebooks implementation / R notebooks implementation
- model estimation, Telco Data
- about / Model estimation
- SPSS Analytics Server / SPSS on Spark – SPSS Analytics Server
- model evaluation
- about / Model evaluation, Model evaluation, A quick evaluation, Model evaluation
- performing / A quick evaluation
- confusion matrix / Confusion matrix and false positive ratios
- false positive ratios / Confusion matrix and false positive ratios
- confusion matrix and error ratios / The confusion matrix and error ratios
- RMSE calculation, with MLlib / RMSE calculation with MLlib
- RMSE calculation, with R / RMSE calculation with R
- model evaluation, holistic view
- about / Model evaluation
- quick evaluations / Quick evaluations
- RMSE / RMSE
- ROC curves / ROC curves
- model evaluation, recommendation
- about / Model evaluation
- model evaluation, risk scoring
- about / Model evaluation
- confusion matrix / Confusion matrix
- ROC / ROC
- Kolmogorov-Smirnov (KS) / Kolmogorov-Smirnov
- model evaluation, Telco Data
- about / Model evaluation
- RMSE, calculating with MLlib / RMSE calculations with MLlib
- RMSE, calculating with R / RMSE calculations with R
- error ratios, calculating with MLlib / Confusion matrix and error ratios with MLlib and R
- confusion matrix, calculating with R / Confusion matrix and error ratios with MLlib and R
N
- notebook approach
- for machine learning (ML) / Notebook approach for ML
O
- open data
- use case / Spark for learning from open data, The use case
- reference link / The use case
- Spark, computing / Spark computing
- scoring / Methods for scoring and ranking
- ranking / Methods for scoring and ranking
- cluster analysis / Cluster analysis
- principal component analysis (PCA) / Principal component analysis
- regression models / Regression models
- score, resembling / Score resembling
- OpenRefine
- about / OpenRefine
- URL / Data cleaning
P
- PCA
- PipelineStages
- about / ML workflow examples
- Predictive Model Markup Language (PMML) / Deployment, Visualizing trends
- Principal Component Analysis (PCA) / Feature selection
- principal component analysis (PCA)
- about / Principal component analysis
- URL / Principal component analysis
- Principal components analysis (PCA)
- about / Principal components analysis
- Subject knowledge aid / Subject knowledge aid
R
- R
- dataframe API / Dataframes API for R
- dataset reorganization / Dataset reorganizing with R on Spark
- feature extraction / Feature development with R
- used, for RMSE calculation / RMSE calculation with R
- Random forest
- about / Random forest
- reference link / Random forest
- for churn prediction / Decision trees and Random forest
- URL / Decision trees and Random forest
- Receiver Operating Characteristic curve (ROC) / ROC
- recommendation deployment
- about / Recommendation deployment
- recommendations, on Spark
- Spark, for recommendation engine / Apache Spark for a recommendation engine
- regression models
- for churn prediction / Regression models
- linear regression / Regression models, About regression, About regression
- logistic regression / Regression models, About regression, About regression
- about / Regression models, Regression models
- code, preparing for / Preparing for coding
- coding, preparation steps / Preparing for coding
- repeatability
- about / Repeatability and automation
- ReporteRs R package
- Research Methods Four Elements (RM4Es)
- Resilient Distributed Dataset (RDD)
- about / Spark advantages, Spark RDD
- results
- about / Results explanation
- interventions impact, calculating / Calculating the impact of interventions
- main causes impact, calculating / Calculating the impact of main causes
- scoring / Scoring
- explanation / Explanations of the results
- biggest influencers / Biggest influencers
- trends, visualizing / Visualizing trends
- results, open data
- about / Results explanation
- ranks, comparing / Comparing ranks
- impacts, predicting / Biggest influencers
- alerts, sending / Rules for sending out alerts
- school districts, ranking / Scores for ranking school districts
- results, Telco Data
- about / Results explanation
- descriptive statistics / Descriptive statistics and visualizations
- visualizations / Descriptive statistics and visualizations
- impacts, analyzing / Biggest influencers
- insights / Special insights
- trends, visualizing / Visualizing trends
- results explanation
- about / Results explanation
- influencing variables / Big influencers and their impacts
- results explanation, holistic view
- about / Results explanation
- impacts assessments / Impact assessments
- results explanation, risk scoring
- about / Results explanation
- big influencers / Big influencers and their impacts
- risk scoring
- methods / Methods of risk scoring
- R Markdown
- about / Notebook approach for ML
- R studio, downloading / Step 1: Getting the software ready
- Knitr package, installing / Step 2: Installing the Knitr package
- report, creating / Step 3: Creating a simple report
- RMSE (Root-Mean-Square Error)
- about / Model evaluation, RMSE
- example / RMSE
- RMSE calculation
- with MLlib / RMSE calculation with MLlib
- with R / RMSE calculation with R
- R notebook
- references / Apache Spark notebooks
- used, for Spark implementation / Spark implementation with the R notebook
- R notebooks
- implementing, for model estimation / R notebooks implementation
- R Notebooks implementation
- about / R notebooks implementation
- logistic regression / R notebooks implementation
- Random Forest / R notebooks implementation
- decision tree / R notebooks implementation
- ROC (Receiver Operating Characteristic)
- about / Model evaluation
- ROCR
- URL / A quick evaluation
- Root Mean Square Error (RMSE)
- about / Model evaluation, Model evaluation
- R package PMML
- reference / Deployment
- R studio
S
- SampleClean
- used, for data cleaning / Data cleaning made easy
- URL / Data cleaning made easy, Record deduplication
- used, for identity matching / Identity matching made better
- service forecasting, Spark used
- about / Spark for service forecasting
- use case / The use case
- use case, reference links / The use case
- computing / Spark computing
- methods / Methods of service forecasting
- regression models / Regression models
- shared variables
- broadcast variables / The Spark computing framework
- accumulators / The Spark computing framework
- Spark
- overview / Spark overview and Spark advantages, Spark overview
- advantages / Spark overview and Spark advantages, Spark advantages
- URL / Spark overview
- URL, for documentation / Spark overview
- reference link / Spark advantages
- computing, for machine learning / Spark computing for machine learning
- holistic view / Spark for a holistic view
- used, for service forecasting / Spark for service forecasting
- Spark, for recommendation engine
- use case / The use case
- SPSS on Spark / SPSS on Spark
- Spark, for risk scoring
- about / Spark for risk scoring
- use case / The use case
- Apache Spark Notebooks / Apache Spark notebooks
- spark-ts library
- reference link / Preparing for coding
- Spark computing
- about / Spark computing
- Spark computing framework
- about / The Spark computing framework
- Spark dataframe
- about / Spark dataframes
- URL / Spark dataframes
- Spark DataSource API
- Spark implementation
- Zeppelin notebook, using / Spark implementation with the Zeppelin notebook
- R notebook, using / Spark implementation with the R notebook
- Spark MLlib
- feature extraction / Feature development with Spark MLlib
- URL / Feature development with Spark MLlib
- Spark notebooks
- about / Spark notebooks
- notebook approach, for machine learning (ML) / Notebook approach for ML
- Databricks notebook / Spark notebooks
- Spark pipeline
- about / ML workflows and Spark pipelines
- URL / ML workflow examples
- used, for datasets preprocessing / Spark pipelines for dataset preprocessing
- Spark RDD
- SparkSQL
- about / Feature extraction from LogFile
- Spark SQL
- used, for dataset reorganization / Dataset reorganizing with Spark SQL
- URL / Dataset reorganizing with Spark SQL, Dataset joining in Spark
- datasets, joining / Dataset joining and its tool – the Spark SQL
- SPSS Analytics Server
- about / SPSS on Spark – the SPSS Analytics server
- SPSS on Spark / SPSS on Spark
- SQLContext
- Structural Equation Modeling (SEM) / The SEM approach
- SystemML
- about / Other ML libraries
T
- Telco Data
- using / Spark for using Telco Data
- use case / The use case
- Spark, computing for / Spark computing
- machine learning methods / Methods for learning from Telco Data
- time series modeling
- about / Time series modeling
- reference link / About time series
- coding, preparation steps / Preparing for coding
- trends, visualizing
- about / Visualizing trends
- sending out alerts, rules / The rules of sending out alerts
- city zones, ranking scores / Scores to rank city zones
- True Positive (TP) error rate / ROC
Z
- Zeppelin
- URL / Distributed computing
- Zeppelin notebook
- URL / Spark computing
- used, for Spark implementation / Spark implementation with the Zeppelin notebook
- Zeppelin / Apache Spark notebooks