Index
A
- accumulators
- about / The Spark computing framework
- ACT
- URL / The use case
- Alternating Least Squares (ALS) algorithm / Collaborative filtering
- Apache Spark
- URL / Spark computing
- Apache Spark Notebooks
- about / Apache Spark notebooks
- attrition prediction
- about / Spark for attrition prediction
- use case / The use case
- Spark computing / Spark computing
- attrition prediction, methods
- about / Methods of attrition prediction
- regression models / Regression models
- decision trees / Decision trees
- automation
- about / Repeatability and automation
- datasets preprocessing, workflows / Dataset preprocessing workflows
- autoregressive-moving average (ARMA) / About time series
- autoregressive integrated moving average (ARIMA) / About time series
B
- Berkeley Data Analytics Stack (BDAS)
- about / Data cleaning in Spark
- broadcast variables
- about / The Spark computing framework
C
- churn prediction
- with Spark / Spark for churn prediction
- use case / The use case
- parallel computing / Spark computing
- feature preparation / Feature preparation
- model estimation / Model estimation
- Spark implementation, with MLlib / Spark implementation with MLlib
- model evaluation / Model evaluation
- results, explaining / Results explanation
- impact of interventions, calculating / Calculating the impact of interventions
- deployment / Deployment
- scoring / Scoring
- intervention recommendations / Intervention recommendations
- churn prediction, feature preparation
- feature extraction / Feature extraction
- feature selection / Feature selection
- churn prediction, methods
- about / Methods for churn prediction
- regression models / Regression models
- decision trees / Decision trees and Random forest
- Random Forest / Decision trees and Random forest
- cluster analysis
- reference link / Cluster analysis
- confusion matrix
- about / Confusion matrix and false positive ratios
- and error ratios / The confusion matrix and error ratios
- Cross Industry Standard Process for Data Mining (CRISP-DM)
- about / ML as a step-by-step workflow
D
- data
- preparing / Data and feature preparation
- merging / Data merging
- data and feature preparation
- about / Data and feature preparation
- OpenRefine, using / OpenRefine
- Databricks notebook
- about / Spark notebooks
- URL / Spark notebooks
- Databricks Workspace
- data cleaning
- about / Data cleaning
- data incompleteness, dealing with / Dealing with data incompleteness
- in Spark / Data cleaning in Spark
- with SampleClean / Data cleaning made easy
- DataFrame
- dataframe API
- for R / Dataframes API for R
- URL / Dataframes API for R
- Data Scientist Workbench
- about / Apache Spark notebooks
- URL / Data cleaning
- dataset reorganization
- about / Dataset reorganizing
- tasks / Dataset reorganizing tasks
- with Spark SQL / Dataset reorganizing with Spark SQL
- with R / Dataset reorganizing with R on Spark
- datasets
- loading / Accessing and loading datasets, Loading datasets into Spark
- accessing / Accessing publicly available datasets
- references / Accessing publicly available datasets
- exploring / Exploring and visualizing datasets
- visualizing / Exploring and visualizing datasets
- joining / Dataset joining
- joining, with Spark SQL / Dataset joining and its tool – the Spark SQL
- joining, in Spark / Dataset joining in Spark, Dataset joining with the R data table package
- datasets preprocessing
- workflows / Dataset preprocessing workflows
- with Spark pipeline / Spark pipelines for dataset preprocessing
- automation / Dataset preprocessing automation
- data treatment, with SPSS
- about / Data treatment with SPSS
- data nodes, missing on SPSS modeler / Missing data nodes on SPSS modeler
- decision trees
- about / Decision trees, Decision trees
- for churn prediction / Decision trees and Random forest
- URL / Decision trees and Random forest
- code, preparing for / Preparing for coding
- deployment
- about / Deployment
- rules / Rules
- deployment, holistic view
- about / Deployment
- dashboard / Dashboard
- rules / Rules
- deployment, open data
- about / Deployment
- deployment, risk scoring
- about / Deployment
- scoring / Scoring
- Directed Acyclic Graph (DAG)
- about / Spark advantages, ML workflow examples
- distributed computing
- about / Distributed computing
E
- entity resolution
- about / Entity resolution
- short string comparison / Short string comparison
- long string comparison / Long string comparison
- record deduplication / Record deduplication
F
- False Negative (Type II Error) / Model evaluation
- False Positive (FP) error rate / ROC
- False Positive (Type I Error) / Model evaluation
- false positive ratios
- feature
- preparing / Data and feature preparation
- selecting / Feature selection
- feature development, Telco Data
- about / Data and feature development
- data, reorganizing / Data reorganizing
- feature selection / Feature development and selection
- feature extraction
- about / Feature extraction
- challenges / Feature development challenges
- with Spark MLlib / Feature development with Spark MLlib
- with R / Feature development with R
- preparation / Feature preparation
- from LogFile / Feature extraction from LogFile
- data, merging / Data merging
- feature preparation
- about / Feature preparation
- feature development / Feature development
- feature selection / Feature selection
- feature preparation, holistic view
- about / Feature preparation
- PCA / PCA
- grouping by category / Grouping by category to use subject knowledge
- feature selection / Feature selection
- feature preparation, open data
- about / Data and feature preparation
- data, cleaning / Data cleaning
- data, merging / Data merging
- feature development / Feature development
- feature selection / Feature selection
- FORECAST R package
- reference link / RMSE calculation with R
- fraud detection
- about / Spark for fraud detection
- use case / The use case
- distributed computing / Distributed computing
- methods / Methods for fraud detection
- Random forest / Random forest
- decision trees / Decision trees
- deploying / Deploying fraud detection
- rules / Rules
- scoring / Scoring
G
- GraphX
- about / Spark overview
H
- holistic view, Spark
- about / Spark for a holistic view
- use case / The use case
- fast and easy computing / Fast and easy computing
- methods / Methods for a holistic view
I
- IBM Data Scientist Workbench
- reference / Apache Spark notebooks
- URL / Spark computing
- IBM Predictive Extensions
- installing / SPSS on Spark
- IBM SystemML
- URL / Other ML libraries
- identity matching
- about / Identity matching
- identity issues / Identity issues
- on Spark / Identity matching on Spark
- entity resolution / Entity resolution
- with SampleClean / Identity matching made better
- crowdsourced deduplication / Crowdsourced deduplication
- crowd, configuring / Configuring the crowd
- crowd, using / Using the crowd
J
- Jupyter notebook
- reference / Apache Spark notebooks
K
- Knitr package
- installing / Step 2: Installing the Knitr package
- Kolmogorov-Smirnov (KS) / Kolmogorov-Smirnov
L
- Last Observation Carried Forward (LOCF)
- linear regression
- about / Regression models
- LogFile
- feature extraction / Feature extraction from LogFile
- logistic regression
- about / Regression models, About regression
M
- machine learning
- Spark, computing / Spark computing for machine learning
- machine learning (ML)
- notebook approach / Notebook approach for ML
- machine learning algorithms
- about / Machine learning algorithms
- machine learning methods, Telco Data
- about / Methods for learning from Telco Data
- descriptive statistics / Descriptive statistics and visualization
- visualization / Descriptive statistics and visualization
- linear regression model / Linear and logistic regression models
- logistic regression model / Linear and logistic regression models
- random forest / Decision tree and random forest
- decision tree / Decision tree and random forest
- methods, for holistic view
- about / Methods for a holistic view
- regression modeling / Regression modeling
- SEM approach / The SEM approach
- decision trees / Decision trees
- methods, for recommendation
- about / Methods for recommendation
- collaborative filtering / Collaborative filtering
- coding, preparing / Preparing coding
- methods, for risk scoring
- logistic regression / Logistic regression
- coding, preparing in R / Preparing coding in R
- Random Forest / Random forest and decision trees
- decision trees / Random forest and decision trees
- coding, preparing / Preparing coding
- ML frameworks
- MLlib
- about / MLlib, Feature development
- URL / MLlib, Principal components analysis
- SystemML / Other ML libraries
- implementing, for model estimation / MLlib implementation
- URL, for feature selection / Feature selection
- used, for RMSE calculation / RMSE calculation with MLlib
- MLlib
- URL / PCA
- MLlib, parameters
- numBlocks / Collaborative filtering
- rank / Collaborative filtering
- iterations / Collaborative filtering
- lambda / Collaborative filtering
- implicitPrefs / Collaborative filtering
- alpha / Collaborative filtering
- MLlib - PMML model export
- URL / Deployment
- MLlib feature extraction
- URL / Feature extraction
- MLlib guide
- reference / Collaborative filtering
- ML workflows
- about / ML workflows and Spark pipelines, ML as a step-by-step workflow
- examples / ML workflow examples
- model deployment, Telco Data
- about / Model deployment
- alerts, sending / Rules to send out alerts
- scores, producing / Scores subscribers for churn and for Call Center calls
- purchase propensity, predicting / Scores subscribers for purchase propensity
- model estimation
- about / Model estimation, Model estimation, Model estimation
- MLlib, implementing / MLlib implementation
- R notebooks, implementing / R notebooks implementation
- Spark implementation, with Zeppelin notebook / Spark implementation with the Zeppelin notebook
- Spark implementation, with R notebook / Spark implementation with the R notebook
- model estimation, holistic view
- about / Model estimation
- MLlib implementation / MLlib implementation
- R notebooks implementation / The R notebooks' implementation
- model estimation, open data
- about / Model estimation
- SPSS Analytics Server / SPSS on Spark – SPSS Analytics Server
- model evaluation / Model evaluation
- RMSE, calculating with MLlib / RMSE calculations with MLlib
- RMSE, calculating with R / RMSE calculations with R
- model estimation, recommendation
- about / Model estimation
- SPSS on Spark / SPSS on Spark – the SPSS Analytics server
- model estimation, risk scoring
- about / Model estimation
- DataScientistWorkbench for R Notebooks / The DataScientistWorkbench for R notebooks
- R Notebooks implementation / R notebooks implementation
- model estimation, Telco Data
- about / Model estimation
- SPSS Analytics Server / SPSS on Spark – SPSS Analytics Server
- model evaluation
- about / Model evaluation, Model evaluation, A quick evaluation, Model evaluation
- performing / A quick evaluation
- confusion matrix / Confusion matrix and false positive ratios
- false positive ratios / Confusion matrix and false positive ratios
- confusion matrix and error ratios / The confusion matrix and error ratios
- RMSE calculation, with MLlib / RMSE calculation with MLlib
- RMSE calculation, with R / RMSE calculation with R
- model evaluation, holistic view
- about / Model evaluation
- quick evaluations / Quick evaluations
- RMSE / RMSE
- ROC curves / ROC curves
- model evaluation, recommendation
- about / Model evaluation
- model evaluation, risk scoring
- about / Model evaluation
- confusion matrix / Confusion matrix
- ROC / ROC
- Kolmogorov-Smirnov (KS) / Kolmogorov-Smirnov
- model evaluation, Telco Data
- about / Model evaluation
- RMSE, calculating with MLlib / RMSE calculations with MLlib
- RMSE, calculating with R / RMSE calculations with R
- error ratios, calculating with MLlib / Confusion matrix and error ratios with MLlib and R
- confusion matrix, calculating with R / Confusion matrix and error ratios with MLlib and R
N
- notebook approach
- for machine learning (ML) / Notebook approach for ML
O
- open data
- use case / Spark for learning from open data, The use case
- reference link / The use case
- Spark, computing / Spark computing
- scoring / Methods for scoring and ranking
- ranking / Methods for scoring and ranking
- cluster analysis / Cluster analysis
- principal component analysis (PCA) / Principal component analysis
- regression models / Regression models
- score, resembling / Score resembling
- OpenRefine
- about / OpenRefine
- URL / Data cleaning
P
- PCA
- PipelineStages
- about / ML workflow examples
- Predictive Model Markup Language (PMML) / Deployment, Visualizing trends
- Principal Component Analysis (PCA) / Feature selection
- principal component analysis (PCA)
- about / Principal component analysis
- URL / Principal component analysis
- Principal components analysis (PCA)
- about / Principal components analysis
- Subject knowledge aid / Subject knowledge aid
R
- R
- dataframe API / Dataframes API for R
- dataset reorganization / Dataset reorganizing with R on Spark
- feature extraction / Feature development with R
- used, for RMSE calculation / RMSE calculation with R
- Random forest
- about / Random forest
- reference link / Random forest
- for churn prediction / Decision trees and Random forest
- URL / Decision trees and Random forest
- Receiver Operating Characteristic curve (ROC) / ROC
- recommendation deployment
- about / Recommendation deployment
- recommendations, on Spark
- Spark, for recommendation engine / Apache Spark for a recommendation engine
- regression models
- for churn prediction / Regression models
- linear regression / Regression models, About regression, About regression
- logistic regression / Regression models, About regression, About regression
- about / Regression models, Regression models
- code, preparing for / Preparing for coding
- coding, preparation steps / Preparing for coding
- repeatability
- about / Repeatability and automation
- ReporteRs R package
- Research Methods Four Elements (RM4Es)
- Resilient Distributed Dataset (RDD)
- about / Spark advantages, Spark RDD
- results
- about / Results explanation
- interventions impact, calculating / Calculating the impact of interventions
- main causes impact, calculating / Calculating the impact of main causes
- scoring / Scoring
- explanation / Explanations of the results
- biggest influencers / Biggest influencers
- trends, visualizing / Visualizing trends
- results, open data
- about / Results explanation
- ranks, comparing / Comparing ranks
- impacts, predicting / Biggest influencers
- alerts, sending / Rules for sending out alerts
- school districts, ranking / Scores for ranking school districts
- results, Telco Data
- about / Results explanation
- descriptive statistics / Descriptive statistics and visualizations
- visualizations / Descriptive statistics and visualizations
- impacts, analyzing / Biggest influencers
- insights / Special insights
- trends, visualizing / Visualizing trends
- results explanation
- about / Results explanation
- influencing variables / Big influencers and their impacts
- results explanation, holistic view
- about / Results explanation
- impacts assessments / Impact assessments
- results explanation, risk scoring
- about / Results explanation
- big influencers / Big influencers and their impacts
- risk scoring
- methods / Methods of risk scoring
- R Markdown
- about / Notebook approach for ML
- R studio, downloading / Step 1: Getting the software ready
- Knitr package, installing / Step 2: Installing the Knitr package
- report, creating / Step 3: Creating a simple report
- RMSE (Root-Mean-Square Error)
- about / Model evaluation, RMSE
- example / RMSE
- RMSE calculation
- with MLlib / RMSE calculation with MLlib
- with R / RMSE calculation with R
- R notebook
- references / Apache Spark notebooks
- used, for Spark implementation / Spark implementation with the R notebook
- R notebooks
- implementing, for model estimation / R notebooks implementation
- R Notebooks implementation
- about / R notebooks implementation
- logistic regression / R notebooks implementation
- Random Forest / R notebooks implementation
- decision tree / R notebooks implementation
- ROC (Receiver Operating Characteristic)
- about / Model evaluation
- ROCR
- URL / A quick evaluation
- Root Mean Square Error (RMSE)
- about / Model evaluation, Model evaluation
- R package PMML
- reference / Deployment
- R studio
S
- SampleClean
- used, for data cleaning / Data cleaning made easy
- URL / Data cleaning made easy, Record deduplication
- used, for identity matching / Identity matching made better
- service forecasting, Spark used
- about / Spark for service forecasting
- use case / The use case
- use case, reference links / The use case
- computing / Spark computing
- methods / Methods of service forecasting
- regression models / Regression models
- shared variables
- broadcast variables / The Spark computing framework
- accumulators / The Spark computing framework
- Spark
- overview / Spark overview and Spark advantages, Spark overview
- advantages / Spark overview and Spark advantages, Spark advantages
- URL / Spark overview
- URL, for documentation / Spark overview
- reference link / Spark advantages
- computing, for machine learning / Spark computing for machine learning
- holistic view / Spark for a holistic view
- used, for service forecasting / Spark for service forecasting
- Spark, for recommendation engine
- use case / The use case
- SPSS on Spark / SPSS on Spark
- Spark, for risk scoring
- about / Spark for risk scoring
- use case / The use case
- Apache Spark Notebooks / Apache Spark notebooks
- spark-ts library
- reference link / Preparing for coding
- Spark computing
- about / Spark computing
- Spark computing framework
- about / The Spark computing framework
- Spark dataframe
- about / Spark dataframes
- URL / Spark dataframes
- Spark DataSource API
- Spark implementation
- Zeppelin notebook, using / Spark implementation with the Zeppelin notebook
- R notebook, using / Spark implementation with the R notebook
- Spark MLlib
- feature extraction / Feature development with Spark MLlib
- URL / Feature development with Spark MLlib
- Spark notebooks
- about / Spark notebooks
- notebook approach, for machine learning (ML) / Notebook approach for ML
- Databricks notebook / Spark notebooks
- Spark pipeline
- about / ML workflows and Spark pipelines
- URL / ML workflow examples
- used, for datasets preprocessing / Spark pipelines for dataset preprocessing
- Spark RDD
- SparkSQL
- about / Feature extraction from LogFile
- Spark SQL
- used, for dataset reorganization / Dataset reorganizing with Spark SQL
- URL / Dataset reorganizing with Spark SQL, Dataset joining in Spark
- datasets, joining / Dataset joining and its tool – the Spark SQL
- SPSS Analytics Server
- about / SPSS on Spark – the SPSS Analytics server
- SPSS on Spark / SPSS on Spark
- SQLContext
- Structural Equation Modeling (SEM) / The SEM approach
- SystemML
- about / Other ML libraries
T
- Telco Data
- using / Spark for using Telco Data
- use case / The use case
- Spark, computing for / Spark computing
- machine learning methods / Methods for learning from Telco Data
- time series modeling
- about / Time series modeling
- reference link / About time series
- coding, preparation steps / Preparing for coding
- trends, visualizing
- about / Visualizing trends
- sending out alerts, rules / The rules of sending out alerts
- city zones, ranking scores / Scores to rank city zones
- True Positive (TP) error rate / ROC
Z
- Zeppelin
- URL / Distributed computing
- Zeppelin notebook
- URL / Spark computing
- used, for Spark implementation / Spark implementation with the Zeppelin notebook
- Zeppelin / Apache Spark notebooks