Learning Apache Mahout

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Introduction to Mahout

Core Concepts in Machine Learning

Supervised learning

Unsupervised learning

Recommender system

Feature Engineering

Feature engineering

Classification with Mahout

Logistic regression

Adaptive regression model

Code example with logistic regression

Naïve Bayes classifier

Frequent Pattern Mining and Topic Modeling

Frequent pattern mining

Importing the Mahout source code into Eclipse

Frequent pattern mining with Mahout

Recommendation with Mahout

Collaborative filtering

Clustering with Mahout

Canopy clustering

A Mahout command-line example

A Mahout Java example

New Paradigm in Mahout

Moving beyond MapReduce

Spark Mahout basics

Linear regression with Mahout Spark

Case Study – Churn Analytics and Customer Segmentation

Churn analytics

Case Study – Text Analytics

Clustering text

Categorizing text

Index

Index

A

Abalone dataset
- URL / Feature construction
accumulators
- about / Apache Spark
adjusted R-square
- about / Adjusted R-square
AUC
- about / ROC curve and AUC, Area-based accuracy measure
automated feature extraction
- about / Feature engineering

B

backward selection
- about / Backward selection
bagging
- about / Bagging
behavioral-based segmentation
- about / Customer segmentation
bias-variance trade-off
- about / Bias-variance trade-off
bigram
- about / n-grams
binarization
- about / Binarization
binning
- about / Binning
- unsupervised binning / Binning
- supervised binning / Binning
boosting
- about / Boosting
broadcast variables
- about / Apache Spark

C

canopy clustering
- about / Canopy clustering
- command-line options / Canopy clustering
categorical features
- about / Categorical features
- categories, merging / Merging categories
- categories, converting to binary variables / Converting to binary variables
- categories, converting to continuous variables / Converting to continuous variables
centroids
- determining / Deciding the initial centroid
- random points, generating / Random points
- random input points, selecting / Points from the dataset
- partition, by range / Partition by range
- canopy centroids, using / Canopy centroids
churn analytics
- about / Churn analytics
- data, obtaining / Getting the data
- data exploration / Data exploration
- feature engineering / Feature engineering
- model training and validation phase / Model training and validation
classification
- about / A classification example, Supervised learning, Classification
- common workflow / A classification example
- confusion matrix / Confusion matrix
- AUC / ROC curve and AUC
- ROC curve / ROC curve and AUC
cluster analysis
- about / Cluster analysis
- objective / Objective
- feature representation / Feature representation
- clustering algorithm, using / Algorithm for clustering
- stopping criteria / A stopping criteria
clustering
- about / A clustering example, Clustering
- internal evaluation / The internal evaluation
- external evaluation / The external evaluation
clustering, using Java code
- example / A Mahout Java example
- k-means, using / k-means
- cluster evaluation / Cluster evaluation
clustering, using Mahout command line
- example / A Mahout command-line example
- data, obtaining / Getting the data
- URL, for dataset / Getting the data
- data, preprocessing / Preprocessing the data
- k-means / k-means
- canopy clustering / Canopy clustering
- fuzzy k-means / Fuzzy k-means
- streaming k-means / Streaming k-means
clustering algorithm
- k-means / k-means
- canopy clustering / Canopy clustering
- fuzzy k-means / Fuzzy k-means
collaborative filtering
- about / Collaborative filtering, Collaborative filtering
- cold start / Cold start
- scalability / Scalability
- sparsity / Sparsity
- similarity / Similarity measures
- recommender systems / Evaluating recommender
- preferences / Inferring preferences
column normalization
- about / Column normalization
- rescaling / Rescaling
- standardization / Standardization
command line
- about / Mahout command line
- clustering example / A clustering example
- Reuter's raw data file / Reuter's raw data file
- example, for k-means clustering / Reuter's raw data file
- classification example / A classification example
command line, Mahout
- extending / Extending the command line of Mahout
- used, for implementing LDA / LDA using the Mahout command line
confusion matrix
- about / Confusion matrix
content-based filtering
- about / Content-based filtering
continuous features
- about / Continuous features
- binning / Binning
- binarization / Binarization
- feature standardization / Feature standardization
- mathematical transformations / Mathematical transformations
cosine distance measure
- about / Cosine distance measure
customer segmentation
- about / Customer segmentation
- value-based segmentation / Customer segmentation
- behavioral-based segmentation / Customer segmentation
- demographic-based segmentation / Customer segmentation
- preprocessing / Preprocessing

D

data exploration, churn analytics
- R, installing / Installing R
Davies-Bouldin index
- about / The Davies–Bouldin index
demographic-based segmentation
- about / Customer segmentation
dense vector
- about / Initializing a vector inline
development environment
- setting up / Setting up the development environment
- Maven, configuring / Configuring Maven
- Mahout, configuring / Configuring Mahout
- Eclipse, configuring / Configuring Eclipse with the Maven plugin and Mahout
dimensionality reduction
- about / Feature engineering, Dimensionality reduction
distance measure
- about / A notion of similarity and dissimilarity
- Euclidean distance measure / Euclidean distance measure
- squared Euclidean distance measure / Squared Euclidean distance measure
- Manhattan distance measure / Manhattan distance measure
- cosine distance measure / Cosine distance measure
- Tanimoto distance measure / Tanimoto distance measure
document indexing
- about / Document indexing
DRM
- about / Basics of Mahout Scala DSL
Dunn index
- about / The Dunn index

E

Eclipse
- configuring / Configuring Eclipse with the Maven plugin and Mahout
- Mahout source code, importing / Importing the Mahout source code into Eclipse
embedded feature selection
- about / Embedded feature selection
Euclidean distance measure
- about / Euclidean distance measure
Euclidean distance similarity
- about / Euclidean distance similarity
evaluation
- about / Evaluation
- bias-variance trade-off / Bias-variance trade-off
- function complexity / Function complexity and amount of training data
- training data consideration / Function complexity and amount of training data
- dimensionality, of input space / Dimensionality of the input space
- noise, in data / Noise in data
external evaluation, clustering
- about / The external evaluation
- Rand index / The Rand index
- F-measure / F-measure

F

F-measure
- about / F-measure
feature
- about / Feature engineering
feature construction
- about / Feature construction
- categorical features / Categorical features
- continuous features / Continuous features
feature engineering
- about / Feature engineering
- manual feature construction / Feature engineering
- automated feature extraction / Feature engineering
- feature selection / Feature engineering
- dimensionality reduction / Feature engineering
feature extraction
- about / Feature extraction
- techniques / Feature extraction
feature extraction, customer segmentation
- day calls / Day calls
- evening calls / Evening calls
- international calls / International calls
- files, preprocessing / Preprocessing the files
feature representation
- about / Feature representation
- feature normalization / Feature normalization
- similarity / A notion of similarity and dissimilarity
- dissimilarity / A notion of similarity and dissimilarity
- distance measure / A notion of similarity and dissimilarity
feature selection
- about / Feature engineering, Feature selection
- filter-based feature selection / Filter-based feature selection
- wrapper-based feature selection / Wrapper-based feature selection
- embedded feature selection / Embedded feature selection
feature standardization
- about / Feature standardization
- rescaling / Rescaling
- mean standardization / Mean standardization
- scaling / Scaling to unit norm
feature transformation
- about / Feature transformation derived from the problem domain
- ratios / Ratios
- frequency / Frequency
- aggregate transformations / Aggregate transformations
- normalization / Normalization
filter-based feature selection
- about / Filter-based feature selection
fixed size neighborhood
- about / Fixed size neighborhood
forward selection
- about / Forward selection
FP-Growth
- about / Frequent pattern mining
FP Tree
- about / Frequent pattern mining
- building / Building FP Tree
- constructing / Constructing the tree
- frequent patterns, identifying / Identifying frequent patterns from FP Tree
frequent pattern mining
- about / Frequent pattern mining
- rules, identifying / Measures for identifying interesting rules
- considerations / Things to consider
- FP-Growth / Frequent pattern mining
- FP Tree / Frequent pattern mining
- implementing, with Mahout / Frequent pattern mining with Mahout
- Mahout command line, extending / Extending the command line of Mahout
- data, obtaining / Getting the data
- data description / Data description
- implementing, with Mahout API / Frequent pattern mining with Mahout API
frequent pattern mining (FPM)
- about / Frequent pattern mining with Mahout
frequent pattern mining, considerations
- actionable rules / Actionable rules
- association, determining / What association to look for
frequent pattern mining, rules
- identifying / Measures for identifying interesting rules
- support / Support
- confidence / Confidence
- lift / Lift
- conviction / Conviction
frequent pattern mining, with Mahout API
- MapReduce execution / MapReduce execution
- linear execution / Linear execution
- results, formatting / Formatting the results and computing metrics
- metrics, computing / Formatting the results and computing metrics
fuzzy k-means
- about / Fuzzy k-means
- fuzzy factor, deciding / Deciding the fuzzy factor
- command-line options / Fuzzy k-means

H

Hadoop
- URL, for configuring / Setting up the development environment
Hadoop Distributed File System (HDFS) / Reuter's raw data file
holdout-set validation
- about / Holdout-set validation

I

in-core types
- about / In-core types
- vector / Vector
- matrix / Matrix
in-memory execution
- about / Parallel versus in-memory execution mode
- versus parallel execution / Parallel versus in-memory execution mode
installation, R
- about / Installing R
inter-cluster distance
- about / The inter-cluster distance
internal evaluation, clustering
- about / The internal evaluation
- intra-cluster distance / The intra-cluster distance
- inter-cluster distance / The inter-cluster distance
- Davies-Bouldin index / The Davies–Bouldin index
- Dunn index / The Dunn index
intra-cluster distance
- about / The intra-cluster distance
item-based recommender system
- about / Item-based recommender system
- example / Mahout code example
- recommender, building / Building the recommender
- recommender, evaluating / Evaluating the recommender

K

K-fold cross validation
- about / K-fold cross validation
k-means
- about / k-means
- number of clusters, determining / Deciding the number of clusters
- initial centroid, determining / Deciding the initial centroid
- advantages and disadvantages / Advantages and disadvantages
- command-line options / k-means

L

LDA
- used, for topic modeling / Topic modeling using LDA
- about / Topic modeling using LDA
- implementing, Mahout command line used / LDA using the Mahout command line
linear regression
- with Mahout Spark / Linear regression with Mahout Spark
log-likelihood similarity
- about / Log-likelihood similarity
log-likelihood test
- about / n-grams

M

machine learning
- supervised learning / Supervised learning
- unsupervised learning / Unsupervised learning
- recommender system / Recommender system
- model efficacy / Model efficacy
Mahout
- advantages / Why Mahout
- use case / When Mahout
- development environment, setting up / Setting up the development environment
- configuring / Configuring Mahout
- URL / Configuring Mahout, The classification job
- source code, importing into Eclipse / Importing the Mahout source code into Eclipse
- frequent pattern mining, implementing / Frequent pattern mining with Mahout
- Spark, configuring / Configuring Spark with Mahout
Mahout, advantages
- simple techniques / Simple techniques and more data is better
- better data collection / Simple techniques and more data is better
- sampling / Sampling is difficult
- license / Community and license
- community / Community and license
Mahout, use case
- data too large for single machine / Data too large for single machine
- data already on Hadoop / Data already on Hadoop
- algorithms implemented in Mahout / Algorithms implemented in Mahout
Mahout API
- about / Mahout API – a Java program example
- dataset / The dataset
- frequent pattern mining, implementing / Frequent pattern mining with Mahout API
Mahout Scala DSL
- about / Basics of Mahout Scala DSL
- imports / Imports
Mahout Spark
- DRM / Spark Mahout basics
- linear regression / Linear regression with Mahout Spark
Mahout Spark, DRM
- Spark context, initializing / Initializing the Spark context
- optimizer actions, performing / Optimizer actions
- computational actions / Computational actions
- caching, in Spark's block manager / Caching in Spark's block manager
Mahout trunk
- URL, for latest version / Configuring Spark with Mahout
Manhattan distance measure
- about / Manhattan distance measure
manual feature construction
- about / Feature engineering
MapReduce
- limitations / Moving beyond MapReduce
mathematical transformations
- about / Mathematical transformations
matrix
- about / Matrix
- initializing / Initializing the matrix
- elements, accessing / Accessing elements of a matrix
- column, setting / Setting the matrix column
- copy by reference / Copy by reference
Maven
- configuring / Configuring Maven
- URL / Configuring Maven
mean absolute error
- about / Mean absolute error
model, training
- bagging / Bagging
- boosting / Boosting
model efficacy
- about / Model efficacy
- classification / Classification
- regression / Regression
- recommendation system / Recommendation system
- clustering / Clustering
model training and validation phase, churn analytics
- logistic regression / Logistic regression
- adaptive logistic regression / Adaptive logistic regression
- random forest / Random forest

N

n-grams
- about / n-grams
normalization
- about / Normalization
normalization, feature
- about / Feature normalization
- row normalization / Row normalization
- column normalization / Column normalization

O

ordinary least squares (OLS)
- about / Linear regression with Mahout Spark

P

p-norm
- about / Normalization
parallel execution
- about / Parallel versus in-memory execution mode
- versus in-memory execution / Parallel versus in-memory execution mode
patsy library
- about / Converting to binary variables
Pearson correlation similarity
- about / Pearson correlation similarity
precision
- about / Precision and recall
preferences
- about / Inferring preferences
preprocessing, customer segmentation
- feature extraction / Feature extraction
- clusters, creating with fuzzy k-means / Creating the clusters using fuzzy k-means
- clustering, with k-means / Clustering using k-means
- evaluation / Evaluation

R

R
- installing / Installing R
- summary statistics, viewing / Summary statistics
- correlation, calculating / Correlation
R-square
- about / R-square
Rand index
- about / The Rand index
recommendation system
- about / Recommendation system
- score difference / Score difference
- precision and recall / Precision and recall
recommender system
- about / Recommender system, Evaluating recommender
- collaborative filtering / Collaborative filtering
- content-based filtering / Content-based filtering
- evaluating / Evaluating recommender
- user-based recommender system / User-based recommender system
- item-based recommender system / Item-based recommender system
recursive feature elimination
- about / Recursive feature elimination
regression
- about / Supervised learning, Regression
- mean absolute error / Mean absolute error
- root mean squared error (RMSE) / Root mean squared error
- R-square / R-square
- adjusted R-square / Adjusted R-square
relative squared error (RSE)
- about / Root mean squared error
rescaling, feature
- about / Rescaling
resilient distributed dataset (RDD)
- about / Apache Spark
ROC curve
- about / ROC curve and AUC
- used, for evaluating classifier / Evaluating classifier using the ROC curve
- area-based accuracy measure / Area-based accuracy measure
- Euclidean distance comparison / Euclidean distance comparison
- example / Example
ROC graphs
- features / Features of ROC graphs
root mean squared error (RMSE)
- about / Root mean squared error
row normalization
- about / Row normalization

S

score difference
- about / Score difference
shared variables
- about / Apache Spark
- broadcast variables / Apache Spark
- accumulators / Apache Spark
similarity
- about / Similarity measures
- Pearson correlation similarity / Pearson correlation similarity
- Euclidean distance similarity / Euclidean distance similarity
- computing, without preference value / Computing similarity without a preference value
- Tanimoto coefficient similarity / Tanimoto coefficient similarity
- log-likelihood similarity / Log-likelihood similarity
source code, Mahout
- importing, into Eclipse / Importing the Mahout source code into Eclipse
Spark
- about / Apache Spark
- configuring, with Mahout / Configuring Spark with Mahout
- Mahout Scala DSL / Basics of Mahout Scala DSL
sparse vector
- about / Initializing a vector inline
squared Euclidean distance measure
- about / Squared Euclidean distance measure
standard generalized markup language (SGML) / Reuter's raw data file
standardization, feature
- about / Standardization
stemming
- about / Stemming
stop words
- removing / Stop word removal
streaming k-means
- command-line options / Streaming k-means
subversion (svn)
- about / Configuring Spark with Mahout
supervised binning
- about / Binning
supervised learning
- about / Supervised learning
- regression / Supervised learning
- classification / Supervised learning
- objective, determining / Determine the objective
- training data, determining / Decide the training data
- training set, creating / Create and clean the training set
- training set, cleaning / Create and clean the training set
- feature extraction / Feature extraction
- model, training / Train the models
- validation / Validation
- evaluation / Evaluation

T

Tanimoto coefficient similarity
- about / Tanimoto coefficient similarity
Tanimoto distance measure
- about / Tanimoto distance measure
term frequency (TF)
- about / Document indexing
text, categorizing
- about / Categorizing text
- dataset / The dataset
- dataset, URL / The dataset
- feature extraction / Feature extraction
- example / The classification job
text, clustering
- about / Clustering text
- dataset / The dataset
- feature extraction / Feature extraction
- example / The clustering job
text, preprocessing
- tokenization / Tokenization
- stop word removal / Stop word removal
- stemming / Stemming
- example / Preprocessing example
text analytics
- about / Text analytics
- VSM / Vector space model
TF-IDF weighting
- about / TF-IDF weighting
threshold-based neighborhood
- about / Threshold-based neighborhood
tokenization
- about / Tokenization
topic modeling
- LDA, using / Topic modeling using LDA
trigrams
- about / n-grams

U

unigram
- about / n-grams
unsupervised binning
- about / Binning
unsupervised learning
- about / Unsupervised learning
- cluster analysis / Cluster analysis
- frequent pattern mining / Frequent pattern mining
user-based recommender system
- about / User-based recommender system
- user neighborhood / User neighborhood
- dataset / The dataset
- URL, for dataset / The dataset
- example / Mahout code example
- recommender, building / Building the recommender
- recommender, evaluating / Evaluating the recommender
user neighborhood
- about / User neighborhood
- fixed size neighborhood / Fixed size neighborhood
- threshold-based neighborhood / Threshold-based neighborhood

V

validation
- about / Validation
- holdout-set validation / Holdout-set validation
- K-fold cross validation / K-fold cross validation
value-based segmentation
- about / Customer segmentation
vector
- about / Vector
- dense vector / Initializing a vector inline
- initializing / Initializing a vector inline
- sparse vector / Initializing a vector inline
- elements, accessing / Accessing elements of a vector
- element values, setting / Setting values of an element
- arithmetic operations, performing / Vector arithmetic
- arithmetic operations, performing with scalar / Vector operations with a scalar
VSM
- text, preprocessing / Preprocessing
- document indexing / Document indexing
- TF-IDF weighting / TF-IDF weighting
- n-grams / n-grams
- normalization / Normalization

W

wrapper-based feature selection
- about / Wrapper-based feature selection
- backward selection / Backward selection
- forward selection / Forward selection
- recursive feature elimination / Recursive feature elimination