Learning Apache Mahout

Index

A

  • Abalone dataset
    • URL / Feature construction
  • accumulators
    • about / Apache Spark
  • adjusted R-square
    • about / Adjusted R-square
  • AUC
    • about / ROC curve and AUC, Area-based accuracy measure
  • automated feature extraction
    • about / Feature engineering

B

  • backward selection
    • about / Backward selection
  • bagging
    • about / Bagging
  • behavioral-based segmentation
    • about / Customer segmentation
  • bias-variance trade-off
    • about / Bias-variance trade-off
  • bigram
    • about / n-grams
  • binarization
    • about / Binarization
  • binning
    • about / Binning
    • unsupervised binning / Binning
    • supervised binning / Binning
  • boosting
    • about / Boosting
  • broadcast variables
    • about / Apache Spark

C

  • canopy clustering
    • about / Canopy clustering
    • command-line options / Canopy clustering
  • categorical features
    • about / Categorical features
    • categories, merging / Merging categories
    • categories, converting to binary variables / Converting to binary variables
    • categories, converting to continuous variables / Converting to continuous variables
  • centroids
    • determining / Deciding the initial centroid
    • random points, generating / Random points
    • random input points, selecting / Points from the dataset
    • partition, by range / Partition by range
    • canopy centroids, using / Canopy centroids
  • churn analytics
    • about / Churn analytics
    • data, obtaining / Getting the data
    • data exploration / Data exploration
    • feature engineering / Feature engineering
    • model training and validation phase / Model training and validation
  • classification
    • about / A classification example, Supervised learning, Classification
    • common workflow / A classification example
    • confusion matrix / Confusion matrix
    • AUC / ROC curve and AUC
    • ROC curve / ROC curve and AUC
  • cluster analysis
    • about / Cluster analysis
    • objective / Objective
    • feature representation / Feature representation
    • clustering algorithm, using / Algorithm for clustering
    • stopping criteria / A stopping criteria
  • clustering
    • about / A clustering example, Clustering
    • internal evaluation / The internal evaluation
    • external evaluation / The external evaluation
  • clustering, using Java code
    • example / A Mahout Java example
    • k-means, using / k-means
    • cluster evaluation / Cluster evaluation
  • clustering, using Mahout command line
    • example / A Mahout command-line example
    • data, obtaining / Getting the data
    • URL, for dataset / Getting the data
    • data, preprocessing / Preprocessing the data
    • k-means / k-means
    • canopy clustering / Canopy clustering
    • fuzzy k-means / Fuzzy k-means
    • streaming k-means / Streaming k-means
  • clustering algorithm
    • k-means / k-means
    • canopy clustering / Canopy clustering
    • fuzzy k-means / Fuzzy k-means
  • collaborative filtering
    • about / Collaborative filtering, Collaborative filtering
    • cold start / Cold start
    • scalability / Scalability
    • sparsity / Sparsity
    • similarity / Similarity measures
    • recommender systems / Evaluating recommender
    • preferences / Inferring preferences
  • column normalization
    • about / Column normalization
    • rescaling / Rescaling
    • standardization / Standardization
  • command line
    • about / Mahout command line
    • clustering example / A clustering example
    • Reuter's raw data file / Reuter's raw data file
    • example, for k-means clustering / Reuter's raw data file
    • classification example / A classification example
  • command line, Mahout
    • extending / Extending the command line of Mahout
    • used, for implementing LDA / LDA using the Mahout command line
  • confusion matrix
    • about / Confusion matrix
  • content-based filtering
    • about / Content-based filtering
  • continuous features
    • about / Continuous features
    • binning / Binning
    • binarization / Binarization
    • feature standardization / Feature standardization
    • mathematical transformations / Mathematical transformations
  • cosine distance measure
    • about / Cosine distance measure
  • customer segmentation
    • about / Customer segmentation
    • value-based segmentation / Customer segmentation
    • behavioral-based segmentation / Customer segmentation
    • demographic-based segmentation / Customer segmentation
    • preprocessing / Preprocessing

D

  • data exploration, churn analytics
    • R, installing / Installing R
  • Davies-Bouldin index
    • about / The Davies–Bouldin index
  • demographic-based segmentation
    • about / Customer segmentation
  • dense vector
    • about / Initializing a vector inline
  • development environment
    • setting up / Setting up the development environment
    • Maven, configuring / Configuring Maven
    • Mahout, configuring / Configuring Mahout
    • Eclipse, configuring / Configuring Eclipse with the Maven plugin and Mahout
  • dimensionality reduction
    • about / Feature engineering, Dimensionality reduction
  • distance measure
    • about / A notion of similarity and dissimilarity
    • Euclidean distance measure / Euclidean distance measure
    • squared Euclidean distance measure / Squared Euclidean distance measure
    • Manhattan distance measure / Manhattan distance measure
    • cosine distance measure / Cosine distance measure
    • Tanimoto distance measure / Tanimoto distance measure
  • document indexing
    • about / Document indexing
  • DRM
    • about / Basics of Mahout Scala DSL
  • Dunn index
    • about / The Dunn index

E

  • Eclipse
    • configuring / Configuring Eclipse with the Maven plugin and Mahout
    • Mahout source code, importing / Importing the Mahout source code into Eclipse
  • embedded feature selection
    • about / Embedded feature selection
  • Euclidean distance measure
    • about / Euclidean distance measure
  • Euclidean distance similarity
    • about / Euclidean distance similarity
  • evaluation
    • about / Evaluation
    • bias-variance trade-off / Bias-variance trade-off
    • function complexity / Function complexity and amount of training data
    • training data consideration / Function complexity and amount of training data
    • dimensionality, of input space / Dimensionality of the input space
    • noise, in data / Noise in data
  • external evaluation, clustering
    • about / The external evaluation
    • Rand index / The Rand index
    • F-measure / F-measure

F

  • F-measure
    • about / F-measure
  • feature
    • about / Feature engineering
  • feature construction
    • about / Feature construction
    • categorical features / Categorical features
    • continuous features / Continuous features
  • feature engineering
    • about / Feature engineering
    • manual feature construction / Feature engineering
    • automated feature extraction / Feature engineering
    • feature selection / Feature engineering
    • dimensionality reduction / Feature engineering
  • feature extraction
    • about / Feature extraction
    • techniques / Feature extraction
  • feature extraction, customer segmentation
    • day calls / Day calls
    • evening calls / Evening calls
    • international calls / International calls
    • files, preprocessing / Preprocessing the files
  • feature representation
    • about / Feature representation
    • feature normalization / Feature normalization
    • similarity / A notion of similarity and dissimilarity
    • dissimilarity / A notion of similarity and dissimilarity
    • distance measure / A notion of similarity and dissimilarity
  • feature selection
    • about / Feature engineering, Feature selection
    • filter-based feature selection / Filter-based feature selection
    • wrapper-based feature selection / Wrapper-based feature selection
    • embedded feature selection / Embedded feature selection
  • feature standardization
    • about / Feature standardization
    • rescaling / Rescaling
    • mean standardization / Mean standardization
    • scaling / Scaling to unit norm
  • feature transformation
    • about / Feature transformation derived from the problem domain
    • ratios / Ratios
    • frequency / Frequency
    • aggregate transformations / Aggregate transformations
    • normalization / Normalization
  • filter-based feature selection
    • about / Filter-based feature selection
  • fixed size neighborhood
    • about / Fixed size neighborhood
  • forward selection
    • about / Forward selection
  • FP-Growth
    • about / Frequent pattern mining
  • FP Tree
    • about / Frequent pattern mining
    • building / Building FP Tree
    • constructing / Constructing the tree
    • frequent patterns, identifying / Identifying frequent patterns from FP Tree
  • frequent pattern mining
    • about / Frequent pattern mining
    • rules, identifying / Measures for identifying interesting rules
    • considerations / Things to consider
    • FP-Growth / Frequent pattern mining
    • FP Tree / Frequent pattern mining
    • implementing, with Mahout / Frequent pattern mining with Mahout
    • Mahout command line, extending / Extending the command line of Mahout
    • data, obtaining / Getting the data
    • data description / Data description
    • implementing, with Mahout API / Frequent pattern mining with Mahout API
  • frequent pattern mining (FPM)
    • about / Frequent pattern mining with Mahout
  • frequent pattern mining, considerations
    • actionable rules / Actionable rules
    • association, determining / What association to look for
  • frequent pattern mining, rules
    • identifying / Measures for identifying interesting rules
    • support / Support
    • confidence / Confidence
    • lift / Lift
    • conviction / Conviction
  • frequent pattern mining, with Mahout API
    • MapReduce execution / MapReduce execution
    • linear execution / Linear execution
    • results, formatting / Formatting the results and computing metrics
    • metrics, computing / Formatting the results and computing metrics
  • fuzzy k-means
    • about / Fuzzy k-means
    • fuzzy factor, deciding / Deciding the fuzzy factor
    • command-line options / Fuzzy k-means

H

  • Hadoop
    • URL, for configuring / Setting up the development environment
  • Hadoop Distributed File System (HDFS) / Reuter's raw data file
  • holdout-set validation
    • about / Holdout-set validation

I

  • in-core types
    • about / In-core types
    • vector / Vector
    • matrix / Matrix
  • in-memory execution
    • about / Parallel versus in-memory execution mode
    • versus parallel execution / Parallel versus in-memory execution mode
  • installation, R
    • about / Installing R
  • inter-cluster distance
    • about / The inter-cluster distance
  • internal evaluation, clustering
    • about / The internal evaluation
    • intra-cluster distance / The intra-cluster distance
    • inter-cluster distance / The inter-cluster distance
    • Davies-Bouldin index / The Davies–Bouldin index
    • Dunn index / The Dunn index
  • intra-cluster distance
    • about / The intra-cluster distance
  • item-based recommender system
    • about / Item-based recommender system
    • example / Mahout code example
    • recommender, building / Building the recommender
    • recommender, evaluating / Evaluating the recommender

K

  • K-fold cross validation
    • about / K-fold cross validation
  • k-means
    • about / k-means
    • number of clusters, determining / Deciding the number of clusters
    • initial centroid, determining / Deciding the initial centroid
    • advantages and disadvantages / Advantages and disadvantages
    • command-line options / k-means

L

  • LDA
    • used, for topic modeling / Topic modeling using LDA
    • about / Topic modeling using LDA
    • implementing, Mahout command line used / LDA using the Mahout command line
  • linear regression
    • with Mahout Spark / Linear regression with Mahout Spark
  • log-likelihood similarity
    • about / Log-likelihood similarity
  • log-likelihood test
    • about / n-grams

M

  • machine learning
    • supervised learning / Supervised learning
    • unsupervised learning / Unsupervised learning
    • recommender system / Recommender system
    • model efficacy / Model efficacy
  • Mahout
    • advantages / Why Mahout
    • use case / When Mahout
    • development environment, setting up / Setting up the development environment
    • configuring / Configuring Mahout
    • URL / Configuring Mahout, The classification job
    • source code, importing into Eclipse / Importing the Mahout source code into Eclipse
    • frequent pattern mining, implementing / Frequent pattern mining with Mahout
    • Spark, configuring / Configuring Spark with Mahout
  • Mahout, advantages
    • simple techniques / Simple techniques and more data is better
    • better data collection / Simple techniques and more data is better
    • sampling / Sampling is difficult
    • license / Community and license
    • community / Community and license
  • Mahout, use case
    • data too large for single machine / Data too large for single machine
    • data already on Hadoop / Data already on Hadoop
    • algorithms implemented in Mahout / Algorithms implemented in Mahout
  • Mahout API
    • about / Mahout API – a Java program example
    • dataset / The dataset
    • frequent pattern mining, implementing / Frequent pattern mining with Mahout API
  • Mahout Scala DSL
    • about / Basics of Mahout Scala DSL
    • imports / Imports
  • Mahout Spark
    • DRM / Spark Mahout basics
    • linear regression / Linear regression with Mahout Spark
  • Mahout Spark, DRM
    • Spark context, initializing / Initializing the Spark context
    • optimizer actions, performing / Optimizer actions
    • computational actions / Computational actions
    • caching, in Spark's block manager / Caching in Spark's block manager
  • Mahout trunk
    • URL, for latest version / Configuring Spark with Mahout
  • Manhattan distance measure
    • about / Manhattan distance measure
  • manual feature construction
    • about / Feature engineering
  • MapReduce
    • limitations / Moving beyond MapReduce
  • mathematical transformations
    • about / Mathematical transformations
  • matrix
    • about / Matrix
    • initializing / Initializing the matrix
    • elements, accessing / Accessing elements of a matrix
    • column, setting / Setting the matrix column
    • copy by reference / Copy by reference
  • Maven
    • configuring / Configuring Maven
    • URL / Configuring Maven
  • mean absolute error
    • about / Mean absolute error
  • model, training
    • bagging / Bagging
    • boosting / Boosting
  • model efficacy
    • about / Model efficacy
    • classification / Classification
    • regression / Regression
    • recommendation system / Recommendation system
    • clustering / Clustering
  • model training and validation phase, churn analytics
    • logistic regression / Logistic regression
    • adaptive logistic regression / Adaptive logistic regression
    • random forest / Random forest

N

  • n-grams
    • about / n-grams
  • normalization
    • about / Normalization
  • normalization, feature
    • about / Feature normalization
    • row normalization / Row normalization
    • column normalization / Column normalization

O

  • ordinary least square (OLS)
    • about / Linear regression with Mahout Spark

P

  • p-norm
    • about / Normalization
  • parallel execution
    • about / Parallel versus in-memory execution mode
    • versus in-memory execution / Parallel versus in-memory execution mode
  • patsy library
    • about / Converting to binary variables
  • Pearson correlation similarity
    • about / Pearson correlation similarity
  • precision
    • about / Precision and recall
  • preferences
    • about / Inferring preferences
  • preprocessing, customer segmentation
    • feature extraction / Feature extraction
    • clusters, creating with Fuzzy k-means / Creating the clusters using fuzzy k-means
    • clustering, with k-means / Clustering using k-means
    • evaluation / Evaluation

R

  • R
    • installing / Installing R
    • summary statistics, viewing / Summary statistics
    • correlation, calculating / Correlation
  • R-square
    • about / R-square
  • Rand index
    • about / The Rand index
  • recommendation system
    • about / Recommendation system
    • score difference / Score difference
    • precision and recall / Precision and recall
  • recommender system
    • about / Recommender system, Evaluating recommender
    • collaborative filtering / Collaborative filtering
    • content-based filtering / Content-based filtering
    • evaluating / Evaluating recommender
    • user-based recommender system / User-based recommender system
    • item-based recommender system / Item-based recommender system
  • recursive feature elimination
    • about / Recursive feature elimination
  • regression
    • about / Supervised learning, Regression
    • mean absolute error / Mean absolute error
    • root mean squared error (RMSE) / Root mean squared error
    • R-square / R-square
    • adjusted R-square / Adjusted R-square
  • relative squared error (RSE)
    • about / Root mean squared error
  • rescaling, feature
    • about / Rescaling
  • resilient distributed dataset (RDD)
    • about / Apache Spark
  • ROC curve
    • about / ROC curve and AUC
    • used, for evaluating classifier / Evaluating classifier using the ROC curve
    • area-based accuracy measure / Area-based accuracy measure
    • Euclidean distance comparison / Euclidian distance comparison
    • example / Example
  • ROC graphs
    • features / Features of ROC graphs
  • root mean squared error (RMSE)
    • about / Root mean squared error
  • row normalization
    • about / Row normalization

S

  • score difference
    • about / Score difference
  • shared variables
    • about / Apache Spark
    • broadcast variables / Apache Spark
    • accumulators / Apache Spark
  • similarity
    • about / Similarity measures
    • Pearson correlation similarity / Pearson correlation similarity
    • Euclidean distance similarity / Euclidean distance similarity
    • computing, without preference value / Computing similarity without a preference value
    • Tanimoto coefficient similarity / Tanimoto coefficient similarity
    • log-likelihood similarity / Log-likelihood similarity
  • source code, Mahout
    • importing, into Eclipse / Importing the Mahout source code into Eclipse
  • Spark
    • about / Apache Spark
    • configuring, with Mahout / Configuring Spark with Mahout
    • Mahout Scala DSL / Basics of Mahout Scala DSL
  • sparse vector
    • about / Initializing a vector inline
  • squared Euclidean distance measure
    • about / Squared Euclidean distance measure
  • standard generalized markup language (SGML) / Reuter's raw data file
  • standardization, feature
    • about / Standardization
  • stemming
    • about / Stemming
  • stop words
    • removing / Stop word removal
  • streaming k-means
    • command-line options / Streaming k-means
  • subversion (svn)
    • about / Configuring Spark with Mahout
  • supervised binning
    • about / Binning
  • supervised learning
    • about / Supervised learning
    • regression / Supervised learning
    • classification / Supervised learning
    • objective, determining / Determine the objective
    • training data, determining / Decide the training data
    • training set, creating / Create and clean the training set
    • training set, cleaning / Create and clean the training set
    • feature extraction / Feature extraction
    • model, training / Train the models
    • validation / Validation
    • evaluation / Evaluation

T

  • Tanimoto coefficient similarity
    • about / Tanimoto coefficient similarity
  • Tanimoto distance measure
    • about / Tanimoto distance measure
  • term frequency (TF)
    • about / Document indexing
  • text, categorizing
    • about / Categorizing text
    • dataset / The dataset
    • dataset, URL / The dataset
    • feature extraction / Feature extraction
    • example / The classification job
  • text, clustering
    • about / Clustering text
    • dataset / The dataset
    • feature extraction / Feature extraction
    • example / The clustering job
  • text, preprocessing
    • tokenization / Tokenization
    • stop word removal / Stop word removal
    • stemming / Stemming
    • example / Preprocessing example
  • text analytics
    • about / Text analytics
    • VSM / Vector space model
  • TF-IDF weighting
    • about / TF-IDF weighting
  • threshold-based neighborhood
    • about / Threshold-based neighborhood
  • tokenization
    • about / Tokenization
  • topic modeling
    • LDA, using / Topic modeling using LDA
  • trigrams
    • about / n-grams

U

  • unigram
    • about / n-grams
  • unsupervised binning
    • about / Binning
  • unsupervised learning
    • about / Unsupervised learning
    • cluster analysis / Cluster analysis
    • frequent pattern mining / Frequent pattern mining
  • user-based recommender system
    • about / User-based recommender system
    • user neighborhood / User neighborhood
    • dataset / The dataset
    • URL, for dataset / The dataset
    • example / Mahout code example
    • recommender, building / Building the recommender
    • recommender, evaluating / Evaluating the recommender
  • user neighborhood
    • about / User neighborhood
    • fixed size neighborhood / Fixed size neighborhood
    • threshold-based neighborhood / Threshold-based neighborhood

V

  • validation
    • about / Validation
    • holdout-set validation / Holdout-set validation
    • K-fold cross validation / K-fold cross validation
  • value-based segmentation
    • about / Customer segmentation
  • vector
    • about / Vector
    • dense vector / Initializing a vector inline
    • initializing / Initializing a vector inline
    • sparse vector / Initializing a vector inline
    • elements, accessing / Accessing elements of a vector
    • element values, setting / Setting values of an element
    • arithmetic operations, performing / Vector arithmetic
    • arithmetic operations, performing with scalar / Vector operations with a scalar
  • VSM
    • text, preprocessing / Preprocessing
    • document indexing / Document indexing
    • TF-IDF weighting / TF-IDF weighting
    • n-grams / n-grams
    • normalization / Normalization

W

  • wrapper-based feature selection
    • about / Wrapper-based feature selection
    • backward selection / Backward selection
    • forward selection / Forward selection
    • recursive feature elimination / Recursive feature elimination