Index
A
- Abalone dataset
- URL / Feature construction
- accumulators
- adjusted R-square
- about / Adjusted R-square
- AUC
- about / ROC curve and AUC, Area-based accuracy measure
- automated feature extraction
- about / Feature engineering
B
- backward selection
- about / Backward selection
- bagging
- behavioral-based segmentation
- about / Customer segmentation
- bias-variance trade-off
- about / Bias-variance trade-off
- bigram
- binarization
- binning
- about / Binning
- unsupervised binning / Binning
- supervised binning / Binning
- boosting
- broadcast variables
C
- canopy clustering
- about / Canopy clustering
- command-line options / Canopy clustering
- categorical features
- about / Categorical features
- categories, merging / Merging categories
- categories, converting to binary variables / Converting to binary variables
- categories, converting to continuous variables / Converting to continuous variables
- centroids
- determining / Deciding the initial centroid
- random points, generating / Random points
- random input points, selecting / Points from the dataset
- partition, by range / Partition by range
- canopy centroids, using / Canopy centroids
- churn analytics
- about / Churn analytics
- data, obtaining / Getting the data
- data exploration / Data exploration
- feature engineering / Feature engineering
- model training and validation phase / Model training and validation
- classification
- about / A classification example, Supervised learning, Classification
- common workflow / A classification example
- confusion matrix / Confusion matrix
- AUC / ROC curve and AUC
- ROC curve / ROC curve and AUC
- cluster analysis
- about / Cluster analysis
- objective / Objective
- feature representation / Feature representation
- clustering algorithm, using / Algorithm for clustering
- stopping criteria / A stopping criteria
- clustering
- about / A clustering example, Clustering
- internal evaluation / The internal evaluation
- external evaluation / The external evaluation
- clustering, using Java code
- example / A Mahout Java example
- k-means, using / k-means
- cluster evaluation / Cluster evaluation
- clustering, using Mahout command line
- example / A Mahout command-line example
- data, obtaining / Getting the data
- URL, for dataset / Getting the data
- data, preprocessing / Preprocessing the data
- k-means / k-means
- canopy clustering / Canopy clustering
- fuzzy k-means / Fuzzy k-means
- streaming k-means / Streaming k-means
- clustering algorithm
- k-means / k-means
- canopy clustering / Canopy clustering
- fuzzy k-means / Fuzzy k-means
- collaborative filtering
- about / Collaborative filtering, Collaborative filtering
- cold start / Cold start
- scalability / Scalability
- sparsity / Sparsity
- similarity / Similarity measures
- recommender systems / Evaluating recommender
- preferences / Inferring preferences
- column normalization
- about / Column normalization
- rescaling / Rescaling
- standardization / Standardization
- command line
- about / Mahout command line
- clustering example / A clustering example
- Reuter's raw data file / Reuter's raw data file
- example, for k-means clustering / Reuter's raw data file
- classification example / A classification example
- command line, Mahout
- extending / Extending the command line of Mahout
- used, for implementing LDA / LDA using the Mahout command line
- confusion matrix
- content-based filtering
- about / Content-based filtering
- continuous features
- about / Continuous features
- binning / Binning
- binarization / Binarization
- feature standardization / Feature standardization
- mathematical transformations / Mathematical transformations
- cosine distance measure
- about / Cosine distance measure
- customer segmentation
- about / Customer segmentation
- value-based segmentation / Customer segmentation
- behavioral-based segmentation / Customer segmentation
- demographic-based segmentation / Customer segmentation
- preprocessing / Preprocessing
D
- data exploration, churn analytics
- R, installing / Installing R
- Davies-Bouldin index
- about / The Davies–Bouldin index
- demographic-based segmentation
- about / Customer segmentation
- dense vector
- about / Initializing a vector inline
- development environment
- setting up / Setting up the development environment
- Maven, configuring / Configuring Maven
- Mahout, configuring / Configuring Mahout
- Eclipse, configuring / Configuring Eclipse with the Maven plugin and Mahout
- dimensionality reduction
- about / Feature engineering, Dimensionality reduction
- distance measure
- about / A notion of similarity and dissimilarity
- Euclidean distance measure / Euclidean distance measure
- squared Euclidean distance measure / Squared Euclidean distance measure
- Manhattan distance measure / Manhattan distance measure
- cosine distance measure / Cosine distance measure
- Tanimoto distance measure / Tanimoto distance measure
- document indexing
- about / Document indexing
- DRM
- about / Basics of Mahout Scala DSL
- Dunn index
E
- Eclipse
- configuring / Configuring Eclipse with the Maven plugin and Mahout
- Mahout source code, importing / Importing the Mahout source code into Eclipse
- embedded feature selection
- about / Embedded feature selection
- Euclidean distance measure
- about / Euclidean distance measure
- Euclidean distance similarity
- about / Euclidean distance similarity
- evaluation
- about / Evaluation
- bias-variance trade-off / Bias-variance trade-off
- function complexity / Function complexity and amount of training data
- training data consideration / Function complexity and amount of training data
- dimensionality, of input space / Dimensionality of the input space
- noise, in data / Noise in data
- external evaluation, clustering
- about / The external evaluation
- Rand index / The Rand index
- F-measure / F-measure
F
- F-measure
- feature
- about / Feature engineering
- feature construction
- about / Feature construction
- categorical features / Categorical features
- continuous features / Continuous features
- feature engineering
- about / Feature engineering
- manual feature construction / Feature engineering
- automated feature extraction / Feature engineering
- feature selection / Feature engineering
- dimensionality reduction / Feature engineering
- feature extraction
- about / Feature extraction
- techniques / Feature extraction
- feature extraction, customer segmentation
- day calls / Day calls
- evening calls / Evening calls
- international calls / International calls
- files, preprocessing / Preprocessing the files
- feature representation
- about / Feature representation
- feature normalization / Feature normalization
- similarity / A notion of similarity and dissimilarity
- dissimilarity / A notion of similarity and dissimilarity
- distance measure / A notion of similarity and dissimilarity
- feature selection
- about / Feature engineering, Feature selection
- filter-based feature selection / Filter-based feature selection
- wrapper-based feature selection / Wrapper-based feature selection
- embedded feature selection / Embedded feature selection
- feature standardization
- about / Feature standardization
- rescaling / Rescaling
- mean standardization / Mean standardization
- scaling / Scaling to unit norm
- feature transformation
- about / Feature transformation derived from the problem domain
- ratios / Ratios
- frequency / Frequency
- aggregate transformations / Aggregate transformations
- normalization / Normalization
- filter-based feature selection
- about / Filter-based feature selection
- fixed size neighborhood
- about / Fixed size neighborhood
- forward selection
- about / Forward selection
- FP-Growth
- about / Frequent pattern mining
- FP Tree
- about / Frequent pattern mining
- building / Building FP Tree
- constructing / Constructing the tree
- frequent patterns, identifying / Identifying frequent patterns from FP Tree
- frequent pattern mining
- about / Frequent pattern mining
- rules, identifying / Measures for identifying interesting rules
- considerations / Things to consider
- FP-Growth / Frequent pattern mining
- FP Tree / Frequent pattern mining
- implementing, with Mahout / Frequent pattern mining with Mahout
- Mahout command line, extending / Extending the command line of Mahout
- data, obtaining / Getting the data
- data description / Data description
- implementing, with Mahout API / Frequent pattern mining with Mahout API
- frequent pattern mining (FPM)
- about / Frequent pattern mining with Mahout
- frequent pattern mining, considerations
- actionable rules / Actionable rules
- association, determining / What association to look for
- frequent pattern mining, rules
- identifying / Measures for identifying interesting rules
- support / Support
- confidence / Confidence
- lift / Lift
- conviction / Conviction
- frequent pattern mining, with Mahout API
- MapReduce execution / MapReduce execution
- linear execution / Linear execution
- results, formatting / Formatting the results and computing metrics
- metrics, computing / Formatting the results and computing metrics
- fuzzy k-means
- about / Fuzzy k-means
- fuzzy factor, deciding / Deciding the fuzzy factor
- command-line options / Fuzzy k-means
H
- Hadoop
- URL, for configuring / Setting up the development environment
- Hadoop Distributed File System (HDFS) / Reuter's raw data file
- holdout-set validation
- about / Holdout-set validation
I
- in-core types
- about / In-core types
- vector / Vector
- matrix / Matrix
- in-memory execution
- about / Parallel versus in-memory execution mode
- versus parallel execution / Parallel versus in-memory execution mode
- installation, R
- inter-cluster distance
- about / The inter-cluster distance
- internal evaluation, clustering
- about / The internal evaluation
- intra-cluster distance / The intra-cluster distance
- inter-cluster distance / The inter-cluster distance
- Davies-Bouldin index / The Davies–Bouldin index
- Dunn index / The Dunn index
- intra-cluster distance
- about / The intra-cluster distance
- item-based recommender system
- about / Item-based recommender system
- example / Mahout code example
- recommender, building / Building the recommender
- recommender, evaluating / Evaluating the recommender
K
- K-fold cross validation
- about / K-fold cross validation
- k-means
- about / k-means
- number of clusters, determining / Deciding the number of clusters
- initial centroid, determining / Deciding the initial centroid
- advantages and disadvantages / Advantages and disadvantages
- command-line options / k-means
L
- LDA
- used, for topic modeling / Topic modeling using LDA
- about / Topic modeling using LDA
- implementing, Mahout command line used / LDA using the Mahout command line
- linear regression
- with Mahout Spark / Linear regression with Mahout Spark
- log-likelihood similarity
- about / Log-likelihood similarity
- log-likelihood test
M
- machine learning
- supervised learning / Supervised learning
- unsupervised learning / Unsupervised learning
- recommender system / Recommender system
- model efficacy / Model efficacy
- Mahout
- advantages / Why Mahout
- use case / When Mahout
- development environment, setting up / Setting up the development environment
- configuring / Configuring Mahout
- URL / Configuring Mahout, The classification job
- source code, importing into Eclipse / Importing the Mahout source code into Eclipse
- frequent pattern mining, implementing / Frequent pattern mining with Mahout
- Spark, configuring / Configuring Spark with Mahout
- Mahout, advantages
- simple techniques / Simple techniques and more data is better
- better data collection / Simple techniques and more data is better
- sampling / Sampling is difficult
- license / Community and license
- community / Community and license
- Mahout, use case
- data too large for single machine / Data too large for single machine
- data already on Hadoop / Data already on Hadoop
- algorithms implemented in Mahout / Algorithms implemented in Mahout
- Mahout API
- about / Mahout API – a Java program example
- dataset / The dataset
- frequent pattern mining, implementing / Frequent pattern mining with Mahout API
- Mahout Scala DSL
- about / Basics of Mahout Scala DSL
- imports / Imports
- Mahout Spark
- DRM / Spark Mahout basics
- linear regression / Linear regression with Mahout Spark
- Mahout Spark, DRM
- Spark context, initializing / Initializing the Spark context
- optimizer actions, performing / Optimizer actions
- computational actions / Computational actions
- caching, in Spark's block manager / Caching in Spark's block manager
- Mahout trunk
- URL, for latest version / Configuring Spark with Mahout
- Manhattan distance measure
- about / Manhattan distance measure
- manual feature construction
- about / Feature engineering
- MapReduce
- limitations / Moving beyond MapReduce
- mathematical transformations
- about / Mathematical transformations
- matrix
- about / Matrix
- initializing / Initializing the matrix
- elements, accessing / Accessing elements of a matrix
- column, setting / Setting the matrix column
- copy by reference / Copy by reference
- Maven
- configuring / Configuring Maven
- URL / Configuring Maven
- mean absolute error
- about / Mean absolute error
- model, training
- bagging / Bagging
- boosting / Boosting
- model efficacy
- about / Model efficacy
- classification / Classification
- regression / Regression
- recommendation system / Recommendation system
- clustering / Clustering
- model training and validation phase, churn analytics
- logistic regression / Logistic regression
- adaptive logistic regression / Adaptive logistic regression
- random forest / Random forest
N
- n-grams
- normalization
- normalization, feature
- about / Feature normalization
- row normalization / Row normalization
- column normalization / Column normalization
O
- ordinary least square (OLS)
- about / Linear regression with Mahout Spark
P
- p-norm
- parallel execution
- about / Parallel versus in-memory execution mode
- versus in-memory execution / Parallel versus in-memory execution mode
- patsy library
- about / Converting to binary variables
- Pearson correlation similarity
- about / Pearson correlation similarity
- precision
- about / Precision and recall
- preferences
- about / Inferring preferences
- preprocessing, customer segmentation
- feature extraction / Feature extraction
- clusters, creating with Fuzzy k-means / Creating the clusters using fuzzy k-means
- clustering, with k-means / Clustering using k-means
- evaluation / Evaluation
R
- R
- installing / Installing R
- summary statistics, viewing / Summary statistics
- correlation, calculating / Correlation
- R-square
- Rand index
- recommendation system
- about / Recommendation system
- score difference / Score difference
- precision and recall / Precision and recall
- recommender system
- about / Recommender system, Evaluating recommender
- collaborative filtering / Collaborative filtering
- content-based filtering / Content-based filtering
- evaluating / Evaluating recommender
- user-based recommender system / User-based recommender system
- item-based recommender system / Item-based recommender system
- recursive feature elimination
- about / Recursive feature elimination
- regression
- about / Supervised learning, Regression
- mean absolute error / Mean absolute error
- root mean squared error (RMSE) / Root mean squared error
- R-square / R-square
- adjusted R-square / Adjusted R-square
- relative squared error (RSE)
- about / Root mean squared error
- rescaling, feature
- resilient distributed dataset (RDD)
- ROC curve
- about / ROC curve and AUC
- used, for evaluating classifier / Evaluating classifier using the ROC curve
- area-based accuracy measure / Area-based accuracy measure
- Euclidean distance comparison / Euclidean distance comparison
- example / Example
- ROC graphs
- features / Features of ROC graphs
- root mean squared error (RMSE)
- about / Root mean squared error
- row normalization
- about / Row normalization
S
- score difference
- shared variables
- about / Apache Spark
- broadcast variables / Apache Spark
- accumulators / Apache Spark
- similarity
- about / Similarity measures
- Pearson correlation similarity / Pearson correlation similarity
- Euclidean distance similarity / Euclidean distance similarity
- computing, without preference value / Computing similarity without a preference value
- Tanimoto coefficient similarity / Tanimoto coefficient similarity
- log-likelihood similarity / Log-likelihood similarity
- source code, Mahout
- importing, into Eclipse / Importing the Mahout source code into Eclipse
- Spark
- about / Apache Spark
- configuring, with Mahout / Configuring Spark with Mahout
- Mahout Scala DSL / Basics of Mahout Scala DSL
- sparse vector
- about / Initializing a vector inline
- squared Euclidean distance measure
- about / Squared Euclidean distance measure
- standard generalized markup language (SGML) / Reuter's raw data file
- standardization, feature
- stemming
- stop words
- removing / Stop word removal
- streaming k-means
- command-line options / Streaming k-means
- subversion (svn)
- about / Configuring Spark with Mahout
- supervised binning
- supervised learning
- about / Supervised learning
- regression / Supervised learning
- classification / Supervised learning
- objective, determining / Determine the objective
- training data, determining / Decide the training data
- training set, creating / Create and clean the training set
- training set, cleaning / Create and clean the training set
- feature extraction / Feature extraction
- model, training / Train the models
- validation / Validation
- evaluation / Evaluation
T
- Tanimoto coefficient similarity
- about / Tanimoto coefficient similarity
- Tanimoto distance measure
- about / Tanimoto distance measure
- term frequency (TF)
- about / Document indexing
- text, categorizing
- about / Categorizing text
- dataset / The dataset
- dataset, URL / The dataset
- feature extraction / Feature extraction
- example / The classification job
- text, clustering
- about / Clustering text
- dataset / The dataset
- feature extraction / Feature extraction
- example / The clustering job
- text, preprocessing
- tokenization / Tokenization
- stop word removal / Stop word removal
- stemming / Stemming
- example / Preprocessing example
- text analytics
- about / Text analytics
- VSM / Vector space model
- TF-IDF weighting
- threshold-based neighborhood
- about / Threshold-based neighborhood
- tokenization
- topic modeling
- LDA, using / Topic modeling using LDA
- trigrams
U
- unigram
- unsupervised binning
- unsupervised learning
- about / Unsupervised learning
- cluster analysis / Cluster analysis
- frequent pattern mining / Frequent pattern mining
- user-based recommender system
- about / User-based recommender system
- user neighborhood / User neighborhood
- dataset / The dataset
- URL, for dataset / The dataset
- example / Mahout code example
- recommender, building / Building the recommender
- recommender, evaluating / Evaluating the recommender
- user neighborhood
- about / User neighborhood
- fixed size neighborhood / Fixed size neighborhood
- threshold-based neighborhood / Threshold-based neighborhood
V
- validation
- about / Validation
- holdout-set validation / Holdout-set validation
- K-fold cross validation / K-fold cross validation
- value-based segmentation
- about / Customer segmentation
- vector
- about / Vector
- dense vector / Initializing a vector inline
- initializing / Initializing a vector inline
- sparse vector / Initializing a vector inline
- elements, accessing / Accessing elements of a vector
- element values, setting / Setting values of an element
- arithmetic operations, performing / Vector arithmetic
- arithmetic operations, performing with scalar / Vector operations with a scalar
- VSM
- text, preprocessing / Preprocessing
- document indexing / Document indexing
- TF-IDF weighting / TF-IDF weighting
- n-grams / n-grams
- normalization / Normalization
W
- wrapper-based feature selection
- about / Wrapper-based feature selection
- backward selection / Backward selection
- forward selection / Forward selection
- recursive feature elimination / Recursive feature elimination