Index
A
- Apache Mahout
- about / Open source or commercial, Apache Mahout
- features / Apache Mahout
- setting up / Setting up Apache Mahout
- URL, for downloading latest release / Setting up Apache Mahout
- URL, for information on setup / Setting up Apache Mahout
- working / How Apache Mahout works?
- high-level design / The high-level design
- distribution / The distribution
- reasons, for using / When is it appropriate to use Apache Mahout?
- references / The Apache Mahout references
- with Hadoop / Apache Mahout with Hadoop
- Apache Spark integration, with Mahout
- application manager
- about / The application manager
- application master
- about / The application master
- Area Under the Curve (AUC) / The area under the curve
B
- ball K-Means step
- about / The ball K-Means step
- parameters / The ball K-Means step
- batch processing
- versus stream processing / Batch processing versus stream processing
- Baum-Welch algorithm
- about / The Baum Welch algorithm
- code example / A code example
- parameters / The important parameters
- binary logistic regression / Multinomial logistic regression versus binary logistic regression
- business, machine learning
- about / Business
- market segmentation (clustering) / Market segmentation (clustering)
- stock market predictions (regression) / Stock market predictions (regression)
C
- Canopy clustering
- about / Canopy clustering
- reference link / Canopy clustering
- classification
- versus regression / Classification versus regression
- clustering
- about / Unsupervised learning and clustering
- types / Types of clustering
- hard clustering, versus soft clustering / Hard clustering versus soft clustering
- flat clustering, versus hierarchical clustering / Flat clustering versus hierarchical clustering
- model-based clustering / Model-based clustering
- clustering, applications
- about / Applications of clustering
- computer vision / Computer vision and image processing
- image processing / Computer vision and image processing
- clustering algorithms
- about / Additional clustering algorithms
- Canopy clustering / Canopy clustering
- Fuzzy K-Means / Fuzzy K-Means
- streaming K-Means / Streaming K-Means
- spectral clustering / Spectral clustering
- Dirichlet clustering / Dirichlet clustering
- clustering performance
- optimizing / Optimizing clustering performance
- right features, selecting / Selecting the right features
- right algorithms, selecting / Selecting the right algorithms
- right distance measure, selecting / Selecting the right distance measure
- clusters, evaluating / Evaluating clusters
- initialization of centroids / The initialization of centroids and the number of clusters
- parameters, tuning up / Tuning up parameters
- decision on infrastructure / The decision on infrastructure
- cluster visualization
- about / Cluster visualization
- reference link / Cluster visualization
- cold start problem
- collaborative filtering
- versus content-based filtering / Collaborative versus content-based filtering, Collaborative filtering
- about / Collaborative filtering
- collocations
- about / N-grams and collocations
- reference link / N-grams and collocations
- comma-separated values (CSV) / Data models
- commands/scripts
- used, for monitoring Hadoop / Commands/scripts
- components, HDFS
- name node / Managing storage with HDFS
- data node / Managing storage with HDFS
- secondary node / Managing storage with HDFS
- computer-aided diagnosis (CAD) / Computer vision and image processing
- configuration files
- *-default.xml / Configuration changes
- *-site.xml / Configuration changes
- reference link / Configuration changes
- confusion matrix / The confusion matrix
- containers
- about / Containers
- content-based filtering
- versus collaborative filtering / Collaborative versus content-based filtering, Collaborative filtering
- about / Content-based filtering
- continuous / Classification versus regression
- Cosine distance / Distance measure
- custom distance measure
- writing / Writing a custom distance measure
D
- D3.js
- about / D3.js
- URL, for tutorials / A visualization example for K-Means clustering
- D3.js JavaScript file
- URL, for downloading / A visualization example for K-Means clustering
- data models, user-based recommenders / Data models
- data node
- about / Managing storage with HDFS
- data nodes
- used, for monitoring Hadoop / Data nodes
- Dirichlet clustering
- about / Dirichlet clustering
- discrete / Classification versus regression
- distance measure
- about / Distance measure
- distributed mode, Hadoop
- setting up / Setting up Mahout in Hadoop distributed mode
- pseudo-distributed mode / Setting up Mahout in Hadoop distributed mode, The pseudo-distributed mode
- fully-distributed mode / Setting up Mahout in Hadoop distributed mode, The fully-distributed mode
- prerequisites / Prerequisites
- Hadoop user, creating / Creating a Hadoop user
- passwordless SSH configuration, enabling / Passwordless SSH configuration
- Distributed Row Matrix (DRM) / Why is Mahout shifting from Hadoop MapReduce to Spark?
- distribution, Apache Mahout / The distribution
E
- eigenvectors / Spectral clustering
- Euclidean distance / Distance measure
- evaluation techniques, user-based recommenders
- about / Evaluation techniques
- IR-based method (precision/recall) / The IR-based method (precision/recall)
- example script, linear regression with Apache Spark
- about / An example script
- distributed row matrix (DRM) / Distributed row matrix
- code explanation / An explanation of the code
- drmParallelize / An explanation of the code
- dense / An explanation of the code
- drmData.collect / An explanation of the code
- t() operation / An explanation of the code
- solve / An explanation of the code
F
- Fast-moving Consumer Goods (FMCG) / Market segmentation (clustering)
- flat clustering
- versus hierarchical clustering / Flat clustering versus hierarchical clustering
- fsimage file
- about / Managing storage with HDFS
- fully-distributed mode, Hadoop
- about / The fully-distributed mode
- prerequisites / Prerequisites
- host file, configuration / Host file configuration
- Hadoop configuration changes / Hadoop configuration changes
- DFS filesystem, formatting / Formatting the DFS filesystem
- servers, starting / Starting servers
- Mahout, setting up / Setting up Mahout with Hadoop's fully-distributed mode
- Fuzzy K-Means algorithm
- about / Fuzzy K-Means
- reference link / Fuzzy K-Means
G
- Gradient Descent (GD) / Minimizing the cost function
H
- H2O
- in-memory data processing / In-memory data processing with Spark and H2O
- Hadoop
- used, with Apache Mahout / Apache Mahout with Hadoop
- YARN / YARN with MapReduce 2.0
- storage, managing with HDFS / Managing storage with HDFS
- setting up / Setting up Hadoop
- URL / Setting up Mahout in Hadoop distributed mode
- monitoring / Monitoring Hadoop
- monitoring, with commands/scripts / Commands/scripts
- monitoring, with data nodes / Data nodes
- monitoring, with node managers / Node managers
- monitoring, with Web UIs / Web UIs
- troubleshooting / Troubleshooting Hadoop
- optimization tips / Optimization tips
- Hadoop, setting up with Mahout
- in local mode / Setting up Mahout in local mode
- in distributed mode / Setting up Mahout in Hadoop distributed mode
- Hadoop application
- life cycle / The life cycle of a Hadoop application
- Hadoop Distributed File System (HDFS) / Problems with Hadoop MapReduce
- Hadoop MapReduce
- issues / Problems with Hadoop MapReduce
- Hadoop MapReduce, to Spark
- shifting, reason / Why is Mahout shifting from Hadoop MapReduce to Spark?
- hard clustering
- versus soft clustering / Hard clustering versus soft clustering
- HDFS
- about / Managing storage with HDFS
- components / Managing storage with HDFS
- HDFS (Data Storage)
- about / Apache Mahout with Hadoop
- health care, machine learning
- about / Health care
- mammogram cancer tissue detection, using / Using a mammogram for cancer tissue detection
- Hidden Markov Model (HMM)
- about / Hidden Markov Model
- real-world example / A real-world example – developing a POS tagger using HMM supervised learning
- for POS tagging / HMM for POS tagging
- hidden states / HMM for POS tagging
- observed states / HMM for POS tagging
- transition matrix / HMM for POS tagging
- emission matrix / HMM for POS tagging
- implementing, in Apache Mahout / HMM implementation in Apache Mahout
- supervised learning / HMM supervised learning
- high-level design, Apache Mahout / The high-level design
- HMM, for POS tagging
- about / HMM for POS tagging
- hybrid filtering
- about / Hybrid filtering
I
- in-memory data processing, H2O / In-memory data processing with Spark and H2O
- in-memory data processing, Spark / In-memory data processing with Spark and H2O
- inaccurate recommendation results
- issues, addressing with / Addressing the issues with inaccurate recommendation results
- information retrieval, machine learning
- about / Information retrieval
- IR-based method (precision/recall)
- issues
- addressing, with inaccurate recommendation results / Addressing the issues with inaccurate recommendation results
- item-based recommenders
- about / Item-based recommenders
- with Spark / Item-based recommenders with Spark
J
- Java programming
- used, for running K-Means / Running K-Means using Java programming
K
- K-Means
- running, with Java programming / Running K-Means using Java programming
- data, preparing / Data preparation
- parameters / Understanding important parameters
- K-Means clustering
- about / K-Means clustering
- implementing / Getting your hands dirty!
- with MapReduce / K-Means clustering with MapReduce
- visualization example / A visualization example for K-Means clustering
L
- linear regression, with Apache Spark
- about / Linear regression with Apache Spark
- working / How does linear regression work?
- real-world example / A real-world example
- with one variable and multiple variables / Linear regression with one variable and multiple variables
- Apache Spark integration / The integration of Apache Spark
- Apache Spark, setting up with Apache Mahout / Setting up Apache Spark with Apache Mahout
- example script / An example script
- Mahout references / Mahout references
- bias-variance trade-off / The bias-variance trade-off
- over-fitting, avoiding / How to avoid over-fitting and under-fitting
- under-fitting, avoiding / How to avoid over-fitting and under-fitting
- local mode, Hadoop
- setting up / Setting up Mahout in local mode
- prerequisites / Prerequisites
- Java, installation / Java installation
- logistic regression, with SGD
- about / Logistic regression with SGD
- applying / Logistic regression with SGD
- logistic functions / Logistic functions
- cost function, minimizing / Minimizing the cost function
- binary logistic regression / Multinomial logistic regression versus binary logistic regression
- multinomial logistic regression / Multinomial logistic regression versus binary logistic regression
- real-world example / A real-world example
- example script / An example script
- testing / Testing and evaluation
- evaluating / Testing and evaluation
- confusion matrix / The confusion matrix
- Area Under the Curve (AUC) / The area under the curve
- Lucene
- text, preprocessing with / Preprocessing text with Lucene
M
- machine learning
- about / Machine learning in a nutshell
- URL, for course / Machine learning in a nutshell
- features / Features
- supervised learning, versus unsupervised learning / Supervised learning versus unsupervised learning
- history / The story so far
- visualization, significance / The significance of visualization in machine learning
- machine learning applications
- about / Machine learning applications
- information retrieval / Information retrieval
- business / Business
- health care / Health care
- machine learning libraries
- about / Machine learning libraries
- open source / Open source or commercial
- commercial / Open source or commercial
- scalability / Scalability
- language used / Languages used
- algorithm support / Algorithm support
- batch processing, versus stream processing / Batch processing versus stream processing
- Mahout
- setting up, with Hadoop's fully-distributed mode / Setting up Mahout with Hadoop's fully-distributed mode
- Mallet
- about / Open source or commercial
- mammogram cancer tissue detection
- Manhattan distance / Distance measure
- map function / The map function
- MapReduce
- about / The distribution
- MapReduce, for machine learning
- reference link / Apache Mahout
- MapReduce, in Apache Mahout / MapReduce in Apache Mahout
- MapReduce, K-Means clustering / K-Means clustering with MapReduce
- MapReduce 2.0
- about / YARN with MapReduce 2.0
- market segmentation (clustering) / Market segmentation (clustering)
- MATLAB
- about / Open source or commercial
- matrix factorization-based recommenders
- about / Matrix factorization-based recommenders
- alternating least squares / Alternative least squares
- measures
- Kappa statistic / Text classification using Naïve Bayes – a MapReduce implementation with Hadoop
- reliability / Text classification using Naïve Bayes – a MapReduce implementation with Hadoop
- precision / Text classification using Naïve Bayes – a MapReduce implementation with Hadoop
- recall / Text classification using Naïve Bayes – a MapReduce implementation with Hadoop
- F1 measure / Text classification using Naïve Bayes – a MapReduce implementation with Hadoop
- MLlib
- about / Open source or commercial
- model-based clustering / Model-based clustering
- model-based prediction
- about / Model-based prediction
- Naïve Bayes example / Model-based prediction
- movie recommendations
- real-world example / A real-world example – movie recommendations
- multinomial logistic regression / Multinomial logistic regression versus binary logistic regression
N
- 20 newsgroups dataset
- N-grams
- about / N-grams and collocations
- Naive Bayes algorithm
- Markov chain / The Markov chain
- Named Entity Recognition (NER) / POS tagging
- name node
- about / Managing storage with HDFS
- Natural Language Processing (NLP) tasks / POS tagging
- Naïve Bayes algorithm
- about / The Naïve Bayes algorithm
- Bayes theorem / The Bayes theorem
- text classification / Text classification
- Naïve assumption / Naïve assumption and its pros and cons in text classification
- improvements, by Apache Mahout / Improvements that Apache Mahout has made to the Naïve Bayes classification
- text classification coding example / A text classification coding example using the 20 newsgroups' example
- nearest neighbour algorithm / The neighborhood
- neighborhood algorithm, user-based recommenders / The neighborhood
- nearest neighbour algorithm / The neighborhood
- ThresholdUserNeighborhood / The neighborhood
- node manager
- about / A node manager
- node managers
- used, for monitoring Hadoop / Node managers
O
- Online Gradient Descent / Minimizing the cost function
- OpenCV
- about / Open source or commercial
P
- parameters
- org.apache.hadoop.fs.Path / Understanding important parameters
- org.apache.hadoop.conf.Configuration / Understanding important parameters
- org.apache.mahout.common.distance.DistanceMeasure / Understanding important parameters
- K / Understanding important parameters
- convergenceDelta / Understanding important parameters
- maxIterations / Understanding important parameters
- runClustering / Understanding important parameters
- runSequential / Understanding important parameters
- Part Of Speech (POS) tagging
- about / POS tagging
- predictive analytics techniques
- about / Predictive analytics' techniques
- regression-based prediction / Regression-based prediction
- model-based prediction / Model-based prediction
- tree-based prediction / Tree-based prediction
- predictor variables
- pseudo-distributed mode, Hadoop
- about / The pseudo-distributed mode
- configuration changes / Configuration changes
- DFS filesystem, formatting / Formatting the DFS filesystem
- servers, starting / Starting the servers
R
- real-world example, linear regression with Apache Spark
- about / A real-world example
- impact of smoking on mortality and diseases / The impact of smoking on mortality and different diseases
- Receiver Operating Characteristic (ROC) / The area under the curve
- recommenders / Recommenders
- reduce function / The reduce function
- regression
- versus classification / Classification versus regression
- regression-based prediction
- about / Regression-based prediction
- linear regression / Regression-based prediction
- Stochastic Gradient Descent (SGD) example / Regression-based prediction
- logistic regression / Regression-based prediction
- resource manager
- about / The resource manager
S
- Scalable Vector Graphics (SVG)
- about / D3.js
- secondary node
- about / Managing storage with HDFS
- similarity measure, user-based recommenders / The similarity measure
- similarity measures
- EuclideanDistanceSimilarity / The similarity measure
- TanimotoCoefficientSimilarity / The similarity measure
- LogLikelihoodSimilarity / The similarity measure
- SpearmanCorrelationSimilarity / The similarity measure
- UncenteredCosineSimilarity / The similarity measure
- Singular Value Decomposition (SVD)
- using / Singular value decomposition
- usage tips and tricks / Algorithm usage tips and tricks
- socioeconomic status
- soft clustering
- versus hard clustering / Hard clustering versus soft clustering
- Spark
- in-memory data processing / In-memory data processing with Spark and H2O
- item-based recommenders / Item-based recommenders with Spark
- spectral clustering algorithm
- about / Spectral clustering
- reference link / Spectral clustering
- Squared Euclidean distance / Distance measure
- stock market predictions (regression) / Stock market predictions (regression)
- streaming K-Means
- about / Streaming K-Means
- steps / The streaming step
- ball K-Means step / The ball K-Means step
- stream processing
- versus batch processing / Batch processing versus stream processing
- subcomponents, YARN
- resource manager / The resource manager
- application manager / The application manager
- node manager / A node manager
- application master / The application master
- containers / Containers
- supervised learning
- versus unsupervised learning / Supervised learning versus unsupervised learning
- about / Supervised learning versus unsupervised learning, Supervised learning
- target variable / Target variables and predictor variables
- predictor variables / Target variables and predictor variables
- supervised learning, HMM
- about / HMM supervised learning
- nrOfHiddenStates parameter / The important parameters
- nrOfOutputStates parameter / The important parameters
- hiddenSequences parameter / The important parameters
- observedSequences parameter / The important parameters
- pseudoCount parameter / The important parameters
- returns / Returns
T
- Tanimoto distance / Distance measure
- target variables
- Term Frequency (TF) / The vector space model and TF-IDF
- text
- preprocessing, with Lucene / Preprocessing text with Lucene
- text classification, using Naïve Bayes
- MapReduce implementation, with Hadoop / Text classification using Naïve Bayes – a MapReduce implementation with Hadoop
- Spark implementation / Text classification using Naïve Bayes – the Spark implementation
- text classification coding example, Naïve Bayes algorithm
- about / A text classification coding example using the 20 newsgroups' example
- 20 newsgroups dataset / Understand the 20 newsgroups' dataset
- text clustering
- about / Text clustering
- vector space model / The vector space model and TF-IDF
- N-grams / N-grams and collocations
- collocations / N-grams and collocations
- text clustering, with K-Means clustering / Text clustering with the K-Means algorithm
- TF-IDF / The vector space model and TF-IDF
- ThresholdUserNeighborhood / The neighborhood
- topic modeling
- about / Topic modeling
- reference link / Topic modeling
- trainlogistic function
- input parameter / An example script
- output parameter / An example script
- target parameter / An example script
- categories parameter / An example script
- predictors parameter / An example script
- types parameter / An example script
- features parameter / An example script
- outcome / An example script
- tree-based prediction
- about / Tree-based prediction
- examples / Tree-based prediction
U
- unsupervised learning
- versus supervised learning / Supervised learning versus unsupervised learning
- about / Supervised learning versus unsupervised learning, Unsupervised learning and clustering
- user-based recommenders
- about / User-based recommenders, Recommenders
- real-world example, on movie recommendation site / A real-world example – movie recommendations
- data models / Data models
- similarity measure / The similarity measure
- neighborhood algorithm / The neighborhood
- evaluation techniques / Evaluation techniques
V
- vector space model / The vector space model and TF-IDF
- visualization, in machine learning
- significance / The significance of visualization in machine learning
- visualization example, K-Means clustering
- Viterbi evaluator
- about / The Viterbi evaluator
W
- Web UIs
- used, for monitoring Hadoop / Web UIs
- weighted distance measure / Distance measure
Y
- YARN
- about / Apache Mahout with Hadoop
- with MapReduce 2.0 / YARN with MapReduce 2.0
- subcomponents / YARN with MapReduce 2.0
- YARN (Data processing)
- about / Apache Mahout with Hadoop