# Index

## A

- Activation Function / Perceptron
- advanced visualization technique
- about / Advanced visualization technique
- prefuse / Prefuse
- IVTK Graph toolkit / IVTK Graph toolkit

- Alternating Least Square (ALS) / Alternating least square – collaborative filtering
- Apache Kafka
- about / Apache Kafka
- IoT sensors, integration / Apache Kafka
- social media real-time analytics / Apache Kafka
- healthcare analytics / Apache Kafka
- log analytics / Apache Kafka
- risk aggregation, in finance / Apache Kafka

- Apache Spark
- about / Apache Spark
- concepts / Concepts
- transformations / Transformations
- actions / Actions
- Spark Java API / Spark Java API
- samples, Java 8 used / Spark samples using Java 8
- data, loading / Loading data
- data operations / Data operations – cleansing and munging
- data, analyzing / Analyzing data – count, projection, grouping, aggregation, and max/min
- common transformations, on Spark RDDs / Analyzing data – count, projection, grouping, aggregation, and max/min
- actions, on RDDs / Actions on RDDs
- paired RDDs / Paired RDDs
- data, saving / Saving data
- results, collecting / Collecting and printing results
- results, printing / Collecting and printing results
- programs, executing on Hadoop / Executing Spark programs on Hadoop
- subprojects / Apache Spark sub-projects
- machine learning modules / Spark machine learning modules
- Apache Mahout / Mahout – a popular Java ML library
- Deeplearning4j / Deeplearning4j – a deep learning library
- Apriori algorithm, implementation / Implementation of the Apriori algorithm in Apache Spark
- FP-Growth algorithm, executing / Running FP-Growth on Apache Spark

- Apache Spark, machine learning modules
- MLlib Java API / MLlib Java API
- machine learning libraries / Other machine learning libraries

- Apache Spark machine learning API
- about / The new Spark ML API
- machine learning algorithms / The new Spark ML API
- features handling tools / The new Spark ML API
- model selection / The new Spark ML API
- tuning tools / The new Spark ML API
- utility methods / The new Spark ML API

- Apriori algorithm
- implementation, in Apache Spark / Implementation of the Apriori algorithm in Apache Spark
- using / Implementation of the Apriori algorithm in Apache Spark
- disadvantages / Implementation of the Apriori algorithm in Apache Spark

- artificial neural network / Introduction to neural networks

## B

- bagging / Bagging
- bag of words / Bag of words
- bar chart
- about / Bar charts
- dataset, creating / Bar charts

- base project setup / Base project setup
- default Kafka configurations, used / Base project setup
- Maven Java project, for Spark Streaming / Base project setup

- Bayes theorem / Bayes theorem
- big data
- Analytical products / Basics of Hadoop – a Java sub-project
- Batch products / Basics of Hadoop – a Java sub-project
- Streamlining / Basics of Hadoop – a Java sub-project
- Machine learning libraries / Basics of Hadoop – a Java sub-project
- NoSQL / Basics of Hadoop – a Java sub-project
- Search / Basics of Hadoop – a Java sub-project

- bidirected graph / Refresher on graphs
- big data
- data analytics on / Why data analytics on big data?
- for data analytics / Big data for analytics
- to bigger pay package, for Java developers / Big data – a bigger pay package for Java developers
- Hadoop, basics / Basics of Hadoop – a Java sub-project

- big data stack
- HDFS / Basics of Hadoop – a Java sub-project
- Spark / Basics of Hadoop – a Java sub-project
- Impala / Basics of Hadoop – a Java sub-project
- MapReduce / Basics of Hadoop – a Java sub-project
- Sqoop / Basics of Hadoop – a Java sub-project
- Oozie / Basics of Hadoop – a Java sub-project
- Flume / Basics of Hadoop – a Java sub-project
- Kafka / Basics of Hadoop – a Java sub-project
- Yarn / Basics of Hadoop – a Java sub-project

- binary classification dataset / What are the feature types that can be extracted from the datasets?
- boosting / Boosting
- bootstrapping / Bagging
- box plots / Box plots

## C

- charts
- used, in big data analytics / Using charts in big data analytics
- for initial data exploration / Using charts in big data analytics
- for data visualization and reporting / Using charts in big data analytics

- clustering
- about / Clustering
- customer segmentation / Clustering
- search engines / Clustering
- data exploration / Clustering
- epidemic breakout zones, finding / Clustering
- biology / Clustering
- news categorization / Clustering
- news, summarization / Clustering
- types / Types of clustering
- hierarchical clustering / Hierarchical clustering
- K-means clustering / K-means clustering
- k-means clustering, bisecting / Bisecting k-means clustering
- for customer segmentation / Clustering for customer segmentation

- clustering algorithm
- changing / Changing the clustering algorithm

- code
- diving / Diving into the code

- cold start problem / Content-based recommendation systems
- collaborative recommendation systems
- about / Collaborative recommendation systems
- advantages / Advantages
- disadvantages / Disadvantages
- collaborative filtering / Alternating least square – collaborative filtering

- common transformations, on Spark RDDs
- Filter / Analyzing data – count, projection, grouping, aggregation, and max/min
- Map / Analyzing data – count, projection, grouping, aggregation, and max/min
- FlatMap / Analyzing data – count, projection, grouping, aggregation, and max/min
- other transformations / Analyzing data – count, projection, grouping, aggregation, and max/min

- Conditional-FP tree / Efficient market basket analysis using FP-Growth algorithm
- Conditional FP Tree / Efficient market basket analysis using FP-Growth algorithm
- Conditional Pattern / Efficient market basket analysis using FP-Growth algorithm
- Conditional Patterns Base / Efficient market basket analysis using FP-Growth algorithm
- conditional probability / Conditional probability
- content-based recommendation systems
- about / Content-based recommendation systems
- Euclidean Distance / Content-based recommendation systems
- Pearson Correlation / Content-based recommendation systems
- dataset / Dataset
- content-based recommender, on MovieLens dataset / Content-based recommender on MovieLens dataset
- collaborative recommendation systems / Collaborative recommendation systems

- content-based recommender
- on MovieLens dataset / Content-based recommender on MovieLens dataset

- context
- building / Building SparkConf and context

- customer segmentation / Customer segmentation
- clustering / Clustering for customer segmentation

## D

- data
- cleaning / Data cleaning and munging, Cleaning and munging the data
- munging / Data cleaning and munging, Cleaning and munging the data
- unwanted data, filtering / Data cleaning and munging
- missing data, handling / Data cleaning and munging
- incomplete data, handling / Data cleaning and munging
- discarding / Data cleaning and munging
- constant value, filling / Data cleaning and munging
- average value, populating / Data cleaning and munging
- nearest neighbor approach / Data cleaning and munging
- converting, to proper format / Data cleaning and munging
- basic analysis, with Spark SQL / Basic analysis of data with Spark SQL
- parsing / Load and parse data
- loading / Load and parse data
- Spark-SQL way / Analyzing data – the Spark-SQL way
- Spark SQL, for data exploration and analytics / Spark SQL for data exploration and analytics
- Apriori algorithm / Market basket analysis – Apriori algorithm
- Full Apriori algorithm / Full Apriori algorithm
- preparing / Preparing the data
- formatting / Formatting the data
- storing / Storing the data

- data analytics
- on big data / Why data analytics on big data?
- distributed computing, on Hadoop / Distributed computing on Hadoop
- HDFS concepts / HDFS concepts
- Apache Spark / Apache Spark

- data exploration
- of text data / Data exploration of text data

- dataframe / Dataframe and datasets
- DataNode / Main components of HDFS
- dataset / Dataset, Dataset
- URL, for downloading / All India seasonal and annual average temperature series dataset
- fields / All India seasonal and annual average temperature series dataset
- data / All India seasonal and annual average temperature series dataset
- reference link / Predicting house prices using linear regression
- data, munging / Data cleaning and munging
- full batch approach / Accuracy of multi-layer perceptrons
- partial batch approach / Accuracy of multi-layer perceptrons

- dataset, linear regression
- data, cleaning / Data cleaning and munging
- exploring / Exploring the dataset
- number of rows / Exploring the dataset
- average price per zipcode, sorting by highest on top / Exploring the dataset
- linear regression model, executing / Running and testing the linear regression model
- linear regression model, testing / Running and testing the linear regression model

- dataset, logistic regression
- data, cleaning / Data cleaning and munging
- data, munging / Data cleaning and munging
- data, missing / Data cleaning and munging
- categorical data / Data cleaning and munging
- data exploration / Data exploration
- executing / Running and testing the logistic regression model
- testing / Running and testing the logistic regression model

- dataset object / Training and testing the model
- datasets / Datasets, Dataframe and datasets
- datasets splitting
- features selected / Choosing the best features for splitting the datasets
- Gini Impurity / Choosing the best features for splitting the datasets

- data transfer techniques
- data visualization
- with Java JFreeChart / Data visualization with Java JFreeChart
- charts, used in big data analytics / Using charts in big data analytics

- decision tree
- about / What is a decision tree?
- for classification / What is a decision tree?
- for regression / What is a decision tree?
- building / Building a decision tree
- datasets splitting, features selected / Choosing the best features for splitting the datasets
- advantages / Advantages of using decision trees
- disadvantages / Disadvantages of using decision trees
- dataset / Dataset
- data exploration / Data exploration
- data, cleaning / Cleaning and munging the data
- data, munging / Cleaning and munging the data
- model, training / Training and testing the model
- model, testing / Training and testing the model

- deep learning
- about / Deep learning
- advantages / Advantages and use cases of deep learning
- use cases / Advantages and use cases of deep learning
- no feature engineering required / Advantages and use cases of deep learning
- accuracy / Advantages and use cases of deep learning
- information / More information on deep learning

- deeplearning4j / Deeplearning4j
- references / Deeplearning4j

- Deeplearning4j
- about / Deeplearning4j – a deep learning library
- data, compressing / Compressing data
- Avro / Avro and Parquet
- Parquet / Avro and Parquet

- distributed computing
- on Hadoop / Distributed computing on Hadoop

## E

- edges / Refresher on graphs
- efficient market basket analysis
- FP-Growth algorithm, used / Efficient market basket analysis using FP-Growth algorithm

- ensembling
- about / Ensembling
- voting / Ensembling
- averaging / Ensembling
- machine learning algorithm, used / Ensembling
- types / Types of ensembling
- bagging / Bagging
- boosting / Boosting
- advantages / Advantages and disadvantages of ensembling
- disadvantages / Advantages and disadvantages of ensembling
- random forest / Random forests
- Gradient boosted trees (GBTs) / Gradient boosted trees (GBTs)

## F

- feature selection
- filter methods / How do you select the best features to train your models?
- Pearson correlation / How do you select the best features to train your models?
- chi-square / How do you select the best features to train your models?
- wrapper method / How do you select the best features to train your models?
- forward selection / How do you select the best features to train your models?
- backward elimination / How do you select the best features to train your models?
- embedded method / How do you select the best features to train your models?

- FP-Growth algorithm
- used, for efficient market basket analysis / Efficient market basket analysis using FP-Growth algorithm
- transaction dataset / Efficient market basket analysis using FP-Growth algorithm
- frequency of items, calculating / Efficient market basket analysis using FP-Growth algorithm
- priority, assigning to items / Efficient market basket analysis using FP-Growth algorithm
- array items, by priority / Efficient market basket analysis using FP-Growth algorithm
- FP-Tree, building / Efficient market basket analysis using FP-Growth algorithm
- frequent patterns, identifying from FP-Tree / Efficient market basket analysis using FP-Growth algorithm
- conditional patterns, mining / Efficient market basket analysis using FP-Growth algorithm
- conditional patterns, from leaf node Diapers / Efficient market basket analysis using FP-Growth algorithm
- executing, on Apache Spark / Running FP-Growth on Apache Spark

- Frequent Item sets / Efficient market basket analysis using FP-Growth algorithm
- Frequent Pattern Mining
- reference link / Running FP-Growth on Apache Spark

- Full Apriori algorithm
- about / Full Apriori algorithm
- dataset / Full Apriori algorithm
- apriori implementation / Full Apriori algorithm

## G

- Gradient boosted trees (GBTs)
- about / Advantages and disadvantages of ensembling, Gradient boosted trees (GBTs)
- dataset, used / Classification problem and dataset used
- issues, classifying / Classification problem and dataset used
- data exploration / Data exploration
- random forest model, training / Training and testing our random forest model
- random forest model, testing / Training and testing our random forest model
- gradient boosted tree model, testing / Training and testing our gradient boosted tree model
- gradient boosted tree model, training / Training and testing our gradient boosted tree model

- graph analytics
- about / Graph analytics
- path analytics / Graph analytics
- connectivity analytics / Graph analytics
- community analytics / Graph analytics
- centrality analytics / Graph analytics
- GraphFrames / GraphFrames
- GraphFrames, used for building a graph / Building a graph using GraphFrames
- on airports / Graph analytics on airports and their flights
- on flights / Graph analytics on airports and their flights
- datasets / Datasets
- on flights data / Graph analytics on flights data

- graphs
- refresher / Refresher on graphs
- representing / Representing graphs
- adjacency matrix / Representing graphs
- adjacency list / Representing graphs
- common terminology / Common terminology on graphs
- common algorithms / Common algorithms on graphs
- plotting / Plotting graphs

- graphs, common algorithms
- breadth first search / Common algorithms on graphs
- depth first search / Common algorithms on graphs
- Dijkstra shortest path / Common algorithms on graphs
- PageRank algorithm / Common algorithms on graphs

- graphs, common terminology
- vertices / Common terminology on graphs
- edges / Common terminology on graphs
- degrees / Common terminology on graphs
- indegrees / Common terminology on graphs
- outdegrees / Common terminology on graphs

- GraphStream library
- reference link / Plotting graphs

## H

- Hadoop
- basics / Basics of Hadoop – a Java sub-project
- features / Basics of Hadoop – a Java sub-project
- distributed computing on / Distributed computing on Hadoop
- core / Distributed computing on Hadoop
- HDFS / Distributed computing on Hadoop

- Hadoop Distributed File System (HDFS)
- about / Distributed computing on Hadoop
- Open Source / Design and architecture of HDFS
- Immense scalability, for amount of data / Design and architecture of HDFS
- failover support / Design and architecture of HDFS
- fault tolerance / Design and architecture of HDFS
- data locality / Design and architecture of HDFS
- NameNode / Main components of HDFS
- DataNode / Main components of HDFS

- handwritten digit recognition
- using CNN / Handwritten digit recognition using CNN

- HBase / Real-time data processing
- HDFS concepts
- about / HDFS concepts
- architecture / Design and architecture of HDFS
- design / Design and architecture of HDFS
- components / Main components of HDFS
- simple commands / HDFS simple commands

- hierarchical clustering / Hierarchical clustering
- histogram
- about / Histograms
- using / When would you use a histogram?
- creating, JFreeChart used / How to make histograms using JFreeChart?

- human neuron
- dendrite / Introduction to neural networks
- cell body / Introduction to neural networks
- axon terminal / Introduction to neural networks

- hyperplane / Scatter plots, What is simple linear regression?

## I

- Impala
- used, for real-time SQL queries / Real-time SQL queries using Impala
- advantages / Real-time SQL queries using Impala
- flight delay analysis / Flight delay analysis using Impala
- Apache Kafka / Apache Kafka
- Spark Streaming / Spark Streaming, Typical uses of Spark Streaming
- trending videos / Trending videos

- Iris dataset
- reference link / Flower species classification using multi-Layer perceptrons

- IVTK Graph toolkit
- about / IVTK Graph toolkit
- other libraries / Other libraries

## J

- JFreeChart API
- dataset loading, Apache Spark used / Simple single Time Series chart
- chart object, creating / Simple single Time Series chart
- dataset object, filling / Bar charts
- chart component, creating / Bar charts

## K

- k-means clustering
- bisecting / Bisecting k-means clustering

- K-means clustering / K-means clustering

## L

- linear regression
- about / Linear regression
- using / Where is linear regression used?
- used, for predicting house prices / Predicting house prices using linear regression
- dataset / Dataset

- line charts / Line charts
- logistic regression
- about / Logistic regression
- mathematical functions, used / Which mathematical functions does logistic regression use?
- Gradient ascent or descent / Which mathematical functions does logistic regression use?
- Stochastic gradient descent / Which mathematical functions does logistic regression use?
- used for / Where is logistic regression used?
- heart disease, predicting / Where is logistic regression used?
- dataset / Dataset

## M

- machine learning
- about / What is machine learning?
- example / Real-life examples of machine learning
- at Netflix / Real-life examples of machine learning
- spam filter / Real-life examples of machine learning
- handwriting detection, on cheques submitted via ATMs / Real-life examples of machine learning
- type / Type of machine learning
- supervised learning / Type of machine learning
- unsupervised learning / Type of machine learning
- semi-supervised learning / Type of machine learning
- supervised learning, case study / A small sample case study of supervised and unsupervised learning
- unsupervised learning, case study / A small sample case study of supervised and unsupervised learning
- issues / Steps for machine learning problems
- model, selecting / Choosing the machine learning model
- training/test set / Choosing the machine learning model
- cross validation / Choosing the machine learning model
- features extracted from datasets / What are the feature types that can be extracted from the datasets?
- categorical features / What are the feature types that can be extracted from the datasets?
- numerical features / What are the feature types that can be extracted from the datasets?
- text features / What are the feature types that can be extracted from the datasets?
- features, selecting to train models / How do you select the best features to train your models?
- analytics, executing on big data / How do you run machine learning analytics on big data?
- data, preparing in Hadoop / Getting and preparing data in Hadoop
- data, obtaining in Hadoop / Getting and preparing data in Hadoop
- models, storing on big data / Training and storing models on big data
- models, training on big data / Training and storing models on big data
- Apache Spark machine learning API / Apache Spark machine learning API

- massive graphs
- on big data / Massive graphs on big data
- graph analytics / Graph analytics
- graph analytics, on airports / Graph analytics on airports and their flights

- maths stats
- mean squared error (MSE) / Bisecting k-means clustering
- median value / Box plots
- MNIST database
- reference link / Handwritten digit recognition using CNN

- model
- selecting / Training and storing models on big data
- training / Training and storing models on big data, Training and testing the model
- storing / Training and storing models on big data
- testing / Training and testing the model

- multi-Layer perceptron
- used, for flower species classification / Flower species classification using multi-Layer perceptrons

- multi-layer perceptron
- about / Multi-layer perceptrons
- accuracy / Accuracy of multi-layer perceptrons

- multiple linear regression / What is simple linear regression?

## N

- N-grams
- NameNode / Main components of HDFS
- Natural Language Processing (NLP) / What are the feature types that can be extracted from the datasets?, Concepts for sentimental analysis
- Naïve Bayes algorithm
- about / Naive Bayes algorithm
- advantages / Advantages of Naive Bayes
- disadvantages / Disadvantages of Naive Bayes

- neural networks / Introduction to neural networks

## O

- OpenFlights airports database
- reference link / Datasets

## P

- paired RDDs
- about / Paired RDDs
- transformations / Transformations on paired RDDs

- perceptron
- about / Perceptron
- issues / Problems with perceptrons
- Logical AND / Problems with perceptrons
- Logical OR / Problems with perceptrons
- sigmoid neuron / Sigmoid neuron
- multi-layer perceptron / Multi-layer perceptrons

- PFP / Running FP-Growth on Apache Spark
- prefuse

## R

- random forest / Random forests
- real-time analytics
- about / Real-time analytics
- fraud analytics / Real-time analytics
- sensor data analysis (Internet of Things) / Real-time analytics
- recommendations, giving to users / Real-time analytics
- in healthcare / Real-time analytics
- ad-processing / Real-time analytics
- big data stack / Big data stack for real-time analytics

- real-time data ingestion / Real-time data ingestion and storage
- Apache Kafka / Real-time data ingestion and storage
- Apache Flume / Real-time data ingestion and storage
- HBase / Real-time data ingestion and storage
- Cassandra / Real-time data ingestion and storage

- real-time data processing / Real-time data processing
- Spark Streaming / Real-time data processing
- Storm / Real-time data processing

- real-time SQL queries
- on big data / Real-time SQL queries on big data
- Impala / Real-time SQL queries on big data
- Apache Drill / Real-time SQL queries on big data
- Impala, used / Real-time SQL queries using Impala

- real-time storage / Real-time data ingestion and storage
- Recency, Frequency, and Monetary (RFM) / Customer segmentation
- recommendation system
- about / Recommendation systems and their types
- types / Recommendation systems and their types
- content-based recommendation systems / Content-based recommendation systems

- Resilient Distributed Dataset (RDD) / Concepts, Dataframe and datasets

## S

- scatter plots / Scatter plots
- sentimental analysis
- about / Sentimental analysis
- concepts / Concepts for sentimental analysis
- tokenization / Tokenization
- stemming / Stemming
- N-grams / N-grams
- term presence / Term presence and Term Frequency
- term frequency / Term presence and Term Frequency
- Term Frequency and Inverse Document Frequency (TF-IDF) / TF-IDF
- bag of words / Bag of words
- dataset / Dataset
- text data, data exploration / Data exploration of text data
- on dataset / Sentimental analysis on this dataset

- sigmoid neuron / Sigmoid neuron
- simple linear regression / Linear regression, What is simple linear regression?
- smoothing factor / Disadvantages of Naive Bayes
- SOLR / Real-time data processing
- SPAM Detector Model / Type of machine learning
- SparkConf
- building / Building SparkConf and context

- Spark ML / Apache Spark machine learning API
- Spark SQL
- used, for basic analysis on data / Basic analysis of data with Spark SQL
- SparkConf, building / Building SparkConf and context
- context, building / Building SparkConf and context
- dataframe / Dataframe and datasets
- datasets / Dataframe and datasets
- data, loading / Load and parse data
- data, parsing / Load and parse data

- Spark Streaming
- about / Spark Streaming, Typical uses of Spark Streaming
- use cases / Typical uses of Spark Streaming
- data collection, in real time / Typical uses of Spark Streaming
- storage, in real time / Typical uses of Spark Streaming
- predictive analytics, in real time / Typical uses of Spark Streaming
- windowed calculations / Typical uses of Spark Streaming
- cumulative calculations / Typical uses of Spark Streaming
- base project setup / Base project setup

- stemming / Stemming
- stop words removal / Stop words removal
- Storm / Spark Streaming
- sum of mean squared errors (SMEs) / Bisecting k-means clustering
- supervised learning
- about / Type of machine learning
- classification / Type of machine learning
- regression / Type of machine learning

- Support Vector Machine (SVM) / SVM or Support Vector Machine

## T

- tendency / Content-based recommendation systems
- term frequency
- about / Term presence and Term Frequency
- example / Term presence and Term Frequency

- Term Frequency and Inverse Document Frequency (TF-IDF) / TF-IDF
- TimeSeries chart
- about / Time Series chart
- All India seasonal / All India seasonal and annual average temperature series dataset
- annual average temperature series dataset / All India seasonal and annual average temperature series dataset
- simple single TimeSeries chart / Simple single Time Series chart
- multiple TimeSeries, on single chart window / Multiple Time Series on a single chart window

- tokenization
- about / Tokenization
- regular expression, used / Tokenization
- pre-trained model, used / Tokenization
- stop words removal / Stop words removal

- trending videos
- about / Trending videos
- sentiment analysis, at real time / Sentiment analysis in real time

## V

- vertexes / Refresher on graphs
- Visualization ToolKit (VTK)
- about / IVTK Graph toolkit
- URL / IVTK Graph toolkit

## W

- windowed calculations / Trending videos