Index
A
- Abstract Syntax Tree (AST) / RDD Transformations versus Dataset and DataFrames Transformations, Optimization
- Accelerated Failure Time (AFT)
- about / Machine learning with SparkR
- Access Control Lists (ACLs)
- about / Features of HDFS
- accumulators / Shared variables
- actions
- about / Transformations and actions
- example / Transformations and actions
- collect / Actions
- count / Actions
- describe / Actions
- show / Actions
- take / Actions
- advanced sources
- about / Advanced sources
- Kafka / Advanced sources
- file stream / Advanced sources
- Kinesis / Advanced sources
- Twitter / Advanced sources
- ZeroMQ / Advanced sources
- MQTT / Advanced sources
- Alternating Least Squares (ALS)
- Amazon Web Services (AWS)
- about / Advanced sources
- Ambari service
- used, for installing Apache Zeppelin / Ambari service
- analytics, with DataFrames
- about / Analytics with DataFrames
- analytics, with Dataset API / Analytics with the Dataset API
- Apache Giraph
- Apache Hadoop
- about / Introducing Apache Hadoop
- components / Introducing Apache Hadoop
- adoption drivers / Introducing Apache Hadoop
- characteristics / Introducing Apache Hadoop
- Hadoop Distributed File System (HDFS) / Hadoop Distributed File System
- MapReduce (MR) / MapReduce
- Yet Another Resource Negotiator (YARN) / YARN
- storage options / Storage options on Hadoop
- features / Hadoop features
- Apache Hama
- Apache Mahout
- Apache NiFi
- for dataflows / Introducing Apache NiFi for dataflows
- dataflow challenges, resolving / Introducing Apache NiFi for dataflows
- installing / Installing Apache NiFi
- analytics / Dataflows and analytics with NiFi
- dataflows, handling / Dataflows and analytics with NiFi
- Apache Spark
- about / Introducing Apache Spark, What is Apache Spark?
- history / Spark history
- URL / What is Apache Spark?
- limitations / What Apache Spark is not
- MapReduce (MR), issues / MapReduce issues
- reference link / MapReduce issues
- versus MapReduce (MR) / MapReduce issues
- stack components / Spark's stack
- ecosystem / Spark's stack
- packages, URL / Spark's stack
- combining, with Hadoop / Why Hadoop plus Spark?
- features / Spark features
- storage performance / Frequently asked questions about Spark
- fault recovery / Frequently asked questions about Spark
- Apache Zeppelin
- about / Introducing Apache Zeppelin
- versus Jupyter / Jupyter versus Zeppelin
- installing / Installing Apache Zeppelin
- installing, with Ambari service / Ambari service
- installing, manually / The manual method
- analytics / Analytics with Zeppelin
- Livy REST job server, using / Using Livy with Zeppelin
- Apache Zeppelin, components
- frontend / Introducing Apache Zeppelin
- Zeppelin Server / Introducing Apache Zeppelin
- Pluggable Interpreter System / Introducing Apache Zeppelin
- interpreters / Introducing Apache Zeppelin
- application container
- about / YARN
- applications
- monitoring / Monitoring applications
- Automatic Schema Discovery / Automatic Schema Discovery
- Avro
- working with / Working with AVRO
B
- basic Dataset/DataFrame functions
- about / Basic Dataset/DataFrame functions
- As[U] / Basic Dataset/DataFrame functions
- toDF / Basic Dataset/DataFrame functions
- explain / Basic Dataset/DataFrame functions
- printSchema / Basic Dataset/DataFrame functions
- createTempView / Basic Dataset/DataFrame functions
- cache() / Basic Dataset/DataFrame functions
- persist() / Basic Dataset/DataFrame functions
- basic sources
- about / Basic sources
- TCP stream / Basic sources
- file stream / Basic sources
- Akka actors / Basic sources
- queue of RDDs / Basic sources
- batch API
- URL / A batch session
- batch session
- about / A batch session
- beeline client
- used, for querying data / Querying data using beeline client
- Berkeley Data Analytics Stack (BDAS)
- about / Spark history
- Big Data analytics
- about / A typical Big Data analytics project life cycle
- project, life cycle / A typical Big Data analytics project life cycle
- issues, identifying / Identifying the problem and outcomes
- outcomes, identifying / Identifying the problem and outcomes
- necessary data, identifying / Identifying the necessary data
- data collection / Data collection
- data, preprocessing / Preprocessing data and ETL
- ETL, preprocessing / Preprocessing data and ETL
- performing / Performing analytics
- data, visualizing / Visualizing data
- Hadoop and Spark, role / The role of Hadoop and Spark
- tools and techniques / Tools and techniques
- Big Data science
- Hadoop and Spark, role / Big Data science and the role of Hadoop and Spark, The role of Hadoop and Spark
- data analytics, shifting from / A fundamental shift from data analytics to data science
- data scientists, versus software engineers / Data scientists versus software engineers
- data scientists, versus data analysts / Data scientists versus data analysts
- data scientists, versus business analysts / Data scientists versus business analysts
- project life cycle / A typical data science project life cycle
- Big Data science, project life cycle
- about / A typical data science project life cycle
- hypothesis and modeling / Hypothesis and modeling
- effectiveness, measuring / Measuring the effectiveness
- improvements, making / Making improvements
- results, communicating / Communicating the results
- BI Tools
- integrating with / Integration with BI tools
- broadcast variables / Shared variables
- built-in sources
- about / Built-in sources
- text files / Working with text files
- JSON / Working with JSON
- Parquet / Working with Parquet
- Optimized Row Columnar (ORC) / Working with ORC
- JDBC / Working with JDBC
- CSV / Working with CSV
- Bulk Synchronous Parallel (BSP)
- Business Intelligence (BI) / Architecture of Spark SQL, Spark SQL as a distributed SQL engine, Parquet
C
- caching / Persistence and caching
- catalog
- used, for accessing metadata / Accessing metadata using Catalog
- Catalyst / Architecture of Spark SQL
- checkpointing
- driver failures, recovering / Recovering with checkpointing
- Classic MapReduce
- about / MapReduce v1 versus MapReduce v2
- classification
- Naive Bayes / Supervised learning
- Decision Trees / Supervised learning
- ensemble algorithms / Supervised learning
- Cloudera Distribution for Hadoop (CDH)
- installing / Installing Hadoop plus Spark clusters
- URL / Installing Hadoop plus Spark clusters
- working with / Working with CDH
- clustering algorithms
- about / Unsupervised learning
- K-Means / Unsupervised learning
- Gaussian Mixture / Unsupervised learning
- Power Iteration Clustering (PIC) / Unsupervised learning
- Latent Dirichlet Allocation (LDA) / Unsupervised learning
- Streaming K-Means / Unsupervised learning
- cluster resource managers
- about / Cluster resource managers
- standalone mode / Standalone
- YARN / YARN
- collaborative filtering
- about / Recommender systems, Collaborative filtering
- Alternating Least Squares (ALS) / Recommender systems
- user-based collaborative filtering / User-based collaborative filtering
- item-based collaborative filtering / Item-based collaborative filtering
- reference link / A recommendation system with MLlib
- column pruning / Working with ORC
- common Dataset/DataFrame operations
- about / Common Dataset/DataFrame operations
- input and output operations / Input and Output Operations
- built-in functions / Built-in functions, aggregate functions, and window functions
- aggregate functions / Built-in functions, aggregate functions, and window functions
- window functions / Built-in functions, aggregate functions, and window functions
- components, Spark SQL
- compression formats
- about / Compression formats
- standard compression formats / Standard compression formats
- configuration parameters, for submitting applications
- --master / Important application configurations
- --class / Important application configurations
- --deploy-mode / Important application configurations
- --conf / Important application configurations
- --py-files / Important application configurations
- --supervise / Important application configurations
- --driver-memory / Important application configurations
- --executor-memory / Important application configurations
- --total-executor-cores / Important application configurations
- --num-executors / Important application configurations
- --executor-cores / Important application configurations
- connected components
- about / Connected components
- content-based filtering
- about / Content-based filtering
- continuous bag of words
- CSV
- working with / Working with CSV
- custom sources
- about / Custom sources
D
- DAG (Directed Acyclic Graph) / Lineage Graph
- data
- caching / Caching data
- querying, beeline client used / Querying data using beeline client
- querying, spark-sql CLI used / Querying data from Hive using spark-sql CLI
- dataflows
- with Apache NiFi / Introducing Apache NiFi for dataflows
- reference link / Dataflows and analytics with NiFi
- DataFrame
- creating, from DataSources API / Creating a DataFrame from a DataSources API
- creating, from Hive / Creating a DataFrame from Hive
- using, with SparkR / Using DataFrames with SparkR
- DataFrame API
- DataFrames
- about / Spark's stack
- evolution / Evolution of DataFrames and Datasets
- using, scenarios / When to use RDDs, Datasets, and DataFrames?
- creating / Creating DataFrames
- creating, from structured data files / Creating DataFrames from structured data files
- creating, from RDDs / Creating DataFrames from RDDs
- creating, from Hive tables / Creating DataFrames from tables in Hive
- creating, from external databases / Creating DataFrames from external databases
- converting, to RDDs / Converting DataFrames to RDDs
- converting, to Datasets / Converting a DataFrame to a Dataset
- creating, for recommendation system with MLlib / Exploring the data with DataFrames
- DataFrames, benefits
- about / Why Datasets and DataFrames?
- optimization / Optimization
- speed / Speed
- Automatic Schema Discovery / Automatic Schema Discovery
- multiple sources / Multiple sources, multiple languages
- multiple languages / Multiple sources, multiple languages
- interoperability, between RDDs / Interoperability between RDDs and others
- predicates, pushing to source systems / Select and read necessary data only
- data locality / Data locality
- Dataset API
- Datasets
- about / Spark's stack
- evolution / Evolution of DataFrames and Datasets
- using, scenarios / When to use RDDs, Datasets, and DataFrames?
- creating / Creating Datasets
- DataFrame, converting to / Converting a DataFrame to a Dataset
- converting, to DataFrames / Converting a Dataset to a DataFrame
- DataFrames, converting to / Converting a Dataset to a DataFrame
- DataSources API
- DataFrame, creating / Creating a DataFrame from a DataSources API
- Data Sources API
- about / Introducing SQL, Datasources, DataFrame, and Dataset APIs, Data Sources API
- advantages / Introducing SQL, Datasources, DataFrame, and Dataset APIs
- read functions / Read and write functions
- write functions / Read and write functions
- data types, Spark MLlib
- local vector / Spark MLlib data types
- labeled point / Spark MLlib data types
- local matrix / Spark MLlib data types
- distributed matrix / Spark MLlib data types
- Decision Trees
- about / Supervised learning
- dense vector
- about / Exploring the Mahout shell
- Dimensionality Reduction
- about / Unsupervised learning
- Singular Value Decomposition (SVD) / Unsupervised learning
- Principal Component Analysis (PCA) / Unsupervised learning
- direct approach, Kafka
- about / Direct approach (no receivers)
- benefits / Direct approach (no receivers)
- Directed Acyclic Graph (DAG) / RDD Transformations versus Dataset and DataFrames Transformations, Optimization
- about / What is a graph?
- Discretized Stream
- about / Architecture of Spark Streaming
- distributed matrix
- RowMatrix / Spark MLlib data types
- IndexedRowMatrix / Spark MLlib data types
- CoordinateMatrix / Spark MLlib data types
- BlockMatrix / Spark MLlib data types
- Domain Specific Language (DSL) / Common Dataset/DataFrame operations, History of Spark SQL
- Domain Specific Language (DSL) functions
- agg / DSL functions
- distinct / DSL functions
- drop / DSL functions
- filter / DSL functions
- join / DSL functions
- limit / DSL functions
- sort / DSL functions
- groupby / DSL functions
- unionAll / DSL functions
- na / DSL functions
- driver failures
- about / Failure of driver
- recovering, with checkpointing / Recovering with checkpointing
- recovering, with WAL / Recovering with WAL
- DStream
- about / Architecture of Spark Streaming
- benefits / Architecture of Spark Streaming
E
- EdgeRDD operations
- about / VertexRDD and EdgeRDD operations
- mapping / Mapping VertexRDD and EdgeRDD
- joining / Joining EdgeRDDs
- edge directions, reversing / Reversing edge directions
- ensemble algorithms
- about / Supervised learning
- Enterprise Data Warehouse (EDW) optimization / Real-life use cases
- explicit feedback
- versus implicit feedback / Explicit versus implicit feedback
- external databases
- DataFrames, creating from / Creating DataFrames from external databases
- external data sources
- about / External sources
- Extract, Transform, and Load (ETL) / Big Data analytics and the role of Hadoop and Spark, Avro
- Extra Packages for Enterprise Linux (EPEL)
- about / Installing and configuring R
F
- fault-tolerance, Spark Streaming
- failure of executor / Failure of executor
- failure of driver / Failure of driver
- feature extraction and transformation
- about / Feature extraction and transformation
- term frequency / Feature extraction and transformation
- Word2Vec / Feature extraction and transformation
- Standard Scaler / Feature extraction and transformation
- Normalizer / Feature extraction and transformation
- Chi-Square Selector / Feature extraction and transformation
- file formats
- about / File formats
- sequence file / Sequence file
- protocol buffers / Protocol buffers and thrift
- thrift / Protocol buffers and thrift
- Avro / Avro
- Parquet / Parquet
- Record Columnar File (RCFile) / RCFile and ORCFile
- Optimized Row Columnar (ORC) / RCFile and ORCFile
- flight data
G
- Gradient-boosted Trees
- about / Supervised learning
- graph
- about / What is a graph?
- algorithms / Graph algorithms
- graph databases
- versus graph processing systems / Graph databases versus graph processing systems
- GraphFrames
- about / Introducing GraphFrames
- motif finding algorithm / Motif finding
- loading / Loading and saving GraphFrames
- saving / Loading and saving GraphFrames
- graph processing
- about / Introducing graph processing
- graph processing systems
- versus graph databases / Graph databases versus graph processing systems
- graph transformation
- about / Transforming graphs
- attributes, transforming / Transforming attributes
- graphs, modifying / Modifying graphs
- graphs, joining / Joining graphs
- VertexRDD operations / VertexRDD and EdgeRDD operations
- EdgeRDD operations / VertexRDD and EdgeRDD operations
- GraphX
- about / Spark's stack, Introducing GraphX, Getting started with GraphX
- flight data, analyzing / Analyzing flight data using GraphX
- Pregel API, implementing / Pregel API
- GraphX, algorithms
- about / GraphX algorithms
- PageRank / GraphX algorithms
- triangle counting / Triangle counting
- connected components / Connected components
- GraphX operations
- about / Basic operations of GraphX
- graph, creating / Creating a graph
- counting / Counting
- graph, filtering / Filtering
- inDegrees / inDegrees, outDegrees, and degrees
- outDegrees / inDegrees, outDegrees, and degrees
- degrees / inDegrees, outDegrees, and degrees
- triplets / Triplets
- graphs, transforming / Transforming graphs
- groupEdges operator
- about / Modifying graphs
H
- H2O
- machine learning / Machine learning with H2O and Spark
- Sparkling Water / Why Sparkling Water?
- URL / Getting started with Sparkling Water
- H2O Flow
- Hadoop
- machine learning / Machine learning on Spark and Hadoop
- Hadoop Distributed File System (HDFS)
- Hadoop Distributed File System (HDFS), features
- high availability / Features of HDFS
- data integrity / Features of HDFS
- HDFS ACLs / Features of HDFS
- Snapshots / Features of HDFS
- HDFS rebalancing / Features of HDFS
- caching / Features of HDFS
- APIs / Features of HDFS
- data encryption / Features of HDFS
- Kerberos authentication / Features of HDFS
- NFS access / Features of HDFS
- metrics / Features of HDFS
- rack awareness / Features of HDFS
- storage policies / Features of HDFS
- WORM / Features of HDFS
- Hadoop file formats
- leveraging, in Spark / Leveraging Hadoop file formats in Spark
- Hadoop plus Spark clusters
- installing / Installing Hadoop plus Spark clusters
- Hadoop User Experience (Hue)
- HBase
- Spark Streaming / Spark Streaming with Kafka and HBase
- integration with / Integration with HBase
- Hive
- DataFrame, creating / Creating a DataFrame from Hive
- Hivemall
- about / Introducing Hivemall
- benefits / Introducing Hivemall
- compatible JAR file, URL / Introducing Hivemall
- reference link / Introducing Hivemall
- Hivemall for Spark
- about / Introducing Hivemall for Spark
- URL / Introducing Hivemall for Spark
- reference link / Introducing Hivemall for Spark
- Hive on Spark project / Hive on Spark
- Hive query language (HiveQL)
- about / Creating a DataFrame from Hive
- Hive tables
- DataFrames, creating from / Creating DataFrames from tables in Hive
- Hortonworks DataFlow (HDF)
- Hortonworks Data Platform (HDP)
- working with / Working with HDP, MapR, and Spark pre-built packages
- Hortonworks Data Platform (HDP) Sandbox
- installing / Installing Hadoop plus Spark clusters
- URL / Installing Hadoop plus Spark clusters
- Hue Notebook
- Livy REST job server, using / Using Livy with Hue Notebook
I
- Idempotent updates
- about / Output stores
- implicit feedback
- versus explicit feedback / Explicit versus implicit feedback
- input sources
- about / Input sources and output stores
- basic sources / Input sources and output stores, Basic sources
- advanced sources / Input sources and output stores, Advanced sources
- custom sources / Input sources and output stores, Custom sources
- receivers, reliability / Receiver reliability
- integrated development environment (IDE)
- about / Using SparkR with RStudio
- interactive session
- about / An interactive session
- Internet of Things (IOT)
- interpreter binding
- about / Analytics with Zeppelin
- reference link / Using Livy with Zeppelin
- Inverse Document Frequency (IDF)
- IPython kernel
- URL / Installing Jupyter
- item-based collaborative filtering
J
- Java Management Extensions (JMX)
- about / Features of HDFS
- Java serialization / Serialization
- JDBC
- working with / Working with JDBC
- join operation
- about / Join
- JSON
- working with / Working with JSON
- Jupyter
- about / Introducing Jupyter
- installing / Installing Jupyter
- analytics / Analytics with Jupyter
- versus Apache Zeppelin / Jupyter versus Zeppelin
K
- k-means model
- using / Using the k-means model
- Kafka
- Spark Streaming / Spark Streaming with Kafka and HBase
- receiver-based approach / Receiver-based approach
- direct approach / Direct approach (no receivers)
- Kerberos Security Enabled Spark Cluster
- Spark applications, connecting to / Connecting to the Kerberos Security Enabled Spark Cluster
- Kinesis Client Library (KCL)
- about / Advanced sources
- Kryo serialization / Serialization
L
- Latent Dirichlet Allocation (LDA)
- about / Machine learning algorithms
- lazy evaluation / Lazy evaluation
- Lineage Graph / Lineage Graph
- Livy REST job server
- about / The Livy REST job server and Hue Notebooks
- components / The Livy REST job server and Hue Notebooks
- installing / Installing and configuring the Livy server and Hue
- configuring / Installing and configuring the Livy server and Hue
- using / Using the Livy server
- interactive session / An interactive session
- batch session / A batch session
- Spark Contexts, sharing / Sharing SparkContexts and RDDs
- RDDs, sharing / Sharing SparkContexts and RDDs
- using, with Hue Notebook / Using Livy with Hue Notebook
- using, with Apache Zeppelin / Using Livy with Zeppelin
- local DataFrame
- creating / Creating a local DataFrame
- logistic regression
- used, for spam detection / Logistic regression for spam detection
M
- machine learning
- about / Introducing machine learning
- advantages / Introducing machine learning
- disadvantages / Introducing machine learning
- on Spark / Machine learning on Spark and Hadoop
- on Hadoop / Machine learning on Spark and Hadoop
- with H2O / Machine learning with H2O and Spark
- with Spark / Machine learning with H2O and Spark
- with SparkR / Machine learning with SparkR
- Naive Bayes model, using / Using the Naive Bayes model
- k-means model, using / Using the k-means model
- machine learning algorithms
- about / Machine learning algorithms
- supervised learning / Supervised learning
- unsupervised learning / Unsupervised learning
- recommender systems / Recommender systems
- feature extraction and transformation / Feature extraction and transformation
- optimization / Optimization
- Spark MLlib, data types / Spark MLlib data types
- example / An example of machine learning algorithms
- logistic regression, for spam detection / Logistic regression for spam detection
- Machine Learning Library (MLlib)
- about / Pros and cons of Spark Streaming
- machine learning pipelines
- building / Building machine learning pipelines, Building an ML pipeline
- DataFrame / Building machine learning pipelines
- Transformer / Building machine learning pipelines
- Estimator / Building machine learning pipelines
- Pipeline / Building machine learning pipelines
- Parameters / Building machine learning pipelines
- workflow, example / An example of a pipeline workflow
- models, saving / Saving and loading models
- models, loading / Saving and loading models
- Mahout
- integrating, with Spark / The Mahout and Spark integration
- installing / Installing Mahout
- universal recommendation system, building / Building a universal recommendation system with Mahout and search tool
- Mahout shell
- exploring / Exploring the Mahout shell
- dense vector / Exploring the Mahout shell
- sparse vector / Exploring the Mahout shell
- MapR
- working with / Working with HDP, MapR, and Spark pre-built packages
- MapR Control System (MCS) / Working with HDP, MapR, and Spark pre-built packages
- MapReduce (MR)
- about / MapReduce, Introducing Apache Spark
- issues / MapReduce issues
- versus Apache Spark / MapReduce issues
- MapReduce (MR), features
- data locality / MapReduce features
- APIs / MapReduce features
- distributed cache / MapReduce features
- combiner / MapReduce features
- custom partitioner / MapReduce features
- sorting / MapReduce features
- joining / MapReduce features
- counters / MapReduce features
- MapReduce v1
- versus MapReduce v2 / MapReduce v1 versus MapReduce v2
- challenges / MapReduce v1 challenges
- MapR Sandbox
- installing / Installing Hadoop plus Spark clusters
- URL / Installing Hadoop plus Spark clusters
- mapWithState operation
- about / mapWithState
- Markdowns
- reference link / Analytics with Jupyter
- mask operator
- about / Modifying graphs
- Mesos
- Message Passing Interface (MPI)
- metadata
- accessing, catalog used / Accessing metadata using Catalog
- MLlib
- about / Spark's stack, Machine learning on Spark and Hadoop
- spark.mllib / Machine learning on Spark and Hadoop
- spark.ml / Machine learning on Spark and Hadoop
- modes, for running Spark
- local mode / Cluster resource managers
- standalone mode / Cluster resource managers
- YARN mode / Cluster resource managers
- Mesos mode / Cluster resource managers
- motif finding algorithm
- about / Motif finding
- MR job
- about / MapReduce issues
N
- Naive Bayes
- about / Supervised learning
- Naive Bayes model
- using / Using the Naive Bayes model
- natural language processing (NLP) / A fundamental shift from data analytics to data science
- nbconvert tool
- about / Introducing Jupyter
- nbviewer tool
- about / Introducing Jupyter
- URL / Introducing Jupyter
- NextGen
- about / MapReduce v1 versus MapReduce v2
- NiFi templates
- NodeManager
- about / YARN
O
- Online Analytical Processing (OLAP) / Tools and techniques
- optimization algorithms
- about / Optimization
- Stochastic Gradient Descent / Optimization
- Limited-memory BFGS (L-BFGS) / Optimization
- Optimized Row Columnar (ORC)
- about / RCFile and ORCFile
- working with / Working with ORC
- reference / Working with ORC
- output operations
- about / Output operations
- print()/pprint() / Output operations
- saveAsTextFiles / Output operations
- saveAsObjectFiles / Output operations
- saveAsHadoopFile / Output operations
- saveAsNewAPIHadoopDataset / Output operations
- saveToCassandra / Output operations
- foreachRDD(func) / Output operations
- output stores
P
- packages, Spark
- PageRank
- about / GraphX algorithms
- Pair RDDs / Pair RDDs
- Pandas
- working with / Working with Pandas
- parallelism, in RDDs / Parallelism in RDDs
- Parquet
- about / Parquet
- use case / Parquet
- working with / Working with Parquet
- reference / Working with Parquet
- partition pruning / Working with ORC
- performance tuning parameters, Spark SQL
- spark.sql.inMemoryColumnarStorage.compressed / Performance optimizations
- spark.sql.inMemoryColumnarStorage.batchSize / Performance optimizations
- spark.sql.autoBroadcastJoinThreshold / Performance optimizations
- spark.sql.files.maxPartitionBytes / Performance optimizations
- spark.sql.shuffle.partitions / Performance optimizations
- spark.sql.planner.externalSort / Performance optimizations
- persistence / Persistence and caching
- personally identifiable information (PII) / Identifying the necessary data
- pipelining / Pipelining
- Power Iteration Clustering (PIC)
- about / Machine learning algorithms
- predicate pushdown / Working with ORC
- Pregel API
- implementing / Pregel API
- Principal Component Analysis (PCA)
- about / Machine learning algorithms
- protocol buffers
- about / Protocol buffers and thrift
- public movielens data
- Python DataFrame operations
- reference / Speed
R
- R
- about / Introducing R and SparkR, What is R?
- features / What is R?
- limitations / What is R?
- installing / Installing and configuring R
- configuring / Installing and configuring R
- Random Forests
- about / Supervised learning
- RDD actions
- reference / Transformations and actions
- RDD operations
- transformations / Transformations and actions
- actions / Transformations and actions
- about / RDD operations
- RDDs
- issues / What's wrong with RDDs?
- using, scenarios / When to use RDDs, Datasets, and DataFrames?
- DataFrames, creating from / Creating DataFrames from RDDs
- DataFrames, converting to / Converting DataFrames to RDDs
- sharing / Sharing SparkContexts and RDDs
- creating, for recommendation system with MLlib / Creating RDDs
- RDD transformations
- reference / Transformations and actions
- RDD Transformations
- versus Dataset and DataFrame Transformations / RDD Transformations versus Dataset and DataFrames Transformations
- Read-Evaluate-Print Loop (REPL)
- about / Introducing web-based notebooks
- real-life use cases
- about / Real-life use cases
- real-time processing
- about / Introducing real-time processing
- Spark Streaming, pros and cons / Pros and cons of Spark Streaming
- Spark Streaming, history / History of Spark Streaming
- receiver-based approach, Kafka
- about / Receiver-based approach
- Zookeeper / Role of Zookeeper
- receivers
- reliability / Receiver reliability
- reliable receiver / Receiver reliability
- unreliable receiver / Receiver reliability
- recommendation system, with MLlib
- building / A recommendation system with MLlib
- environment, preparing / Preparing the environment
- RDDs, creating / Creating RDDs
- data, exploring with DataFrames / Exploring the data with DataFrames
- testing dataset, creating / Creating training and testing datasets
- training dataset, creating / Creating training and testing datasets
- model, creating / Creating a model
- predictions, creating / Making predictions
- model, evaluating with testing data / Evaluating the model with testing data
- model accuracy, checking / Checking the accuracy of the model
- explicit feedback, versus implicit feedback / Explicit versus implicit feedback
- recommendation systems
- building / Building recommendation systems
- examples / Building recommendation systems
- content-based filtering / Content-based filtering
- collaborative filtering / Collaborative filtering
- limitations / Limitations of a recommendation system
- recommender systems
- about / Recommender systems
- collaborative filtering / Recommender systems
- Record Columnar File (RCFile)
- about / RCFile and ORCFile
- regression
- about / Supervised learning
- Linear Regression / Supervised learning
- Logistic Regression / Supervised learning
- Support Vector Machines / Supervised learning
- Relational Database Management System (RDBMS) / Evolution of DataFrames and Datasets, Big Data analytics and the role of Hadoop and Spark
- reliable receiver
- about / Receiver reliability
- REPL (read-eval-print loop) / Spark Shell
- Resilient Distributed Dataset (RDD)
- about / MapReduce issues, Learning Spark core concepts, Resilient Distributed Dataset, What is a graph?
- collection, parallelizing / Method 1 – parallelizing a collection
- data, reading from file / Method 2 – reading from a file
- files, reading from HDFS / Reading files from HDFS
- High Availability (HA), used for reading files from HDFS / Reading files from HDFS with HA enabled
- parallelism / Parallelism in RDDs
- ResourceManager
- about / YARN
- REST API
- reverse operator
- about / Modifying graphs
- R project
- URL / What is R?
- RStudio
- SparkR, using / Using SparkR with RStudio
S
- Samsara
- scheduling modes, Mesos
- Schema-on-Read (SOR) approach / Big Data analytics and the role of Hadoop and Spark
- Schema-on-Write approach / Big Data analytics and the role of Hadoop and Spark
- SchemaRDD
- about / History of Spark SQL
- search tool
- universal recommendation system, building / Building a universal recommendation system with Mahout and search tool
- sequence file
- about / Sequence file
- use case / Sequence file
- serialization / Serialization
- Java serialization / Serialization
- Kryo serialization / Serialization
- shared variables / Shared variables
- Shark
- issues / History of Spark SQL
- about / History of Spark SQL
- Singular Value Decomposition (SVD)
- about / Machine learning algorithms
- skip-gram
- spam detection
- with logistic regression / Logistic regression for spam detection
- Spark
- Hadoop file formats, leveraging in / Leveraging Hadoop file formats in Spark
- terminologies / Lifecycle of Spark program
- storage levels / Storage levels
- machine learning / Machine learning on Spark and Hadoop, Machine learning with H2O and Spark
- Mahout, integrating with / The Mahout and Spark integration
- Spark-on-HBase connector / DataFrame based Spark-on-HBase connector
- spark-sql CLI
- used, for querying data / Querying data from Hive using spark-sql CLI
- spark.mllib package
- spark.ml package
- Spark applications
- about / Spark applications, Spark applications
- connecting, to Kerberos Security Enabled Spark Cluster / Connecting to the Kerberos Security Enabled Spark Cluster
- versus Spark shell / Spark Shell versus Spark applications
- SparkConf
- about / SparkConf
- Spark configuration
- precedence / Spark Conf precedence order
- Spark context
- about / Spark context
- creating / Creating a Spark context
- Spark Contexts
- sharing / Sharing SparkContexts and RDDs
- Spark Core
- about / Spark's stack
- Spark daemons
- about / Starting Spark daemons
- starting, for standalone resource manager / Working with CDH
- Sparkling Water
- about / Why Sparkling Water?
- on YARN / An application flow on YARN
- downloading / Getting started with Sparkling Water
- URL / Getting started with Sparkling Water
- reference link / Getting started with Sparkling Water
- Sparkling Water project
- SparkMagic
- Spark MLlib
- data types / Spark MLlib data types
- Spark packages
- about / External sources
- Spark pre-built package
- working with / Working with HDP, MapR, and Spark pre-built packages
- Spark program
- lifecycle / Lifecycle of Spark program
- SparkR
- about / Spark's stack, Introducing R and SparkR, Introducing SparkR
- DataSources API / Introducing SparkR
- DataFrame optimizations / Introducing SparkR
- reference link / Introducing SparkR
- higher scalability / Introducing SparkR
- architecture / Architecture of SparkR
- exploring / Getting started with SparkR
- R, installing / Installing and configuring R
- R, configuring / Installing and configuring R
- scripts, using / Using SparkR scripts
- DataFrame, using / Using DataFrames with SparkR
- using, with RStudio / Using SparkR with RStudio
- machine learning / Machine learning with SparkR
- using, with Zeppelin / Using SparkR with Zeppelin
- Spark resource managers
- about / Spark resource managers – Standalone, YARN, and Mesos
- local mode, versus cluster mode / Local versus cluster mode
- cluster resource managers / Cluster resource managers
- selecting / Which resource manager to use?
- SparkR shell
- using / Using SparkR shell
- local mode / Local mode
- standalone mode / Standalone mode
- Yarn mode / Yarn mode
- local DataFrame, creating / Creating a local DataFrame
- DataFrame, creating from DataSources API / Creating a DataFrame from a DataSources API
- DataFrame, creating from Hive / Creating a DataFrame from Hive
- Spark Scala shell
- exploring / Exploring the Spark Scala shell
- SparkSession
- creating / Creating SparkSession
- Spark shell
- about / Spark Shell
- versus Spark applications / Spark Shell versus Spark applications
- Spark SQL
- about / Spark's stack
- history / History of Spark SQL
- architecture / Architecture of Spark SQL
- components / Introducing SQL, Datasources, DataFrame, and Dataset APIs
- performance tuning parameters / Performance optimizations
- as distributed SQL engine / Spark SQL as a distributed SQL engine
- Spark SQL Thrift Server
- for JDBC/ODBC access / Spark SQL's Thrift server for JDBC/ODBC access
- Spark Streaming
- about / Spark's stack
- pros and cons / Pros and cons of Spark Streaming
- history / History of Spark Streaming
- architecture / Architecture of Spark Streaming
- reference link / Architecture of Spark Streaming
- application flow / Spark Streaming application flow
- stateful stream processing / Stateless and stateful stream processing
- stateless stream processing / Stateless and stateful stream processing
- transformations / Spark Streaming transformations and actions
- actions / Spark Streaming transformations and actions
- with Kafka / Spark Streaming with Kafka and HBase
- with HBase / Spark Streaming with Kafka and HBase
- advanced concepts / Advanced concepts of Spark Streaming
- DataFrames, using / Using DataFrames
- MLlib operations / MLlib operations
- caching/persistence / Caching/persistence
- fault-tolerance / Fault-tolerance in Spark Streaming
- applications performance, tuning / Performance tuning of Spark Streaming applications
- SparkSubmit
- about / SparkSubmit
- sparse vector
- about / Exploring the Mahout shell
- SQL
- standalone mode, Spark cluster resource managers / Standalone
- standalone resource manager
- Spark daemons, starting for / Working with CDH
- standard compression formats
- about / Standard compression formats
- usage / Standard compression formats
- stateful stream processing
- stateless stream processing
- storage levels, Spark
- MEMORY_ONLY / Storage levels
- MEMORY_AND_DISK / Storage levels
- MEMORY_ONLY_SER / Storage levels
- MEMORY_AND_DISK_SER / Storage levels
- DISK_ONLY / Storage levels
- MEMORY_ONLY_2 / Storage levels
- MEMORY_AND_DISK_2 / Storage levels
- OFF_HEAP (experimental) / Storage levels
- selecting / What level to choose?
- storage options, Apache Hadoop
- about / Storage options on Hadoop
- file formats / File formats
- compression formats / Compression formats
- Streaming DataFrames
- about / Streaming Datasets and Streaming DataFrames
- output sinks / Input sources and output sinks
- input sources / Input sources and output sinks
- operations / Operations on Streaming Datasets and Streaming DataFrames
- Streaming Datasets
- about / Streaming Datasets and Streaming DataFrames
- input sources / Input sources and output sinks
- output sinks / Input sources and output sinks
- operations / Operations on Streaming Datasets and Streaming DataFrames
- StreamingListener API
- URL / Monitoring applications
- structured data files
- DataFrames, creating from / Creating DataFrames from structured data files
- Structured Streaming
- about / Spark's stack, Introducing Structured Streaming
- limitations / Introducing Structured Streaming
- application flow / Structured Streaming application flow
- usage / When to use Structured Streaming?
- Streaming Datasets / Streaming Datasets and Streaming DataFrames
- Streaming DataFrames / Streaming Datasets and Streaming DataFrames
- subgraph operator
- about / Modifying graphs
- supervised learning
- about / Supervised learning
- classification / Supervised learning
- regression / Supervised learning
T
- Tachyon
- about / Spark's stack
- term frequency (TF)
- terminologies, Spark
- application / Lifecycle of Spark program
- driver program / Lifecycle of Spark program
- cluster manager / Lifecycle of Spark program
- worker node / Lifecycle of Spark program
- executor / Lifecycle of Spark program
- DAG / Lifecycle of Spark program
- job / Lifecycle of Spark program
- stage / Lifecycle of Spark program
- task / Lifecycle of Spark program
- test data
- about / Introducing machine learning
- text files
- working with / Working with text files
- thrift
- about / Protocol buffers and thrift
- training data
- about / Introducing machine learning
- Transactional updates
- about / Output stores
- transformations
- about / Transformations and actions
- example / Transformations and actions
- transformations, Spark Streaming
- about / Spark Streaming transformations and actions
- union / Union
- join / Join
- transform operation / Transform operation
- updateStateByKey / updateStateByKey
- mapWithState / mapWithState
- window operations / Window operations
- output operations / Output operations
- transform operation
- about / Transform operation
- triangle counting
- about / Triangle counting
- Tungsten
- about / Spark's stack
U
- union operation
- about / Union
- universal recommendation system
- building, with Mahout and search tool / Building a universal recommendation system with Mahout and search tool
- unreliable receiver
- about / Receiver reliability
- unsupervised learning
- about / Unsupervised learning
- clustering algorithms / Unsupervised learning
- Dimensionality Reduction / Unsupervised learning
- updateStateByKey operation
- about / updateStateByKey
- user-based collaborative filtering
- User Defined Functions (UDFs)
- about / Introducing Hivemall
- User Defined Table Functions (UDTFs)
- about / Introducing Hivemall
V
- VertexRDD operations
- about / VertexRDD and EdgeRDD operations
- mapping / Mapping VertexRDD and EdgeRDD
- filtering / Filtering VertexRDDs
- joining / Joining VertexRDDs
- virtual machines (VM)
- about / Installing Hadoop plus Spark clusters
- prerequisites / Installing Hadoop plus Spark clusters
W
- WAL
- driver failures, recovering / Recovering with WAL
- web-based notebooks
- about / Introducing web-based notebooks
- window operations
- about / Window operations
- window / Window operations
- countByWindow / Window operations
- reduceByWindow / Window operations
- reduceByKeyAndWindow / Window operations
- countByValueAndWindow / Window operations
- write-ahead logs (WAL)
- about / History of Spark Streaming
- Write Once and Read Many (WORM)
- about / Features of HDFS
X
- XML
- working with / Working with XML
Y
- YARN
- about / YARN
- dynamic resource allocation / Dynamic resource allocation
- client mode, versus cluster mode / Client mode versus cluster mode
- Sparkling Water, submitting / An application flow on YARN
- YARN settings
- reference / Client mode versus cluster mode
- Yet Another Resource Negotiator (YARN)
- about / Introducing Apache Hadoop, YARN
Z
- Zeppelin
- SparkR, using / Using SparkR with Zeppelin
- ZeppelinHub Viewer
- URL / Analytics with Zeppelin
- Zeppelin notebooks
- URL / Analytics with Zeppelin
- Zookeeper
- about / Role of Zookeeper