Index
A
- Alternating Least Squares (ALS)
- Amazon EC2
- about / Launching Spark on Amazon EC2
- features / Launching Spark on Amazon EC2
- Spark, launching / Launching Spark on Amazon EC2, Getting ready, How to do it...
- URL / Getting ready
- Amazon Elastic Block Storage (EBS)
- about / Loading data from Amazon S3
- Amazon Elastic Compute Cloud (EC2)
- about / Loading data from Amazon S3
- Amazon S3
- data, loading / Loading data from Amazon S3, How to do it...
- about / Loading data from Amazon S3
- URL / Getting ready
- Amazon Web Services (AWS)
- about / Loading data from Amazon S3
- URL / How to do it...
- Apache Cassandra
- about / Loading data from Apache Cassandra
- data, loading / Loading data from Apache Cassandra, How to do it..., There's more...
- arbitrary source
- data, saving / Loading and saving data from an arbitrary source, How to do it...
- data, loading / Loading and saving data from an arbitrary source, How to do it...
B
- batch interval
- about / Introduction
- bias
- versus variance / Doing linear regression with lasso
- about / Doing linear regression with lasso
- binaries
- Spark, installing / Getting ready, How to do it...
- binary classification
- performing, with SVM / Doing binary classification using SVM, How to do it…
- bivariate analysis
- about / Introduction
- broker
- about / Streaming using Kafka
C
- case classes
- used, for inferring schema / Inferring schema using case classes, How to do it...
- Catalyst optimizer
- about / Understanding the Catalyst optimizer
- goals / How it works…
- using, in analysis phase / Analysis
- using, in logical plan optimization phase / Logical plan optimization
- using, in physical planning phase / Physical planning
- using, in code generation phase / Code generation
- classification
- about / Introduction
- performing, with logistic regression / Doing classification using logistic regression, Getting ready, How to do it…
- performing, with decision trees / Doing classification using decision trees, Getting ready, How to do it…, How it works…
- performing, with Random Forests / Doing classification using Random Forests, Getting ready, How to do it…, How it works…
- performing, with Gradient Boosted Trees / Doing classification using Gradient Boosted Trees, How to do it…
- performing, with Naïve Bayes / Doing classification with Naïve Bayes, How to do it…
- cluster centroids
- about / Clustering using k-means
- clustering
- about / Introduction, Clustering using k-means
- k-means algorithm, using / Clustering using k-means, Getting ready, How to do it…
- collaborative filtering
- about / Collaborative filtering using explicit feedback
- explicit feedback, using / Collaborative filtering using explicit feedback, Getting ready, How to do it…
- implicit feedback, using / Collaborative filtering using implicit feedback, Getting ready, How it works…, There's more…
- comma-separated values (CSV) file
- about / Getting ready
- complex event processing (CEP)
- about / Introduction
- compression
- about / Using compression to improve performance
- used, for performance improvement / Using compression to improve performance
- concurrent mark and sweep (CMS)
- about / Optimizing memory
- connected component
- searching / Finding connected components, Getting ready, How to do it…
- Connector/J
- URL / How to do it...
- connector library
- about / There's more...
- consumers
- about / Streaming using Kafka
- correlation
- about / Calculating correlation
- calculating / Calculating correlation, Getting ready, How to do it…
- positive correlation / Calculating correlation
- negative correlation / Calculating correlation
- cost function
- about / Understanding cost function
- analyzing, for linear regression / Understanding cost function
- custom InputFormat
- used, for loading data from HDFS / Loading data from HDFS using a custom InputFormat, How to do it...
D
- data
- loading, from local filesystem / Loading data from the local filesystem, How to do it...
- loading, from HDFS / Loading data from HDFS, How to do it..., There's more…
- loading from HDFS, custom InputFormat used / Loading data from HDFS using a custom InputFormat, How to do it...
- loading, from Amazon S3 / Loading data from Amazon S3, How to do it...
- loading, from Apache Cassandra / Loading data from Apache Cassandra, How to do it..., There's more...
- loading, from relational databases / Loading data from relational databases, How to do it..., How it works…, Loading and saving data from relational databases, How to do it...
- loading, in Parquet format / Loading and saving data using the Parquet format, How to do it..., How it works…, There's more…
- saving, in Parquet format / Loading and saving data using the Parquet format, How to do it..., How it works…, There's more…
- loading, in JSON format / Loading and saving data using the JSON format, How to do it..., How it works…
- saving, in JSON format / Loading and saving data using the JSON format, How to do it..., How it works…
- saving, to relational databases / Loading and saving data from relational databases, How to do it...
- loading, from arbitrary source / Loading and saving data from an arbitrary source, How to do it...
- saving, to arbitrary source / Loading and saving data from an arbitrary source, How to do it...
- DataFrame
- about / Introduction
- data rate
- about / Introduction
- data source API
- URL / There's more…
- decision trees
- classification, performing / Doing classification using decision trees, Getting ready, How to do it…, How it works…
- dimensionality reduction
- about / Dimensionality reduction with principal component analysis
- purposes / Dimensionality reduction with principal component analysis
- with principal component analysis (PCA) / Dimensionality reduction with principal component analysis, Getting ready, How to do it…
- with singular value decomposition (SVD) / Dimensionality reduction with singular value decomposition, Getting ready, How to do it…
- directed graph
- about / Introduction
- directories
- ephemeral-hdfs / How to do it...
- persistent-hdfs / How to do it...
- hadoop-native / How to do it...
- Scala / How to do it...
- Shark / How to do it...
- Spark / How to do it...
- spark-ec2 / How to do it...
- Tachyon / How to do it...
- Discretized Stream (DStream)
- about / Introduction
- distributed graph processing
- data parallel / Introduction
- graph parallel / Introduction
- distributed matrix
- about / Creating matrices
- RowMatrix / Creating matrices
- IndexedRowMatrix / Creating matrices
- CoordinateMatrix / Creating matrices
- domain-specific language (DSL)
- about / Introduction
E
- Eclipse
- Spark application, developing with Maven / Developing Spark applications in Eclipse with Maven, How to do it...
- URL / Getting ready
- Spark application, developing with SBT / Developing Spark applications in Eclipse with SBT, How to do it...
- Eden
- about / Optimizing memory
- ensemble learning algorithms
- Estimator
- about / Getting ready
- explicit feedback
- used, for collaborative filtering / Collaborative filtering using explicit feedback, Getting ready, How to do it…
F
- fat-free XML
- features, vectors
- about / Creating vectors
- feature scaling
- about / Getting ready
- performing / Getting ready
G
- garbage-first GC (G1)
- about / Optimizing memory
- garbage collection
- optimizing / Optimizing garbage collection, How to do it…
- garbage collector (GC)
- about / Optimizing memory
- Gradient Boosted Trees (GBTs)
- about / Doing classification using Gradient Boosted Trees
- classification, performing / Doing classification using Gradient Boosted Trees, How to do it…
- gradient descent
- about / Understanding cost function
- graphs
- directed graph / Introduction
- regular graph / Introduction
- fundamental operations / Fundamental operations on graphs, How to do it…
H
- Hadoop distributed file system (HDFS)
- about / How to do it...
- HDFS
- about / Introduction
- data, loading / Loading data from HDFS, How to do it..., There's more…
- data loading, custom InputFormat used / Loading data from HDFS using a custom InputFormat, How to do it...
- HiveContext
- about / Creating HiveContext
- features / Creating HiveContext
- creating / Creating HiveContext, Getting ready, How to do it...
- hyperspace
- about / Creating vectors
- hypothesis function
- about / Getting ready, Understanding cost function
- hypothesis testing
- about / Doing hypothesis testing
- performing / Doing hypothesis testing, How to do it…
I
- implicit feedback
- used, for collaborative filtering / Collaborative filtering using implicit feedback, Getting ready, How it works…, There's more…
- InputFormat storage format
- about / Introduction
- IntelliJ IDEA
- Spark application, developing with Maven / Developing a Spark application in IntelliJ IDEA with Maven, How to do it...
- Spark application, developing with SBT / Developing a Spark application in IntelliJ IDEA with SBT, How to do it...
J
- JdbcRDD
- JSON format
- data, loading / Loading and saving data using the JSON format, How to do it..., How it works…
- data, saving / Loading and saving data using the JSON format, How to do it..., How it works…
K
- k-means algorithm
- using / Clustering using k-means, Getting ready, How to do it…
- cluster assignment step / Clustering using k-means
- move centroid step / Clustering using k-means
- Kafka
- about / Streaming using Kafka
- using / Streaming using Kafka, How to do it..., There's more…
- kilobytes per second (kbps)
- about / Introduction
- Kryo library
L
- labeled point
- about / Creating a labeled point
- creating / Creating a labeled point, How to do it…
- lasso
- about / Doing linear regression with lasso
- linear regression, performing / Doing linear regression with lasso, How to do it…
- URL / Doing linear regression with lasso
- latent features
- about / Introduction
- level of parallelism
- optimizing / Optimizing the level of parallelism
- leverage application semantics
- used, for manual memory management / Manual memory management by leverage application semantics
- lineage
- about / Introduction
- linear regression
- about / Using linear regression, Understanding cost function
- using / Getting ready, How to do it…
- analyzing, for cost function / Understanding cost function
- performing, with lasso / Doing linear regression with lasso, How to do it…
- local filesystem
- data, loading / Loading data from the local filesystem, How to do it...
- local matrix
- about / Creating matrices
- logistic function
- logistic regression
- classification, performing / Doing classification using logistic regression, Getting ready, How to do it…
- LZO
M
- machine learning
- about / Introduction
- machine learning pipelines
- creating, ML library used / Creating machine learning pipelines using ML, Getting ready, How to do it…
- manual memory management
- by leverage application semantics / Manual memory management by leverage application semantics
- matrices
- about / Creating matrices
- creating / Creating matrices, How to do it…
- local matrix / Creating matrices
- distributed matrix / Creating matrices
- Maven
- Spark source code, building / Building the Spark source code with Maven, How to do it...
- Spark application, developing in Eclipse / Developing Spark applications in Eclipse with Maven, How to do it...
- about / Developing Spark applications in Eclipse with Maven
- features / Developing Spark applications in Eclipse with Maven
- Spark application, developing in IntelliJ IDEA / Developing a Spark application in IntelliJ IDEA with Maven, How to do it...
- measurement scales
- Nominal Scale / Introduction
- Ordinal Scale / Introduction
- Interval Scale / Introduction
- Ratio Scale / Introduction
- megabytes per second (mbps)
- about / Introduction
- memory optimization
- about / Optimizing memory
- improvements / Optimizing memory
- aspects / Optimizing memory
- Mesos
- about / Introduction, Deploying on a cluster with Mesos
- Spark, deploying / Deploying on a cluster with Mesos, How to do it...
- fine-grained mode / How to do it...
- coarse-grained mode / How to do it...
- ML library
- used, for creating machine learning pipelines / Creating machine learning pipelines using ML, Getting ready, How to do it…
- MovieLens dataset
- URL / Introduction
- multigraph
- about / Introduction
- multivariate analysis
- about / Introduction
N
- Naïve Bayes
- classification, performing / Doing classification with Naïve Bayes, How to do it…
- Naïve Bayes assumption
- Naïve Bayes classifier
- negative correlation
- about / Calculating correlation
- neighborhood aggregation
- performing / Performing neighborhood aggregation, How to do it…
- null hypothesis
- about / Doing hypothesis testing
O
- old collection
- about / Optimizing memory
- ordinary least squares (OLS)
- about / Doing linear regression with lasso
- prediction accuracy / Doing linear regression with lasso
- interpretation / Doing linear regression with lasso
- OutputFormat storage format
- about / Introduction
- overfitting
- about / How it works…
P
- PageRank
- about / Using PageRank
- using / Using PageRank, Getting ready, How to do it…
- parallel edges
- about / Introduction
- Parquet format
- partitioned log
- about / Streaming using Kafka
- performance improvement
- with compression / Using compression to improve performance
- with serialization / Using serialization to improve performance
- plain old Java objects (POJOs)
- positive correlation
- about / Calculating correlation
- principal component analysis (PCA)
- producers
- about / Streaming using Kafka
- projection error
- project Tungsten
- about / Understanding the future of optimization – project Tungsten
- manual memory management / Manual memory management by leverage application semantics
- algorithms, using / Using algorithms and data structures
- data structures, using / Using algorithms and data structures
- code generation / Code generation
Q
- Quasi quotes
- about / Code generation
R
- Random Forests
- classification, performing / Doing classification using Random Forests, Getting ready, How to do it…, How it works…
- RDD
- about / Introduction
- wordcount example / Introduction
- recommender systems
- about / Introduction
- regression
- about / Introduction
- relational databases
- resilient distributed property graph
- about / Introduction
- ridge regression
- about / Doing ridge regression
- performing / Doing ridge regression, How to do it…
- Root Mean Square Error (RMSE)
S
- s3://
- about / How to do it...
- s3n://
- about / How to do it...
- SBT
- about / Developing Spark applications in Eclipse with SBT
- Spark application, developing in Eclipse / Developing Spark applications in Eclipse with SBT, How to do it...
- Spark application, developing in IntelliJ IDEA / Developing a Spark application in IntelliJ IDEA with SBT, How to do it...
- sbt-assembly plugin
- merge strategies / Merge strategies in sbt-assembly
- schema
- inferring, case classes used / Inferring schema using case classes, How to do it...
- programmatically specifying / Programmatically specifying the schema, How to do it..., How it works…
- SchemaRDD
- about / Introduction
- secure shell protocol (SSH)
- about / How to do it...
- serialization
- used, for performance improvement / Using serialization to improve performance
- sigmoid function
- singular value decomposition (SVD)
- sliding window, parameters
- window length / Introduction
- sliding interval / Introduction
- Snappy
- Spark
- about / Introduction
- ecosystem / Introduction
- URL / Installing Spark from binaries
- installing, from binaries / Getting ready, How to do it...
- source code, building with Maven / Building the Spark source code with Maven, How to do it...
- launching, on Amazon EC2 / Launching Spark on Amazon EC2, Getting ready, How to do it...
- deploying, on cluster in standalone mode / Deploying on a cluster in standalone mode, How to do it..., How it works...
- deploying, on cluster with Mesos / Deploying on a cluster with Mesos, How to do it...
- deploying, on cluster with YARN / Deploying on a cluster with YARN, How to do it..., How it works…
- spark-ec2 script
- about / Getting ready
- Spark 1.3 version
- URL / How to do it...
- Spark application
- developing, in Eclipse with Maven / Developing Spark applications in Eclipse with Maven, How to do it...
- developing, in Eclipse with SBT / Developing Spark applications in Eclipse with SBT, How to do it...
- developing, in IntelliJ IDEA with Maven / Developing a Spark application in IntelliJ IDEA with Maven, How to do it...
- developing, in IntelliJ IDEA with SBT / Developing a Spark application in IntelliJ IDEA with SBT, How to do it...
- Spark master
- about / How it works...
- Spark RDD
- about / Using Tachyon as an off-heap storage layer
- challenges / Using Tachyon as an off-heap storage layer
- Spark shell
- exploring / Exploring the Spark shell, How to do it...
- Spark SQL
- about / Introduction
- squared error function
- about / Understanding cost function
- Standalone mode
- about / Introduction
- reference link / See also
- standalone mode
- Spark, deploying / Deploying on a cluster in standalone mode, How to do it..., How it works...
- start-all.sh script
- about / How to do it...
- start-master.sh script
- about / How to do it...
- start-slaves.sh script
- about / How to do it...
- stop-all.sh script
- about / How to do it...
- stop-master.sh script
- about / How to do it...
- stop-slaves.sh script
- about / How to do it...
- Streaming
- about / Introduction
- used, for word count / Word count using Streaming, How to do it...
- with Kafka / Streaming using Kafka, How to do it..., There's more…
- subgraph
- about / Finding connected components
- summary statistics
- about / Calculating summary statistics
- calculating / Calculating summary statistics, How to do it…
- supervised learning
- about / Introduction, Introduction
- regression / Introduction
- classification / Introduction
- example / Introduction
- support vector machines (SVM)
- about / Introduction
- support vectors
- SVM
- binary classification, performing / Doing binary classification using SVM, How to do it…
T
- Tachyon
- about / Introduction
- using, as off-heap storage layer / Using Tachyon as an off-heap storage layer, How to do it...
- reference link / See also
- text classification
- topics
- about / Streaming using Kafka
- training data
- Twitter data
- live streaming / Streaming Twitter data, How to do it...
U
- unsupervised learning
- about / Introduction
- use case, clustering
- market segmentation / Clustering using k-means
- social network analysis / Clustering using k-means
- data center computing clusters / Clustering using k-means
- astronomical data analysis / Clustering using k-means
- real estate / Clustering using k-means
- text analysis / Clustering using k-means
V
- variance
- versus bias / Doing linear regression with lasso
- about / Doing linear regression with lasso
- vectors
- creating / Creating vectors, How it works...
W
- Wikipedia page link data
- URL / Getting ready
- word count
- with Streaming / Word count using Streaming, How to do it...
- worker
- about / How it works...
Y
- YARN
- about / Introduction, Deploying on a cluster with YARN
- Spark, deploying on cluster / Deploying on a cluster with YARN, How to do it..., How it works…
- yarn-client mode / How it works…
- yarn-cluster mode / How it works…
- configuration parameters / How it works…
- young collection
- about / Optimizing memory
Z
- z density of house
- about / Getting ready