Index
A
- ANN
- about / K-Means in practice, ANN – Artificial Neural Networks
- theory / Theory
- Spark server, sparkling / Building the Spark server
- using / ANN in practice
- account management, Databricks
- about / Account management
- Amazon AWS
- URL / Amazon EC2
- pricing, URL / Amazon EC2
- Amazon EC2
- about / Amazon EC2
- URL / Amazon EC2
- Amazon Elastic Compute Cloud (EC2) / Installing Databricks
- Apache Giraph / Overview
- Apache Kafka / Kafka
- Apache Mesos / Apache Mesos
- Apache Spark
- overview / Overview
- URL / Overview, Overview, Further reading
- Spark Machine Learning / Spark Machine Learning
- stream processing / Spark Streaming
- SQL module / Spark SQL
- graph processing / Spark graph processing
- extended eco system / Extended ecosystem
- future / The future of Spark
- cluster design / Cluster design
- cluster management / Cluster management
- performance, examining / Performance
- SQL context / The SQL context
- used, for accessing HBase / Accessing HBase with Spark
- used, for accessing Cassandra / Accessing Cassandra with Spark
- Titan, accessing with / Accessing Titan with Spark
- Apache Spark streaming
- overview / Overview
- URL / Overview
- errors / Errors and recovery
- recovery / Errors and recovery
- HDFS-based checkpoint, setting up / Checkpointing
- data sources / Streaming sources
- Apache YARN / Apache YARN
- architecture, H2O / Architecture
- Artificial Neural Net (ANN) / Sourcing the data
- AWS
- URL / Installing Databricks
- AWS billing / AWS billing
B
- BaseConfiguration method / Alternative Groovy configuration
- Bruce Penn
- URL / The Hadoop file system
C
- Cassandra
- Titan, accessing with / Titan with Cassandra
- installing / Installing Cassandra
- accessing, with Apache Spark / Accessing Cassandra with Spark
- classification, with Naïve Bayes
- about / Classification with Naïve Bayes, Naïve Bayes in practice
- theory / Theory
- closeness centrality algorithm / The closeness centrality algorithm
- Cloudera
- cluster design, Apache Spark / Cluster design
- clustering, with K-Means
- about / Clustering with K-Means
- theory / Theory
- cluster management
- about / Cluster management
- local mode / Local
- standalone mode / Standalone
- Apache YARN / Apache YARN
- Apache Mesos / Apache Mesos
- Amazon EC2 / Amazon EC2
- cluster management, Databricks
- about / Cluster management
- connected components algorithm / The connected components algorithm
D
- dashboards / Overview
- data
- importing / Importing and saving data
- saving / Importing and saving data
- text files, processing / Processing the Text files
- JSON files, processing / Processing the JSON files
- Parquet files, processing / Processing the Parquet files
- sourcing / Sourcing the data
- quality / Data Quality
- moving / Moving data
- table data, importing / The table data
- folder, importing / Folder import
- library, importing / Library import
- databases / Overview
- Databricks
- URL / The future of Spark, Amazon EC2, Cloud, Overview, Further reading
- overview / Overview
- installing / Installing Databricks
- AWS billing / AWS billing
- menu / Databricks menus
- account management / Account management
- cluster management / Cluster management
- Notebooks / Notebooks and folders
- folder / Notebooks and folders
- jobs / Jobs and libraries
- libraries / Jobs and libraries
- references / Further reading
- Databricks file system (DBFS) / The table data
- Databricks tables
- about / Databricks tables
- creating, via data import / Data import
- external tables / External tables
- DataFrames
- about / DataFrames
- data sources, Apache Spark streaming
- Kafka / Overview
- Flume / Overview, Flume
- HDFS / Overview
- about / Streaming sources
- TCP stream / TCP stream
- file streams / File streams
- Apache Kafka / Kafka
- DataStax Spark Cassandra connector / The Spark Cassandra connector
- data visualization
- about / Data visualization
- dashboards / Dashboards
- RDD-based report / An RDD-based report
- stream-based report / A stream-based report
- DBFS
- accessing / Databricks file system
- dbutils.fs class
- about / External tables
- dbutils package
- about / The DbUtils package
- DBFS / The DbUtils package
- fsutils group / Dbutils fsutils
- cache functionality / The DbUtils cache
- mount functionality / The DbUtils mount
- deep learning
- about / Deep learning
- URL / Deep learning
- Scala-based H2O Sparkling Water example / Example code – income
- MNIST / The example code – MNIST
- development environments, Databricks
- about / Development environments
- discretized stream (DStream) / Overview
- Docker
- URL / Installing Docker
- installing / Installing Docker
E
- end of file markers (EOF) / Using Cassandra
- environment, H2O
- processing / The processing environment
- environment configuration, MLlib
- architecture / Architecture
- development environment / The development environment
- Spark, installing / Installing Spark
- Extract, Transform, Load (ETL)
- about / Architecture
F
- False Positive Rate (FPR) / H2O Flow
- Flume / Flume
- folder / Notebooks and folders
G
- graph, creating
- counting example / Example 1 – counting
- filtering example / Example 2 – filtering
- PageRank algorithm / Example 3 – PageRank
- triangle counting / Example 4 – triangle counting
- connected components / Example 5 – connected components
- GraphInputFormat class / Using HBase
- graph processing, Apache Spark / Spark graph processing
- GraphX
- overview / Overview
- coding / GraphX coding
- GraphX coding
- about / GraphX coding
- environment / Environment
- graph, creating / Creating a graph
- Gremlin language / TinkerPop
H
- H2O
- overview / Overview
- environment, processing / The processing environment
- system versions, URL / The processing environment
- installing / Installing H2O
- Sparkling Water download option, URL / Installing H2O
- build environment / The build environment
- architecture / Architecture
- URL / Architecture
- performance tuning / Performance tuning
- H2O Flow / H2O Flow
- Hadoop / The development environment
- Hadoop file system / The Hadoop file system
- Hadoop Gremlin / TinkerPop's Hadoop Gremlin
- HBase
- Titan, accessing with / Titan with HBase
- accessing, with Apache Spark / Accessing HBase with Spark
- head function / Dbutils fsutils
- Hernan Amiune
- URL / Theory
- Hive
- using / Using Hive
- local Metastore server / Local Hive Metastore server
- Hive-based Metastore server / A Hive-based Metastore server
- Hive-based Metastore server
- using / A Hive-based Metastore server
J
- JavaScript Object Notation (JSON) files
- processing / Processing the JSON files
- jobs
- about / Jobs and libraries
K
- K-Means
- clustering / Clustering with K-Means
- using / K-Means in practice
L
- LabeledPoint
- URL / Naïve Bayes in practice
- libraries
- about / Jobs and libraries
- local Hive Metastore server
- using / Local Hive Metastore server
M
- markdown
- URL / Notebooks and folders
- Mazerunner, for Neo4j
- about / Mazerunner for Neo4j
- Docker, installing / Installing Docker
- Neo4j browser / The Neo4j browser
- algorithms / The Mazerunner algorithms
- Mazerunner algorithms
- about / The Mazerunner algorithms
- PageRank algorithm / The PageRank algorithm
- closeness centrality algorithm / The closeness centrality algorithm
- triangle count algorithm / The triangle count algorithm
- connected components algorithm / The connected components algorithm
- strongly connected components algorithm / The strongly connected components algorithm
- MLlib
- environment configuration / The environment configuration
- MNIST
- URL / Sourcing the data
- about / The example code – MNIST
N
- Naïve Bayes
- classification / Classification with Naïve Bayes
- using / Naïve Bayes in practice
- URL / Naïve Bayes in practice
- Neo4j browser
- about / The Neo4j browser
- URL / The Neo4j browser
- Notebook / Notebooks and folders
O
P
- P (Spam|Buy) / Theory
- PageRank algorithm
- about / The PageRank algorithm
- Parquet files
- about / Importing and saving data
- processing / Processing the Parquet files
- performance
- examining / Performance
- cluster structure / The cluster structure
- Hadoop file system / The Hadoop file system
- data locality / Data locality
- OOM (Out of Memory) messages, avoiding / Memory
- code, tuning / Coding
- PostgreSQL connector library
- URL, for download / A Hive-based Metastore server
- PredictionIO
- URL / Cloud
R
- remove function (rm) / Dbutils fsutils
- REST interface
- about / REST interface
- configuration / Configuration
- cluster management / Cluster management
- execution context / The execution context
- command execution / Command execution
- libraries / Libraries
S
- SeldonIO
- URL / Cloud
- Sister property / Overview
- Sparkling Water component, H2O
- Spark Machine Learning / Spark Machine Learning
- SparkOnHBase module
- URL / Spark on HBase
- Spark SQL / Spark SQL
- SQL
- using / Using SQL
- SQL context
- about / The SQL context
- streaming, Apache Spark / Spark Streaming
- stream processing / Spark Streaming
- strongly connected components algorithm / The strongly connected components algorithm
T
- tertiary education / Data visualization
- textFile method / Processing the Text files
- text files
- processing / Processing the Text files
- TinkerPop / TinkerPop
- Titan
- about / Titan
- URL / Titan, Installing Titan
- installing / Installing Titan
- accessing, with HBase / Titan with HBase
- accessing, with Cassandra / Titan with Cassandra
- accessing, with Apache Spark / Accessing Titan with Spark
- Titan, accessing with Apache Spark
- about / Accessing Titan with Spark
- Gremlin shell / Gremlin and Groovy
- Groovy commands, executing / Gremlin and Groovy
- TinkerPop Hadoop Gremlin package / TinkerPop's Hadoop Gremlin
- alternative Groovy configuration / Alternative Groovy configuration
- Cassandra, using / Using Cassandra
- HBase, using / Using HBase
- file system, using / Using the filesystem
- Titan, accessing with Cassandra
- about / Titan with Cassandra
- Cassandra, installing / Installing Cassandra
- Gremlin Cassandra script / The Gremlin Cassandra script
- Spark Cassandra connector / The Spark Cassandra connector
- Titan, accessing with HBase
- about / Titan with HBase
- HBase cluster, using / The HBase cluster
- Gremlin HBase script / The Gremlin HBase script
- SparkOnHBase module, using / Spark on HBase
- TitanFactory.open method / Using Cassandra
- triangle count algorithm
- about / The triangle count algorithm
- True Positive Rate (TPR) / H2O Flow
- Twitter
- URL / A stream-based report
U
- user-defined functions (UDFs)
- about / User-defined functions