Index
A
- ANN
- about / K-Means in practice, ANN – Artificial Neural Networks
- theory / Theory
- Spark server, sparkling / Building the Spark server
- using / ANN in practice
- account management, Databricks
- about / Account management
- Amazon AWS
- URL / Amazon EC2
- pricing, URL / Amazon EC2
- Amazon EC2
- about / Amazon EC2
- URL / Amazon EC2
- Amazon Elastic Compute Cloud (EC2) / Installing Databricks
- Apache Giraph / Overview
- Apache Kafka / Kafka
- Apache Mesos / Apache Mesos
- Apache Spark
- overview / Overview
- URL / Overview, Overview, Further reading
- Spark Machine Learning / Spark Machine Learning
- stream processing / Spark Streaming
- SQL module / Spark SQL
- graph processing / Spark graph processing
- extended eco system / Extended ecosystem
- future / The future of Spark
- cluster design / Cluster design
- cluster management / Cluster management
- performance, examining / Performance
- SQL context / The SQL context
- used, for accessing HBase / Accessing HBase with Spark
- used, for accessing Cassandra / Accessing Cassandra with Spark
- Titan, accessing with / Accessing Titan with Spark
- Apache Spark streaming
- overview / Overview
- URL / Overview
- errors / Errors and recovery
- recovery / Errors and recovery
- HDFS-based checkpoint, setting up / Checkpointing
- data sources / Streaming sources
- Apache YARN / Apache YARN
- architecture, H2O / Architecture
- Artificial Neural Net (ANN) / Sourcing the data
- AWS
- URL / Installing Databricks
- AWS billing / AWS billing
B
- BaseConfiguration method / Alternative Groovy configuration
- Bruce Penn
- URL / The Hadoop file system
C
- Cassandra
- Titan, accessing with / Titan with Cassandra
- installing / Installing Cassandra
- accessing, with Apache Spark / Accessing Cassandra with Spark
- classification, with Naïve Bayes
- about / Classification with Naïve Bayes, Naïve Bayes in practice
- theory / Theory
- closeness centrality algorithm / The closeness centrality algorithm
- Cloudera
- cluster design, Apache Spark / Cluster design
- clustering, with K-Means
- about / Clustering with K-Means
- theory / Theory
- cluster management
- about / Cluster management
- local mode / Local
- standalone mode / Standalone
- Apache YARN / Apache YARN
- Apache Mesos / Apache Mesos
- Amazon EC2 / Amazon EC2
- cluster management, Databricks
- about / Cluster management
- connected components algorithm / The connected components algorithm
D
- dashboards / Overview
- data
- importing / Importing and saving data
- saving / Importing and saving data
- text files, processing / Processing the Text files
- JSON files, processing / Processing the JSON files
- Parquet files, processing / Processing the Parquet files
- sourcing / Sourcing the data
- quality / Data Quality
- moving / Moving data
- table data, importing / The table data
- folder, importing / Folder import
- library, importing / Library import
- databases / Overview
- Databricks
- URL / The future of Spark, Amazon EC2, Cloud, Overview, Further reading
- overview / Overview
- installing / Installing Databricks
- AWS billing / AWS billing
- menu / Databricks menus
- account management / Account management
- cluster management / Cluster management
- Notebooks / Notebooks and folders
- folder / Notebooks and folders
- jobs / Jobs and libraries
- libraries / Jobs and libraries
- references / Further reading
- Databricks file system (DBFS) / The table data
- Databricks tables
- about / Databricks tables
- creating, via data import / Data import
- external tables / External tables
- DataFrames
- about / DataFrames
- data sources, Apache Spark streaming
- Kafka / Overview
- Flume / Overview, Flume
- HDFS / Overview
- about / Streaming sources
- TCP stream / TCP stream
- file streams / File streams
- Apache Kafka / Kafka
- DataStax Spark Cassandra connector / The Spark Cassandra connector
- data visualization
- about / Data visualization
- dashboards / Dashboards
- RDD-based report / An RDD-based report
- stream-based report / A stream-based report
- DBFS
- accessing / Databricks file system
- dbutils.fs class
- about / External tables
- dbutils package
- about / The DbUtils package
- DBFS / The DbUtils package
- fsutils group / Dbutils fsutils
- cache functionality / The DbUtils cache
- mount functionality / The DbUtils mount
- deep learning
- about / Deep learning
- URL / Deep learning
- Scala-based H2O Sparkling Water example / Example code – income
- MNIST / The example code – MNIST
- development environments, Databricks
- about / Development environments
- discretized stream (DStream) / Overview
- Docker
- URL / Installing Docker
- installing / Installing Docker
E
- end of file markers (EOF) / Using Cassandra
- environment, H2O
- processing / The processing environment
- environment configuration, MLlib
- architecture / Architecture
- development environment / The development environment
- Spark, installing / Installing Spark
- Extract, Transform, Load (ETL)
- about / Architecture
F
- False Positive Rate (FPR) / H2O Flow
- Flume / Flume
- folder / Notebooks and folders
G
- graph, creating
- counting example / Example 1 – counting
- filtering example / Example 2 – filtering
- PageRank algorithm / Example 3 – PageRank
- triangle counting / Example 4 – triangle counting
- connected components / Example 5 – connected components
- GraphInputFormat class / Using HBase
- graph processing, Apache Spark / Spark graph processing
- GraphX
- overview / Overview
- coding / GraphX coding
- GraphX coding
- about / GraphX coding
- environment / Environment
- graph, creating / Creating a graph
- Gremlin language / TinkerPop
H
- H2O
- overview / Overview
- environment, processing / The processing environment
- system versions, URL / The processing environment
- installing / Installing H2O
- Sparkling Water download option, URL / Installing H2O
- build environment / The build environment
- architecture / Architecture
- URL / Architecture
- performance tuning / Performance tuning
- H2O Flow / H2O Flow
- Hadoop / The development environment
- Hadoop file system / The Hadoop file system
- Hadoop Gremlin / TinkerPop's Hadoop Gremlin
- HBase
- Titan, accessing with / Titan with HBase
- accessing, with Apache Spark / Accessing HBase with Spark
- head function / Dbutils fsutils
- Hernan Amiune
- URL / Theory
- Hive
- using / Using Hive
- local Metastore server / Local Hive Metastore server
- Hive-based Metastore server / A Hive-based Metastore server
- Hive-based Metastore server
- using / A Hive-based Metastore server
J
- JavaScript Object Notation (JSON) files
- processing / Processing the JSON files
- jobs
- about / Jobs and libraries
K
- K-Means
- clustering / Clustering with K-Means
- using / K-Means in practice
L
- LabeledPoint
- URL / Naïve Bayes in practice
- libraries
- about / Jobs and libraries
- local Hive Metastore server
- using / Local Hive Metastore server
M
- markdown
- URL / Notebooks and folders
- Mazerunner, for Neo4j
- about / Mazerunner for Neo4j
- Docker, installing / Installing Docker
- Neo4j browser / The Neo4j browser
- algorithms / The Mazerunner algorithms
- Mazerunner algorithms
- about / The Mazerunner algorithms
- PageRank algorithm / The PageRank algorithm
- closeness centrality algorithm / The closeness centrality algorithm
- triangle count algorithm / The triangle count algorithm
- connected components algorithm / The connected components algorithm
- strongly connected components algorithm / The strongly connected components algorithm
- MLlib
- environment configuration / The environment configuration
- MNIST
- URL / Sourcing the data
- about / The example code – MNIST
N
- Naïve Bayes
- classification / Classification with Naïve Bayes
- using / Naïve Bayes in practice
- URL / Naïve Bayes in practice
- Neo4j browser
- about / The Neo4j browser
- URL / The Neo4j browser
- Notebook / Notebooks and folders
O
P
- P (Spam|Buy) / Theory
- PageRank algorithm
- about / The PageRank algorithm
- Parquet files
- about / Importing and saving data
- processing / Processing the Parquet files
- performance
- examining / Performance
- cluster structure / The cluster structure
- Hadoop file system / The Hadoop file system
- data locality / Data locality
- OOM (Out of Memory) messages, avoiding / Memory
- code, tuning / Coding
- PostgreSQL connector library
- URL, for download / A Hive-based Metastore server
- PredictionIO
- URL / Cloud
R
- remove function (rm) / Dbutils fsutils
- REST interface
- about / REST interface
- configuration / Configuration
- cluster management / Cluster management
- execution context / The execution context
- command execution / Command execution
- libraries / Libraries
S
- SeldonIO
- URL / Cloud
- Sister property / Overview
- Sparkling Water component, H2O
- Spark Machine Learning / Spark Machine Learning
- SparkOnHBase module
- URL / Spark on HBase
- Spark SQL / Spark SQL
- SQL
- using / Using SQL
- SQL context
- about / The SQL context
- streaming, Apache Spark / Spark Streaming
- stream processing / Spark Streaming
- strongly connected components algorithm / The strongly connected components algorithm
T
- tertiary education / Data visualization
- textFile method / Processing the Text files
- text files
- processing / Processing the Text files
- TinkerPop / TinkerPop
- Titan
- about / Titan
- URL / Titan, Installing Titan
- installing / Installing Titan
- accessing, with HBase / Titan with HBase
- accessing, with Cassandra / Titan with Cassandra
- accessing, with Apache Spark / Accessing Titan with Spark
- Titan, accessing with Apache Spark
- about / Accessing Titan with Spark
- Gremlin shell / Gremlin and Groovy
- Groovy commands, executing / Gremlin and Groovy
- TinkerPop Hadoop Gremlin package / TinkerPop's Hadoop Gremlin
- alternative Groovy configuration / Alternative Groovy configuration
- Cassandra, using / Using Cassandra
- HBase, using / Using HBase
- file system, using / Using the filesystem
- Titan, accessing with Cassandra
- about / Titan with Cassandra
- Cassandra, installing / Installing Cassandra
- Gremlin Cassandra script / The Gremlin Cassandra script
- Spark Cassandra connector / The Spark Cassandra connector
- Titan, accessing with HBase
- about / Titan with HBase
- HBase cluster, using / The HBase cluster
- Gremlin HBase script / The Gremlin HBase script
- SparkOnHBase module, using / Spark on HBase
- TitanFactory.open method / Using Cassandra
- triangle count algorithm
- about / The triangle count algorithm
- True Positive Rate (TPR) / H2O Flow
- Twitter
- URL / A stream-based report
U
- user-defined functions (UDFs)
- about / User-defined functions