Apache Spark 2: Data Processing and Real-Time Analytics

By: Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei

Overview of this book

Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionality, including big data processing, analytics, and machine learning. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to extend Spark's functionality and build your own data flow and machine learning programs on the platform. You will work with the different modules in Apache Spark: interactive querying with Spark SQL, DataFrames and Datasets, streaming analytics with Spark Streaming, and machine learning and deep learning on Spark using MLlib and various external tools. By the end of this Learning Path, you will be able to build your own big data processing and analytics pipelines with Apache Spark.

This Learning Path includes content from the following Packt products:

• Mastering Apache Spark 2.x by Romeo Kienzler
• Scala and Spark for Big Data Analytics by Md. Rezaul Karim, Sridhar Alla
• Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei

Title Page

Copyright

About Packt

Contributors

Preface

A First Taste and What's New in Apache Spark V2

Spark machine learning

Spark Streaming

Spark graph processing

Extended ecosystem

What's new in Apache Spark V2?

Cluster management

Cloud-based deployments

Apache Spark Streaming

Errors and recovery

Streaming sources

Structured Streaming

The concept of continuous applications

Increased performance with good old friends

How transparent fault tolerance and exactly-once delivery guarantee is achieved

Example - connection to a MQTT message broker

Apache Spark MLlib

Classification with Naive Bayes

Clustering with K-Means

Artificial neural networks

Apache SparkML

What does the new API look like?

The concept of pipelines

Model evaluation

CrossValidation and hyperparameter tuning

Winning a Kaggle competition with Apache SparkML

Apache SystemML

Why do we need just another library?

A cost-based optimizer for machine learning algorithms

Performance measurements

Apache SystemML in action

Apache Spark GraphX

Graph analytics/processing with GraphX

Spark Tuning

Monitoring Spark jobs

Spark configuration

Common mistakes in Spark app development

Optimization techniques

Testing and Debugging Spark

Testing in a distributed environment

Testing Spark applications

Debugging Spark applications

Practical Machine Learning with Spark Using Scala

Configuring IntelliJ to work with Spark and run Spark ML sample codes

Running a sample ML code from Spark

Identifying data sources for practical machine learning

Running your first program using Apache Spark 2.0 with the IntelliJ IDE

How to add graphics to your Spark program

Spark's Three Data Musketeers for Machine Learning - Perfect Together

Creating RDDs with Spark 2.0 using internal data sources

Creating RDDs with Spark 2.0 using external data sources

Transforming RDDs with Spark 2.0 using the filter() API

Transforming RDDs with the super useful flatMap() API

Transforming RDDs with set operation APIs

RDD transformation/aggregation with groupBy() and reduceByKey()

Transforming RDDs with the zip() API

Join transformation with paired key-value RDDs

Reduce and grouping transformation with paired key-value RDDs

Creating DataFrames from Scala data structures

Operating on DataFrames programmatically without SQL

Loading DataFrames and setup from an external source

Using DataFrames with standard SQL language - SparkSQL

Working with the Dataset API using a Scala Sequence

Creating and using Datasets from RDDs and back again

Working with JSON using the Dataset API and SQL together

Functional programming with the Dataset API using domain objects

Common Recipes for Implementing a Robust Machine Learning System

Spark's basic statistical API to help you build your own algorithms

ML pipelines for real-life machine learning applications

Normalizing data with Spark

Splitting data for training and testing

Common operations with the new Dataset API

Creating and using RDD versus DataFrame versus Dataset from a text file in Spark 2.0

LabeledPoint data structure for Spark ML

Getting access to Spark cluster in Spark 2.0

Getting access to Spark cluster pre-Spark 2.0

Getting access to SparkContext vis-a-vis SparkSession object in Spark 2.0

New model export and PMML markup in Spark 2.0

Regression model evaluation using Spark 2.0

Binary classification model evaluation using Spark 2.0

Multiclass classification model evaluation using Spark 2.0

Multilabel classification model evaluation using Spark 2.0

Using the Scala Breeze library to do graphics in Spark 2.0

Recommendation Engine that Scales with Spark

Setting up the required data for a scalable recommendation engine in Spark 2.0

Exploring the movies data details for the recommendation system in Spark 2.0

Exploring the ratings data details for the recommendation system in Spark 2.0

Building a scalable recommendation engine using collaborative filtering in Spark 2.0

Unsupervised Clustering with Apache Spark 2.0

Building a KMeans classifying system in Spark 2.0

Bisecting KMeans, the new kid on the block in Spark 2.0

Using Gaussian Mixture and Expectation Maximization (EM) in Spark to classify data

Classifying the vertices of a graph using Power Iteration Clustering (PIC) in Spark 2.0

Latent Dirichlet Allocation (LDA) to classify documents and text into topics

Streaming KMeans to classify data in near real-time

Implementing Text Analytics with Spark 2.0 ML Library

Doing term frequency with Spark - everything that counts

Displaying similar words with Spark using Word2Vec

Downloading a complete dump of Wikipedia for a real-life Spark ML project

Using Latent Semantic Analysis for text analytics with Spark 2.0

Topic modeling with Latent Dirichlet allocation in Spark 2.0

Spark Streaming and Machine Learning Library

Structured streaming for near real-time machine learning

Streaming DataFrames for real-time machine learning

Streaming Datasets for real-time machine learning

Streaming data and debugging with queueStream

Downloading and understanding the famous Iris data for unsupervised classification

Streaming KMeans for a real-time on-line classifier

Downloading wine quality data for streaming regression

Streaming linear regression for a real-time regression

Downloading Pima Diabetes data for supervised classification

Streaming logistic regression for an on-line classifier

Other Books You May Enjoy

Leave a review - let other readers know what you think

Index

Appendix 1. Other Books You May Enjoy

If you enjoyed this book, you may be interested in these other books by Packt:

Modern Scala Projects

Ilango Gurusamy

ISBN: 9781788624114

Create pipelines to extract data or analytics and visualizations
Automate your process pipeline with jobs that are reproducible
Extract intelligent data efficiently from large, disparate datasets
Automate the extraction, transformation, and loading of data
Develop tools that collate, model, and analyze data
Maintain the integrity of data as data flows become more complex
Develop tools that predict outcomes based on “pattern discovery”
Build really fast and accurate machine-learning models in Scala

Apache Spark Deep Learning Cookbook

Ahmed Sherif and Amrith Ravindra

ISBN: 9781788474221

Set up a fully functional Spark environment
Understand practical machine learning and deep learning concepts
Apply built-in machine learning libraries within Spark
Explore libraries that are compatible with TensorFlow and Keras
Explore NLP models such as Word2vec and TF-IDF on Spark
Organize dataframes for deep learning evaluation
Apply testing and training modeling to ensure accuracy
Access readily available code that may be reusable