Large Scale Machine Learning with Spark

By: Md. Rezaul Karim, Md. Mahedi Kaysar

Overview of this book

Data processing, implementing related algorithms, tuning, scaling up, and finally deploying are some of the crucial steps in optimising any application. Spark can handle large-scale batch and streaming data, figure out when to cache data in memory, and process it up to 100 times faster than Hadoop-based MapReduce. This means predictive analytics can be applied to both streaming and batch data to develop complete machine learning (ML) applications much more quickly, making Spark an ideal candidate for large data-intensive applications. This book focuses on design engineering and scalable solutions using ML with Spark. First, you will learn how to install Spark with all the new features from the latest Spark 2.0 release. Moving on, you'll explore important concepts such as advanced feature engineering with RDDs and Datasets. After studying how to develop and deploy applications, you will see how to use external libraries with Spark. In summary, you will be able to develop complete and personalised ML applications, from data collection, model building, tuning, and scaling up to deploying on a cluster or the cloud.

Credits

About the Authors

About the Reviewer

www.Packtpub.com

Preface

Introduction to Data Analytics with Spark

Spark overview

New computing paradigm with Spark

Spark ecosystem

Spark machine learning libraries

Installing and getting started with Spark

Packaging your application with dependencies

Running a sample machine learning application

References

Summary

Machine Learning Best Practices

What is machine learning?

Machine learning tasks

Practical machine learning problems

Most widely used machine learning problems

Large scale machine learning APIs in Spark

Practical machine learning best practices

Choosing the right algorithm for your application

Summary

Understanding the Problem by Understanding the Data

Analyzing and preparing your data

Resilient Distributed Dataset basics

Dataset basics

Dataset from string and typed class

Spark and data scientists workflow

Deeper into Spark

Summary

Extracting Knowledge through Feature Engineering

The state of the art of feature engineering

Best practices in feature engineering

Feature engineering with Spark

Advanced feature engineering

Summary

Supervised and Unsupervised Learning by Examples

Machine learning classes

Supervised learning with Spark - an example

Unsupervised learning

Recommender system

Advanced learning and generalizations

Summary

Building Scalable Machine Learning Pipelines

Spark machine learning pipeline APIs

Cancer-diagnosis pipeline with Spark

Cancer-prognosis pipeline with Spark

Market basket analysis with Spark Core

OCR pipeline with Spark

Topic modeling using Spark MLlib and ML

Credit risk analysis pipeline with Spark

Scaling the ML pipelines

Tips and performance considerations

Summary

Tuning Machine Learning Models

Details about machine learning model tuning

Typical challenges in model tuning

Evaluating machine learning models

Validation and evaluation techniques

Parameter tuning for machine learning models

Hypothesis testing

Machine learning model selection

Summary

Adapting Your Machine Learning Models

Adapting machine learning models

The generalization of ML models

Adapting through incremental algorithms

Adapting through reusing ML models

Machine learning in dynamic environments

Summary

Advanced Machine Learning with Streaming and Graph Data

Developing real-time ML pipelines

Time series and social network analysis

Movie recommendation using Spark

Developing a real-time ML pipeline from streaming

ML pipeline on graph data and semi-supervised graph-based learning

Summary

Configuring and Working with External Libraries

Third-party ML libraries with Spark

Using external libraries with Spark Core

Time series analysis using the Cloudera Spark-TS package

Configuring SparkR with RStudio

Configuring Hadoop run-time on Windows

Summary

Scaling the ML pipelines

Data mining and machine learning algorithms pose significant challenges for parallel and distributed computing platforms. Furthermore, parallelizing machine learning algorithms is highly task-specific and often depends on the preceding questions. In Chapter 1, Introduction to Data Analytics with Spark, we discussed and showed how to deploy the same machine learning application on top of a cluster or cloud computing infrastructure (that is, Amazon AWS/EC2).

Following that method, we can handle datasets with enormous batch sizes or in real time. In addition, scaling up machine learning applications involves further trade-offs, such as cost, complexity, run time, and technical requirements. Furthermore, making task-appropriate algorithm and platform choices for large-scale machine learning requires an understanding of the benefits, trade-offs, and constraints of the available options.
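The run-time versus cost trade-off mentioned above can be made concrete with a rough back-of-the-envelope model. The sketch below is not taken from the book: it assumes that speedup follows Amdahl's law, and the function names, the parallel fraction, the single-node run time, and the per-node hourly price are all hypothetical values chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    """Speedup predicted by Amdahl's law for a job spread over `nodes` workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


def estimate_run(single_node_hours: float, parallel_fraction: float,
                 nodes: int, price_per_node_hour: float) -> tuple:
    """Return (run time in hours, total cost) for a cluster of `nodes` workers."""
    hours = single_node_hours / amdahl_speedup(parallel_fraction, nodes)
    cost = hours * nodes * price_per_node_hour
    return hours, cost


if __name__ == "__main__":
    # Hypothetical workload: 10 hours on one node, 90% parallelizable,
    # 0.50 currency units per node-hour. None of these figures come from the book.
    for nodes in (1, 2, 4, 8, 16):
        hours, cost = estimate_run(10.0, 0.9, nodes, 0.50)
        print(f"{nodes:2d} nodes: {hours:5.2f} h, cost {cost:6.2f}")
```

Under these assumed numbers, run time keeps falling as nodes are added while total cost rises, because the serial portion of the job does not shrink. A real Spark job will also be affected by factors this model ignores, such as shuffle and scheduling overhead.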

To handle these issues, in this section, we will provide some theoretical...
