Large Scale Machine Learning with Spark

By: Md. Rezaul Karim, Md. Mahedi Kaysar

Overview of this book

Data processing, implementing related algorithms, tuning, scaling up, and finally deploying are some of the crucial steps in optimising any application.

Spark is capable of handling large-scale batch and streaming data, figuring out when to cache data in memory, and processing it up to 100 times faster than Hadoop-based MapReduce. This means predictive analytics can be applied to streaming and batch data to develop complete machine learning (ML) applications much more quickly, making Spark an ideal candidate for large, data-intensive applications.

This book focuses on design engineering and scalable solutions using ML with Spark. First, you will learn how to install Spark with all the new features of the latest Spark 2.0 release. Moving on, you'll explore important concepts such as advanced feature engineering with RDDs and Datasets. After studying how to develop and deploy applications, you will see how to use external libraries with Spark.

In summary, you will be able to develop complete and personalised ML applications, from data collection, model building, tuning, and scaling up to deployment on a cluster or the cloud.

Configuring Hadoop run-time on Windows


If you are developing your machine learning application on Windows using Eclipse (as a Maven project, of course), you will probably face a problem, because Spark expects a Hadoop runtime environment to be available on Windows as well.

More specifically, suppose you are running a Spark project written in Java whose main class is JavaNaiveBayes_ML.java; you will then experience an I/O exception saying that:

16/10/04 11:59:52 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

Figure 1: I/O exception due to the missing Hadoop runtime

The reason is that Hadoop is, by default, developed for the Linux environment. If you are developing your Spark applications on the Windows platform, a bridge is required that provides a Hadoop runtime environment so that Spark can execute properly.
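To make the dependency concrete, here is a minimal sketch, not taken from the book, of how a Spark Java application is commonly pointed at a local Hadoop runtime on Windows. The class name WinutilsSetupExample and the C:\hadoop path are assumptions for illustration only, and the sketch presumes a winutils.exe binary has already been placed under C:\hadoop\bin:

import org.apache.spark.sql.SparkSession;

public class WinutilsSetupExample {

    public static void main(String[] args) {
        // Hypothetical path: assumes winutils.exe has already been downloaded
        // and copied to C:\hadoop\bin on this machine (adjust as needed).
        // Hadoop's Shell class reads the hadoop.home.dir system property
        // (or the HADOOP_HOME environment variable) to locate winutils.exe.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");

        // Create a local SparkSession; without the property above, this step
        // logs the "Could not locate executable null\bin\winutils.exe" error.
        SparkSession spark = SparkSession.builder()
                .appName("WinutilsSetupExample")
                .master("local[*]")
                .getOrCreate();

        System.out.println("Running Spark " + spark.version() + " on Windows");

        spark.stop();
    }
}

Equivalently, setting the HADOOP_HOME environment variable to the same folder before launching Eclipse has the same effect without touching the application code.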

Now, how do you get rid of this problem? The solution is...