Large Scale Machine Learning with Spark

By: Md. Rezaul Karim, Md. Mahedi Kaysar

Overview of this book

Data processing, implementing related algorithms, tuning, scaling up, and finally deploying are some of the crucial steps in optimising any application.

Spark is capable of handling large-scale batch and streaming data, figuring out when to cache data in memory, and processing it up to 100 times faster than Hadoop-based MapReduce. This means predictive analytics can be applied to streaming and batch data to develop complete machine learning (ML) applications much more quickly, making Spark an ideal candidate for large, data-intensive applications.

This book focuses on design engineering and scalable solutions using ML with Spark. First, you will learn how to install Spark with all the new features of the latest Spark 2.0 release. Moving on, you'll explore important concepts such as advanced feature engineering with RDDs and Datasets. After studying how to develop and deploy applications, you will see how to use external libraries with Spark.

In summary, you will be able to develop complete and personalised ML applications, from data collection, model building, tuning, and scaling up to deployment on a cluster or the cloud.

Configuring Hadoop run-time on Windows


If you are developing your machine learning application on Windows using Eclipse (as a Maven project, of course), you will probably face a problem, because Spark expects a Hadoop runtime environment to be available on Windows as well.

More specifically, suppose you are running a Spark project written in Java whose main class is JavaNaiveBayes_ML.java; you will then experience an I/O exception saying that:

16/10/04 11:59:52 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

Figure 1: I/O exception due to the missing Hadoop runtime

The reason is that Hadoop is, by default, developed for the Linux environment. If you are developing your Spark applications on the Windows platform, a bridge is required that provides a Hadoop runtime environment so that Spark can execute properly.
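To make the dependency concrete, here is a minimal sketch, not taken from the book, of how a Spark Java application is commonly pointed at a local Hadoop runtime on Windows. The class name WinutilsSetupExample and the C:\hadoop path are assumptions for illustration only, and the sketch presumes a winutils.exe binary has already been placed under C:\hadoop\bin:

import org.apache.spark.sql.SparkSession;

public class WinutilsSetupExample {

    public static void main(String[] args) {
        // Hypothetical path: assumes winutils.exe has already been downloaded
        // and copied to C:\hadoop\bin on this machine (adjust as needed).
        // Hadoop's Shell class reads the hadoop.home.dir system property
        // (or the HADOOP_HOME environment variable) to locate winutils.exe.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");

        // Create a local SparkSession; without the property above, this step
        // logs the "Could not locate executable null\bin\winutils.exe" error.
        SparkSession spark = SparkSession.builder()
                .appName("WinutilsSetupExample")
                .master("local[*]")
                .getOrCreate();

        System.out.println("Running Spark " + spark.version() + " on Windows");

        spark.stop();
    }
}

Equivalently, setting the HADOOP_HOME environment variable to the same folder before launching Eclipse has the same effect without touching the application code.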

Now, how do you get rid of this problem? The solution is...