Apache Spark 2.x Machine Learning Cookbook

By: Mohammed Guller, Siamak Amirghodsi, Shuen Mei, Meenakshi Rajendran, Broderick Hall

Overview of this book

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning these algorithms enables a wide range of applications, from everyday tasks such as product recommendations and spam filtering to cutting-edge applications such as self-driving cars and personalized medicine. You will gain hands-on experience in applying these principles using Apache Spark, a resilient cluster computing system well suited for large-scale machine learning tasks. This book begins with a quick overview of setting up the necessary IDEs to facilitate the execution of the code examples covered in the various chapters. It also highlights some key issues developers face while working with machine learning algorithms on the Spark platform. We then progress through the various Spark APIs, implementing ML algorithms to build classification systems, recommendation engines, text analytics, clustering, and learning systems. Toward the final chapters, we focus on building high-end applications and explain various unsupervised methodologies and the challenges of implementing big data ML systems.

Two methods of ingesting and preparing a CSV file for processing in Spark


In this recipe, we explore reading, parsing, and preparing a CSV file for a typical ML program. A comma-separated values (CSV) file normally stores tabular data (numbers and text) in a plain text file. In a typical CSV file, each row is a data record, and most of the time, the first row is the header row, which stores the field identifiers (more commonly referred to as column names). Each record consists of one or more fields, separated by commas.

How to do it...

  1. The sample CSV data file contains movie ratings. The file can be retrieved at http://files.grouplens.org/datasets/movielens/ml-latest-small.zip.
  2. Once the file is extracted, we will use the ratings.csv file for our CSV program to load the data into Spark (both ingestion methods are sketched after the sample data below). The CSV file will look like the following:

userId   movieId   rating   timestamp
1        16        4        1217897793
1        24        1.5      1217895807
1        32        4        1217896246
1        47        4        1217896556
1        50        4        1217896523
1        110       4        1217896150
1        150       3        1217895940
1        161       4        1217897864
1        165       3        1217897135...
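
The following is a minimal Scala sketch (not the book's exact listing) of the two ingestion methods: loading ratings.csv through the Spark 2.x DataFrame reader with header handling and schema inference, and reading the same file as a raw RDD of lines that is parsed manually. The local file path and the object name CsvIngest are illustrative assumptions; adjust them to your setup.

import org.apache.spark.sql.SparkSession

object CsvIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("CsvIngest")
      .getOrCreate()

    // Path to the extracted ratings.csv (assumed location; adjust as needed).
    val path = "../data/ml-latest-small/ratings.csv"

    // Method 1: the DataFrame reader parses the header row and infers column types.
    val ratingsDF = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)
    ratingsDF.printSchema()
    ratingsDF.show(5)

    // Method 2: read raw lines as an RDD, drop the header, and split each record on commas.
    val lines = spark.sparkContext.textFile(path)
    val header = lines.first()
    val ratingsRDD = lines
      .filter(_ != header)
      .map(_.split(","))
      .map(f => (f(0).toInt, f(1).toInt, f(2).toDouble, f(3).toLong))
    ratingsRDD.take(5).foreach(println)

    spark.stop()
  }
}

The DataFrame route is usually preferable for Spark ML pipelines because the inferred schema carries over to downstream transformers, while the RDD route gives full control over parsing when records need custom cleanup.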