
Apache Spark 2.x Machine Learning Cookbook

By: Mohammed Guller, Siamak Amirghodsi, Shuen Mei, Meenakshi Rajendran, Broderick Hall

Overview of this book

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning about algorithms enables a wide range of applications, from everyday tasks such as product recommendations and spam filtering to cutting-edge applications such as self-driving cars and personalized medicine. You will gain hands-on experience applying these principles using Apache Spark, a resilient cluster computing system well suited for large-scale machine learning tasks. This book begins with a quick overview of setting up the IDEs needed to run the code examples covered in the various chapters. It also highlights some key issues developers face while working with machine learning algorithms on the Spark platform. We then work through the various Spark APIs and implement ML algorithms to develop classification systems, recommendation engines, text analytics, clustering, and learning systems. In the final chapters, we’ll focus on building high-end applications and explain various unsupervised methodologies and the challenges to tackle when implementing big data ML systems.
Table of Contents (20 chapters)
Title Page
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface

Introduction


Understanding how optimization works is fundamental for a successful career in machine learning. We picked the Gradient Descent (GD) method for an end-to-end deep dive to demonstrate the inner workings of an optimization technique. We will develop the concept over three recipes that take the developer from a from-scratch implementation to fully developed code that solves an actual problem with real-world data. The fourth recipe explores an alternative to GD, using Spark and the normal equations to solve a regression problem; this approach has limited scalability for big data problems.
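To make the contrast between the two approaches concrete before the recipes begin, here is a minimal sketch in plain Python. It is not the book's Spark code: the function names, the tiny data set, the learning rate, and the iteration count are all assumptions chosen for illustration. The first function fits a one-feature linear model by batch gradient descent on the mean squared error; the second solves the same least-squares problem in closed form, which is what the normal equations reduce to when there is a single feature plus an intercept.

```python
# Illustrative sketch only; names, data, and hyperparameters are assumptions,
# not taken from the book's recipes.

def gradient_descent_fit(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Prediction error for every training point under the current weights.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def normal_equation_fit(xs, ys):
    """Closed-form least squares for one feature plus an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Hypothetical toy data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

gd_w, gd_b = gradient_descent_fit(xs, ys)
ne_w, ne_b = normal_equation_fit(xs, ys)
print("gradient descent:", gd_w, gd_b)
print("normal equation: ", ne_w, ne_b)
```

On this toy data both routes should recover essentially the same slope and intercept. The iterative version only ever needs sums over the data, which is why it distributes well, whereas the general normal-equation solution requires forming and solving a system whose size grows with the number of features.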

Let's get started. How does a machine learn anyway? Does it really learn from its mistakes? What does it mean when the machine finds a solution using optimization?

At a high level, machines learn based on one of the following five techniques:

  • Error-based learning: In this technique, we search the domain space for a combination of parameter values (weights) that minimizes the total error (predicted versus actual) over the training data.
  • Information...