
Machine Learning with Apache Spark Quick Start Guide

By: Jillur Quddus

Overview of this book

Every person and every organization in the world manages data, whether they realize it or not. Data is used to describe the world around us and can be put to almost any purpose, from analyzing consumer habits to fighting disease and serious organized crime. Ultimately, we manage data in order to derive value from it, and many organizations around the world have traditionally invested in technology to help process their data faster and more efficiently. But we now live in an interconnected world driven by mass data creation and consumption, where data is no longer rows and columns restricted to a spreadsheet, but an organic and evolving asset in its own right.

With this realization come major challenges for organizations: how do we manage the sheer volume of data being created every second (think not only spreadsheets and databases, but also social media posts, images, videos, music, blogs, and so on)? And once we can manage all of this data, how do we derive real value from it?

The focus of Machine Learning with Apache Spark is to help us answer these questions in a hands-on manner. We introduce the latest scalable technologies to help us manage and process big data. We then introduce advanced analytical algorithms applied to real-world use cases in order to uncover patterns, derive actionable insights, and learn from this big data.

What this book covers

Chapter 1, The Big Data Ecosystem, provides an introduction to the current big data ecosystem. With the multitude of on-premises and cloud-based technologies, tools, services, libraries, and frameworks available in the big data, artificial intelligence, and machine learning space (and growing every day!), it is vitally important to understand the logical function of each layer within the big data ecosystem, and how those layers integrate with one another, in order to architect and engineer end-to-end data intelligence and machine learning pipelines. This chapter also provides a logical introduction to Apache Spark within the context of the wider big data ecosystem.

Chapter 2, Setting Up a Local Development Environment, provides a detailed and hands-on guide to installing, configuring, and deploying a local Linux-based development environment on your personal desktop, laptop, or cloud-based infrastructure. You will learn how to install and configure all the software services required for this book in one self-contained location, including installing and configuring prerequisite programming languages (Java JDK 8 and Python 3), a distributed data processing and analytics engine (Apache Spark 2.3), a distributed real-time streaming platform (Apache Kafka 2.0), and a web-based notebook for interactive data insights and analytics (Jupyter Notebook).

Chapter 3, Artificial Intelligence and Machine Learning, provides a concise theoretical summary of the various applied subjects that fall under the artificial intelligence field of study, including machine learning, deep learning, and cognitive computing. This chapter also provides a logical introduction to how end-to-end data intelligence and machine learning pipelines may be architected and engineered using Apache Spark and its machine learning library, MLlib.
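To give a flavor of the pipeline concept this chapter builds on: in Spark MLlib, a Pipeline chains a sequence of stages so that data flows through them in order. The sketch below is a plain-Python toy with made-up class and stage names, not MLlib's actual API, but it conveys the same idea.

```python
# A minimal plain-Python sketch of the pipeline idea behind Spark MLlib's
# Pipeline API: a chain of stages through which data flows end to end.
# The names here (Pipeline, lowercase, tokenize) are illustrative only.

class Pipeline:
    """Applies a sequence of transform functions to the input, in order."""
    def __init__(self, stages):
        self.stages = stages

    def run(self, data):
        for stage in self.stages:
            data = stage(data)
        return data

# Example stages: clean raw text, then split it into tokens.
def lowercase(texts):
    return [t.lower() for t in texts]

def tokenize(texts):
    return [t.split() for t in texts]

pipeline = Pipeline(stages=[lowercase, tokenize])
print(pipeline.run(["Machine Learning", "Apache Spark"]))
# → [['machine', 'learning'], ['apache', 'spark']]
```

In MLlib the stages are Transformers and Estimators operating on distributed DataFrames rather than Python lists, but the composition principle is the same.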

Chapter 4, Supervised Learning Using Apache Spark, provides a hands-on guide to engineering, training, validating, and interpreting the results of supervised machine learning algorithms using Apache Spark through real-world use cases. The chapter describes and implements commonly used classification and regression techniques, including linear regression, logistic regression, classification and regression trees (CART), and random forests.
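As a taste of the first technique listed above, the toy sketch below fits a simple linear regression by gradient descent in plain Python. In the book this is done at scale with Spark MLlib; this sketch only conveys the underlying idea of learning parameters from labeled examples.

```python
# Toy linear regression y = w*x + b fitted by gradient descent on mean
# squared error. Illustrative only; Spark MLlib's LinearRegression handles
# this in a distributed, optimized fashion.

def fit_linear(xs, ys, lr=0.01, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1; the fit should recover roughly those values.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
w, b = fit_linear(xs, ys)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```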

Chapter 5, Unsupervised Learning Using Apache Spark, provides a hands-on guide to engineering, training, validating, and interpreting the results of unsupervised machine learning algorithms using Apache Spark through real-world use cases. The chapter describes and implements commonly used unsupervised techniques, including hierarchical clustering, K-means clustering, and dimensionality reduction via Principal Component Analysis (PCA).
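To illustrate one of these techniques, the sketch below runs K-means clustering (Lloyd's algorithm) on one-dimensional data in plain Python. Spark MLlib's KMeans performs the same assign-and-update iteration in a distributed fashion; this toy version only shows the core loop.

```python
# A compact plain-Python sketch of K-means clustering (Lloyd's algorithm)
# on 1-D data. Illustrative only; not the book's Spark MLlib code.

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups, centered around 2 and 10.
print(kmeans_1d([1.0, 2.0, 3.0, 9.0, 10.0, 11.0], centroids=[0.0, 5.0]))
# → [2.0, 10.0]
```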

Chapter 6, Natural Language Processing Using Apache Spark, provides a hands-on guide to engineering natural language processing (NLP) pipelines using Apache Spark through real-world use cases. The chapter describes and implements commonly used NLP techniques, including feature transformers such as tokenization, stemming, lemmatization, and normalization, as well as feature extractors such as the bag-of-words model and the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm.
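The intuition behind TF-IDF can be shown in a few lines of plain Python. This sketch uses one common formulation, idf = log(N / df); Spark MLlib's IDF applies a smoothed variant, but the idea is the same: terms that appear in every document carry little signal and are down-weighted.

```python
import math

# A plain-Python illustration of TF-IDF weighting over tokenized documents.
# Illustrative only; the book uses Spark MLlib's feature extractors.

def tf_idf(docs):
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    # Weight each term's count in a document by its inverse document frequency.
    return [{t: doc.count(t) * math.log(n / df[t]) for t in set(doc)}
            for doc in docs]

docs = [["spark", "machine", "learning"],
        ["spark", "streaming"],
        ["spark", "kafka", "streaming"]]
weights = tf_idf(docs)
# "spark" appears in all three documents, so its weight is log(3/3) = 0.
print(weights[0]["spark"])  # → 0.0
```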

Chapter 7, Deep Learning Using Apache Spark, provides a hands-on exploration of the exciting and cutting-edge world of deep learning! The chapter uses third-party deep learning libraries in conjunction with Apache Spark to train and interpret the results of Artificial Neural Networks (ANNs), including Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs), applied to real-world use cases.
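To preview the basic computation an ANN performs, the sketch below runs a forward pass through a tiny MLP with one hidden layer and sigmoid activations, in plain Python with made-up, untrained weights. The chapter's real networks are built and trained with third-party libraries alongside Spark.

```python
import math

# A minimal forward pass through a tiny Multi-Layer Perceptron (MLP):
# one hidden layer, sigmoid activations. Weights here are illustrative
# and untrained; a real network learns them from data.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    # Each hidden unit computes a weighted sum of the inputs, then a sigmoid.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    # The output unit does the same over the hidden activations.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

out = forward(inputs=[1.0, 0.5],
              hidden_weights=[[0.4, -0.2], [0.3, 0.8]],
              output_weights=[1.0, -1.0])
print(out)  # a probability-like value strictly between 0 and 1
```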

Chapter 8, Real-Time Machine Learning Using Apache Spark, extends the deployment of machine learning models beyond batch processing in order to learn from data, make predictions, and identify trends in real time! The chapter provides a hands-on guide to engineering and deploying real-time stream processing and machine learning pipelines using Apache Spark and Apache Kafka to transport, transform, and analyze data streams as they are being created around the world.
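A real-time pipeline cannot re-scan its entire history for every new event, so state must be updated incrementally as records arrive. The plain-Python sketch below shows that core idea with a running mean; it is an illustration of the streaming mindset, not the book's Spark and Kafka code.

```python
# Incremental (online) computation: update a running statistic per event,
# without storing the whole stream. Spark's streaming APIs apply the same
# principle per record or micro-batch. Illustrative sketch only.

class RunningMean:
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean update: no need to keep past values around.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

stream = RunningMean()
for reading in [10.0, 12.0, 11.0, 13.0]:  # events arriving one at a time
    current = stream.update(reading)
print(current)  # → 11.5
```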