Mastering Spark for Data Science

By: Andrew Morgan, Antoine Amend, Matthew Hallett, David George

Overview of this book

Data science seeks to transform the world using data, and this is typically achieved by disrupting and changing real processes in real industries. To operate at this level you need to build data science solutions of substance – solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs. This book dives deep into using Spark to deliver production-grade data science solutions. The process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more. You will be introduced to advanced techniques and methods that will help you construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.

Chapter 2. Data Acquisition

As a data scientist, one of your most important tasks is to load data into your data science platform. Rather than relying on uncontrolled, ad hoc processes, this chapter explains how to construct a general data ingestion pipeline in Spark that serves as a reusable component across many feeds of input data. We walk through a configuration and demonstrate how it delivers vital feed management information under a variety of running conditions.
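To make the idea concrete, here is a minimal sketch of what a configuration-driven ingestion step might look like. The FeedConfig case class, its field names, and the example paths are hypothetical illustrations for this sketch, not the framework built later in the chapter:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical feed configuration; the real framework's fields may differ.
case class FeedConfig(
  name: String,        // logical feed name, for example "gdelt-events"
  inputPath: String,   // where new files for this feed land
  format: String,      // for example "csv", "json", or "parquet"
  outputPath: String   // where validated data is persisted
)

object GenericIngest {

  // One reusable ingestion step: read whatever has landed for a feed,
  // persist it, and return a simple volume metric for the feed register.
  def ingest(spark: SparkSession, cfg: FeedConfig): Long = {
    val df = spark.read.format(cfg.format).load(cfg.inputPath)
    df.write.mode("append").parquet(s"${cfg.outputPath}/${cfg.name}")
    df.count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("GenericIngest").getOrCreate()
    val cfg = FeedConfig("gdelt-events", "/data/landing/gdelt", "csv", "/data/vault")
    val rows = ingest(spark, cfg)
    println(s"Ingested $rows rows for feed ${cfg.name}")
    spark.stop()
  }
}
```

Because the feed-specific details live in the configuration object rather than the code, the same job can be pointed at any number of input feeds.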

Readers will learn how to construct a content register and use it to track all input loaded into the system, as well as to deliver metrics on ingestion pipelines, so that these flows can be run reliably as an automated, lights-out process.
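As a rough illustration of the idea, a content register can be as simple as a table to which every ingestion run appends one metadata row. The RegisterEntry schema and register path below are assumptions for this sketch, not the register developed in this chapter:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession

// Hypothetical register entry; the chapter's real schema may carry more fields.
case class RegisterEntry(
  feed: String,          // which ingestion pipeline produced the batch
  source: String,        // the input path or file that was loaded
  ingestedAt: Timestamp, // when the batch was registered
  rowCount: Long         // a simple volume/quality metric
)

object ContentRegister {

  // Append one entry per ingested batch to a central register table
  def record(spark: SparkSession, entry: RegisterEntry, registerPath: String): Unit = {
    import spark.implicits._
    Seq(entry).toDF().write.mode("append").parquet(registerPath)
  }
}
```

A monitoring job can then query this table to report feed volumes over time and to flag feeds that have stopped delivering data.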

In this chapter, we will cover the following topics:

  • Introducing the Global Database of Events, Language, and Tone (GDELT) dataset

  • Data pipelines

  • Universal ingestion framework

  • Real-time monitoring for new data

  • Receiving streaming data via Kafka (a minimal sketch appears after this list)

  • Registering new content and vaulting for tracking purposes

  • Visualization...
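As a preview of the Kafka topic listed above, the following is a minimal sketch of receiving streaming data with the spark-streaming-kafka-0-10 integration. The broker address, topic name, consumer group, and landing path are placeholder values, and the chapter's own receiver may be structured differently:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaIngest")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder broker, group, and offset settings; substitute your own
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "gdelt-ingest",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Subscribe to a placeholder topic and consume it as a direct stream
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("gdelt"), kafkaParams))

    // Hand each non-empty micro-batch to the same landing area the
    // batch ingestion and content register logic already watch
    stream.map(_.value).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"/data/landing/gdelt/${System.currentTimeMillis}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```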