Mastering Spark for Data Science

By: Andrew Morgan, Antoine Amend, Matthew Hallett, David George

Overview of this book

Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. In order to operate at this level you need to build data science solutions of substance: solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs. This book deep dives into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more. You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.
Table of Contents (22 chapters)
Mastering Spark for Data Science
Credits
Foreword
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface

Preface

The purpose of data science is to transform the world using data, and this goal is mainly achieved through disrupting and changing real processes in real industries. To operate at that level, we need to be able to build data science solutions of substance: ones that solve real problems, and which can run reliably enough for people to trust and act upon.

This book explains how to use Spark to deliver production-grade data science solutions that are innovative, disruptive, and reliable enough to be trusted. Whilst writing this book it was the authors' intention to deliver a work that transcends the traditional cookbook style: providing not just examples of code, but also developing the techniques and mindset needed to explore content like a master; as they say, Content is King! Readers will notice that the book has a heavy emphasis on news analytics, and occasionally pulls in other datasets such as tweets and financial data. This emphasis on news is not an accident; much effort has been spent on trying to focus on datasets that offer context at a global scale.

The implicit problem that this book is dedicated to is the lack of data offering proper context around how and why people make decisions. Often, directly accessible data sources are very focused on problem specifics and, as a consequence, can be very light on broader datasets offering the behavioral context needed to really understand what’s driving the decisions that people make.

Consider a simple example where website users' key information, such as age, gender, location, shopping behavior, and purchases, is known; we might use this data to recommend products based on what others "like them" have been buying.

But to be exceptional, more context is required as to why people behave as they do. When news reports suggest a massive Atlantic hurricane is approaching the Florida coastline, and could reach the coast in, say, 36 hours, perhaps we should be recommending products people might need: USB-enabled battery packs for keeping phones charged, candles, flashlights, water purifiers, and the like. By understanding the context in which decisions are being made, we can conduct better science.

Therefore, whilst this book certainly contains useful code and, in many cases, unique implementations, it further dives deep into the techniques and skills required to truly master data science; some of which are often overlooked or not considered at all. Drawing on many years of commercial experience, the authors have leveraged their extensive knowledge to bring the real, and exciting world of data science to life.

What this book covers

Chapter 1, The Big Data Science Ecosystem, is an introduction to an approach and accompanying ecosystem for achieving success with data at scale. It focuses on the data science tools and technologies that will be used in later chapters, as well as introducing the environment and how to configure it appropriately. Additionally, it explains some of the non-functional considerations relevant to the overall data architecture and long-term success.

Chapter 2, Data Acquisition, explains that one of a data scientist's most important tasks is to accurately load data into a data science platform. Rather than having uncontrolled, ad hoc processes, this chapter shows how to construct a general data ingestion pipeline in Spark that serves as a reusable component across many feeds of input data.
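The reusable-component idea can be illustrated with a minimal, Spark-free sketch: a generic ingest step that wraps any feed-specific parser and separates well-formed records from rejects for audit. The names (FeedRecord, parseLine, ingest) are illustrative only and are not taken from the book's code.

```scala
// A hypothetical feed record and its parser; each new feed supplies its own.
case class FeedRecord(id: String, value: Double)

def parseLine(line: String): Option[FeedRecord] =
  line.split('\t') match {
    case Array(id, v) => scala.util.Try(FeedRecord(id, v.toDouble)).toOption
    case _            => None
  }

// Generic, feed-agnostic ingestion: one reusable component, many parsers.
// Returns (valid records, rejected raw lines) so bad input is never lost silently.
def ingest[A](lines: Seq[String], parse: String => Option[A]): (Seq[A], Seq[String]) = {
  val parsed = lines.map(l => (l, parse(l)))
  (parsed.collect { case (_, Some(a)) => a },
   parsed.collect { case (l, None)    => l })
}
```

In Spark, the same shape applies with `lines` as an RDD or Dataset; the parser stays identical across feeds, which is what makes the pipeline reusable.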

Chapter 3, Input Formats and Schema, demonstrates how to load data from its raw format into different schemas, thereby enabling a variety of different kinds of downstream analytics to be run over the same data. With this in mind, we will look at the traditionally well-understood area of data schemas. We will cover key areas of traditional database modeling and explain how some of these cornerstone principles are still applicable to Spark today. In addition, whilst honing our Spark skills, we will analyze the GDELT data model and show how to store this large dataset in an efficient and scalable manner.

Chapter 4, Exploratory Data Analysis, tackles a common misconception: that an EDA is only for discovering the statistical properties of a dataset and providing insights about how it can be exploited. In practice, this isn't the full story. A full EDA extends that idea to include a detailed assessment of the feasibility of using the data feed in production. It requires us to also understand how we would specify a production-grade data loading routine for this dataset, one that might potentially run in a "lights out" mode for many years. This chapter offers a rapid method for data quality assessment, using a data profiling technique to accelerate the process.
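The essence of data profiling can be sketched in a few lines of plain Scala; the chapter applies the same idea with Spark so it scales to full datasets. The function name and the chosen metrics here are illustrative assumptions, not the book's implementation.

```scala
// Profile one column of string data: row count, null/empty count, and
// distinct non-empty values. These quick metrics expose quality problems
// (sparse columns, low cardinality, unexpected blanks) before production use.
def profileColumn(values: Seq[String]): Map[String, Int] = Map(
  "count"    -> values.size,
  "nulls"    -> values.count(v => v == null || v.trim.isEmpty),
  "distinct" -> values.filter(v => v != null && v.trim.nonEmpty).distinct.size
)
```

Running such a profile per column over a sample of a new feed gives a fast, repeatable first pass at the "feasibility in production" question.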

Chapter 5, Spark for Geographic Analysis, demonstrates how to get started with geographic processing, a powerful new use case for Spark. The aim of this chapter is to explain how data scientists can process geographic data, using Spark, to produce powerful map-based views of very large datasets. We demonstrate how to process spatio-temporal datasets easily via Spark's integration with GeoMesa, which helps turn Spark into a sophisticated geographic processing engine. The chapter later leverages this spatio-temporal data to apply machine learning with a view to predicting oil prices.

Chapter 6, Scraping Link-Based External Data, explains a common pattern for enhancing local data with external content found at URLs or over APIs, such as GDELT and Twitter. We offer a tutorial using the GDELT news index service as a source of news URLs, demonstrating how to build a web-scale news scanner that scrapes global breaking news of interest from the internet. We further explain how to use a specialist web-scraping component in a way that overcomes the challenges of scale.

Chapter 7, Building Communities, addresses a common use case in data science and big data. With more and more people interacting, communicating, exchanging information, or simply sharing a common interest in different topics, the entire world can be represented as a graph. A data scientist must be able to detect communities, find influencers and top contributors, and detect possible anomalies.
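Community detection in its simplest form reduces to finding connected components of a graph, which libraries such as GraphX perform at scale. As an illustrative, small-scale sketch (the function and approach here are assumptions for exposition, not the chapter's code), a union-find over an edge list assigns each vertex a component representative:

```scala
// Connected components via union-find with path compression.
// Each vertex is mapped to a representative; vertices sharing a
// representative belong to the same community (component).
def connectedComponents(edges: Seq[(Int, Int)]): Map[Int, Int] = {
  val parent = scala.collection.mutable.Map[Int, Int]()
  def find(x: Int): Int = {
    val p = parent.getOrElseUpdate(x, x)
    if (p == x) x
    else {
      val root = find(p)
      parent(x) = root  // path compression
      root
    }
  }
  // Union the endpoints of every edge.
  edges.foreach { case (a, b) => parent(find(a)) = find(b) }
  parent.keys.map(k => k -> find(k)).toMap
}
```

Real community detection (for example, with modularity-based methods) is more nuanced, but the component structure is the usual starting point.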

Chapter 8, Building a Recommendation System, notes that if one were to choose an algorithm to showcase data science to the public, a recommendation system would certainly be in the frame. Today, recommendation systems are everywhere; the reason for their popularity is down to their versatility, usefulness, and broad applicability. In this chapter, we will demonstrate how to recommend music content using raw audio signals.

Chapter 9, News Dictionary and Real-Time Tagging System, observes that while a hierarchical data warehouse stores data in files and folders, a typical Hadoop-based system relies on a flat architecture to store your data. Without proper data governance or a clear understanding of what your data is all about, there is an undeniable chance of turning data lakes into swamps, where an interesting dataset such as GDELT would be nothing more than a folder containing a vast amount of unstructured text files. In this chapter, we describe an innovative way of labeling incoming GDELT data in an unsupervised way and in near real time.

Chapter 10, Story De-duplication and Mutation, de-duplicates and indexes the GDELT database into stories, before tracking stories over time and understanding the links between them, how they may mutate, and whether they could lead to any subsequent events in the near future. Core to this chapter are the concept of Simhash for detecting near-duplicates, and the use of Random Indexing to build vectors of reduced dimensionality.
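The core Simhash idea can be shown in a compact sketch. This is a simplified 32-bit variant with naive word tokenization, written here purely for illustration; real implementations (including the chapter's) typically use 64-bit hashes and proper shingling.

```scala
// Simhash: each token's hash "votes" +1/-1 per bit position; the final
// fingerprint sets a bit wherever the vote is positive. Similar texts
// share most tokens, so their fingerprints differ in only a few bits.
def simhash(text: String): Int = {
  val votes = new Array[Int](32)
  for (tok <- text.toLowerCase.split("\\W+") if tok.nonEmpty) {
    val h = tok.hashCode
    for (i <- 0 until 32)
      votes(i) += (if (((h >>> i) & 1) == 1) 1 else -1)
  }
  (0 until 32).foldLeft(0)((acc, i) => if (votes(i) > 0) acc | (1 << i) else acc)
}

// Near-duplicates are pairs whose fingerprints have a small Hamming distance.
def hammingDistance(a: Int, b: Int): Int = Integer.bitCount(a ^ b)
```

Two stories are then flagged as near-duplicates when their Hamming distance falls below a chosen threshold, avoiding pairwise comparison of full texts.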

Chapter 11, Anomaly Detection and Sentiment Analysis, looks at perhaps the most notable occurrence of 2016: the tense US presidential election and its eventual outcome, the election of President Donald Trump. The campaign will long be remembered, not least for its unprecedented use of social media and the stirring up of passion among its users, most of whom made their feelings known through hashtags. In this chapter, instead of trying to predict the outcome itself, we aim to detect abnormal tweets during the US election using a real-time Twitter feed.

Chapter 12, TrendCalculus, notes that long before the concept of "what's trending" became a popular topic of study for data scientists, there was an older one that is still not well served by data science: that of trends. Presently, the analysis of trends, if it can be called that, is primarily carried out by people "eyeballing" time series charts and offering interpretations. But what is it that people's eyes are doing? This chapter describes an implementation in Apache Spark of a new algorithm for studying trends numerically: TrendCalculus.

Chapter 13, Secure Data, continues our habit, throughout this book, of straying into areas not traditionally associated with a data scientist's core working knowledge. Here we visit another of those often overlooked fields, secure data; more specifically, how to protect your data and analytic results at all stages of the data life cycle. Core to this chapter is the construction of a commercial-grade encryption codec for Spark.

Chapter 14, Scalable Algorithms, explains why even basic algorithms that work at small scale will often fail on big data. We'll see how to avoid issues when writing Spark jobs that run over massive datasets, and will learn about the structure of algorithms and how to write custom data science analytics that scale over petabytes of data. The chapter covers areas such as parallelization strategies, caching, shuffle strategies, garbage collection optimization, and probabilistic models, explaining how these can help you to get the most out of the Spark paradigm.
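One reason small-scale code fails on big data is aggregation strategy: grouping all values per key materializes whole groups in memory, whereas reducing incrementally per key (the idea behind Spark's reduceByKey and its map-side combine) keeps only one running value per key. A plain-Scala sketch of the second, scalable shape (the function name is illustrative):

```scala
// Incremental per-key aggregation: fold each (key, value) pair into a
// running total without ever building the full group of values for a key.
// This mirrors why reduceByKey scales where groupByKey-then-sum does not.
def sumByKey(pairs: Seq[(String, Int)]): Map[String, Int] =
  pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
    acc + (k -> (acc.getOrElse(k, 0) + v))
  }
```

The same data produces the same totals either way; the difference is that the incremental form bounds memory by the number of distinct keys, not the number of records.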

What you need for this book

Spark 2.0 is used throughout the book, along with Scala 2.11, Maven, and Hadoop. This is the basic environment required; many other technologies are also used, and these are introduced in the relevant chapters.

Who this book is for

We presume that the data scientists reading this book are knowledgeable about data science, common machine learning methods, and popular data science tools, and have in the course of their work run proof of concept studies, and built prototypes. We offer a book that introduces advanced techniques and methods for building data science solutions to this audience, showing them how to construct commercial grade data products.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The next lines of code read the link and assign it to the BeautifulSoup function."

A block of code is set as follows:

import org.apache.spark.sql.functions._

val rdd = rawDS map GdeltParser.toCaseClass
val ds = rdd.toDS()

// DataFrame-style API
ds.agg(avg("goldstein")).as("goldstein").show()

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

spark.sql("SELECT V2GCAM FROM GKG LIMIT 5").show 
spark.sql("SELECT AVG(GOLDSTEIN) AS GOLDSTEIN FROM GKG WHERE GOLDSTEIN IS NOT NULL").show()

Any command-line input or output is written as follows:

$ cat 20150218230000.gkg.csv | gawk -F"\t" '{print $4}'

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to File | Settings | Project Name | Project Interpreter."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.

  2. Hover the mouse pointer on the SUPPORT tab at the top.

  3. Click on Code Downloads & Errata.

  4. Enter the name of the book in the Search box.

  5. Select the book for which you're looking to download the code files.

  6. Choose from the drop-down menu where you purchased this book from.

  7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows

  • Zipeg / iZip / UnRarX for Mac

  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Spark-for-Data-Science. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringSparkforDataScience_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.