Learning PySpark

By: Tomasz Drabas, Denny Lee

Overview of this book

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark. You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. Also, you will get a thorough overview of machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

Credits

Foreword

About the Authors

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

Understanding Spark

What is Apache Spark?

Spark Jobs and APIs

Spark 2.0 architecture

Summary

Resilient Distributed Datasets

Internal workings of an RDD

Creating RDDs

Global versus local scope

Transformations

Actions

Summary

DataFrames

Python to RDD communications

Catalyst Optimizer refresh

Speeding up PySpark with DataFrames

Creating DataFrames

Simple DataFrame queries

Interoperating with RDDs

Querying with the DataFrame API

Querying with SQL

DataFrame scenario – on-time flight performance

Spark Dataset API

Summary

Prepare Data for Modeling

Checking for duplicates, missing observations, and outliers

Getting familiar with your data

Visualization

Summary

Introducing MLlib

Overview of the package

Loading and transforming the data

Getting to know your data

Creating the final dataset

Predicting infant survival

Summary

Introducing the ML Package

Overview of the package

Predicting the chances of infant survival with ML

Parameter hyper-tuning

Other features of PySpark ML in action

Summary

GraphFrames

Introducing GraphFrames

Installing GraphFrames

Preparing your flights dataset

Building the graph

Executing simple queries

Understanding vertex degrees

Determining the top transfer airports

Understanding motifs

Determining airport ranking using PageRank

Determining the most popular non-stop flights

Using Breadth-First Search

Visualizing flights using D3

Summary

TensorFrames

What is Deep Learning?

What is TensorFlow?

Introducing TensorFrames

TensorFrames – quick start

Summary

Polyglot Persistence with Blaze

Summary

Structured Streaming

What is Spark Streaming?

Why do we need Spark Streaming?

What is the Spark Streaming application data flow?

Simple streaming application using DStreams

A quick primer on global aggregations

Introducing Structured Streaming

Summary

Packaging Spark Applications

The spark-submit command

Deploying the app programmatically

Databricks Jobs

Summary

Index

Other features of PySpark ML in action

At the beginning of this chapter, we described most of the features of the PySpark ML library. In this section, we will provide examples of how to use some of the Transformers and Estimators.

Feature extraction

We have used quite a few models from this submodule of PySpark. In this section, we'll show you how to use the most useful ones (in our opinion).

NLP-related feature extractors

As described earlier, the NGram model takes a list of tokens (tokenized text) and produces n-grams: sequences of n consecutive words (pairs, by default).
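To make the idea of an n-gram concrete, here is a minimal plain-Python sketch of the transformation. It is an illustration only, not PySpark's NGram implementation: the helper names tokenize and ngrams and the sample sentence are hypothetical, and the tokenizer is a deliberately simple assumption (lowercase, split on non-word characters).

```python
import re


def tokenize(text):
    """Lowercase the text and split it on runs of non-word characters.

    Hypothetical helper: a simplified stand-in for a real tokenizer.
    """
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a single space.

    If there are fewer than n tokens, the result is an empty list.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample sentence, not taken from the book's dataset.
tokens = tokenize("Spark makes cluster computing simple")
bigrams = ngrams(tokens, n=2)
print(bigrams)
```

With n=2 the sketch yields word pairs; raising n yields longer sequences, and an input shorter than n produces no n-grams at all.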

In this example, we will take an excerpt from PySpark's documentation and show how to clean up the text before passing it to the NGram model. Here's what our dataset looks like (abbreviated):

Tip

To see the full snippet, please download the code from our GitHub repository: https://github.com/drabastomek/learningPySpark.

We copied these four paragraphs from the description of the DataFrame usage in Pipelines: http://spark...
