Learning PySpark

By: Tomasz Drabas, Denny Lee

Overview of this book

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark. You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. Also, you will get a thorough overview of machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

Predicting infant survival


Finally, we can move to predicting the infants' survival chances. In this section, we will build two models: a linear classifier (logistic regression) and a non-linear one (a random forest). For the former, we will use all the features at our disposal, whereas for the latter, we will employ a ChiSqSelector(...) method to select the top four features.
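To give a feel for what chi-square feature selection does, here is a minimal pure-Python sketch. It is not Spark's ChiSqSelector implementation: the toy rows, the labels, and the helper names chi_square_stat and select_top_k are hypothetical, and ranking by the raw statistic (rather than, say, a p-value) is an assumption of this sketch. It keeps the top two of three toy features; the chapter applies the same idea to keep the top four.

```python
from collections import Counter


def chi_square_stat(feature, labels):
    """Pearson chi-square statistic for one categorical feature vs. the label."""
    n = len(labels)
    observed = Counter(zip(feature, labels))   # joint counts
    feature_totals = Counter(feature)          # row totals
    label_totals = Counter(labels)             # column totals
    stat = 0.0
    for f_val, f_count in feature_totals.items():
        for l_val, l_count in label_totals.items():
            expected = f_count * l_count / n   # count expected under independence
            stat += (observed[(f_val, l_val)] - expected) ** 2 / expected
    return stat


def select_top_k(rows, labels, k):
    """Return the (sorted) indices of the k features with the largest statistic."""
    n_features = len(rows[0])
    stats = [chi_square_stat([row[j] for row in rows], labels)
             for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: stats[j], reverse=True)
    return sorted(ranked[:k]), stats


# Hypothetical toy data: three categorical features per row, binary label.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
rows = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
]

selected, stats = select_top_k(rows, labels, k=2)
print(selected, stats)
```

Feature 0 mirrors the label exactly, feature 1 is independent of it, and feature 2 agrees with it most of the time, so the selection keeps features 0 and 2.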

Logistic regression in MLlib

Logistic regression is something of a benchmark for building any classification model. MLlib used to provide a logistic regression model estimated using a stochastic gradient descent (SGD) algorithm. This model has been deprecated in Spark 2.0 in favor of the LogisticRegressionWithLBFGS model.

The LogisticRegressionWithLBFGS model uses the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) optimization algorithm. It is a quasi-Newton method that approximates the BFGS algorithm using a limited amount of memory.
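To make the limited-memory idea more concrete, the following is a minimal pure-Python sketch of L-BFGS applied to a tiny logistic regression. It is not MLlib's implementation (in MLlib this work happens inside LogisticRegressionWithLBFGS.train(...)). The toy data, the helper names loss_and_grad and lbfgs, the history size m=5, the L2 penalty reg=0.1, and the simple backtracking line search are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def loss_and_grad(w, X, y, reg=0.1):
    """Mean logistic loss plus an L2 penalty, and its gradient."""
    n = len(X)
    loss = 0.5 * reg * dot(w, w)
    grad = [reg * wi for wi in w]
    for xi, yi in zip(X, y):
        z = dot(w, xi)
        # log(1 + exp(z)) - y*z, written to avoid overflow
        loss += (max(z, 0.0) + math.log1p(math.exp(-abs(z))) - yi * z) / n
        p = sigmoid(z)
        for j, xij in enumerate(xi):
            grad[j] += (p - yi) * xij / n
    return loss, grad


def lbfgs(f, w0, m=5, max_iter=100, tol=1e-6):
    w = list(w0)
    loss, g = f(w)
    history = []  # last m (s, y, rho) triples, oldest first
    for _ in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            break
        # Two-loop recursion: approximate (inverse Hessian) * gradient
        # from the stored pairs only, never forming a full matrix.
        q = list(g)
        alphas = []
        for s, yv, rho in reversed(history):
            a = rho * dot(s, q)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, yv)]
        if history:
            s, yv, _ = history[-1]
            gamma = dot(s, yv) / dot(yv, yv)  # initial Hessian scaling
            q = [gamma * qi for qi in q]
        for (s, yv, rho), a in zip(history, reversed(alphas)):
            b = rho * dot(yv, q)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        d = [-qi for qi in q]  # search direction
        # Backtracking line search with the Armijo condition.
        step, slope = 1.0, dot(g, d)
        while True:
            w_new = [wi + step * di for wi, di in zip(w, d)]
            loss_new, g_new = f(w_new)
            if loss_new <= loss + 1e-4 * step * slope or step < 1e-10:
                break
            step *= 0.5
        s = [a - b for a, b in zip(w_new, w)]
        yv = [a - b for a, b in zip(g_new, g)]
        sy = dot(s, yv)
        if sy > 1e-12:  # keep only pairs that preserve positive curvature
            history.append((s, yv, 1.0 / sy))
            if len(history) > m:
                history.pop(0)  # this is the "limited memory" part
        w, loss, g = w_new, loss_new, g_new
    return w, loss


# Hypothetical toy data: a bias column plus one feature, binary labels.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]

objective = lambda w: loss_and_grad(w, X, y)
weights, final_loss = lbfgs(objective, [0.0, 0.0])
print(weights, final_loss)
```

The full BFGS method would maintain a dense approximation of the inverse Hessian; here only the last m pairs of weight and gradient differences are kept, which is what makes the approach practical when the number of features is large.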

Note

For those of you who are mathematically adept and interested in this, we suggest perusing this blog post...