Learning PySpark

By: Tomasz Drabas, Denny Lee

Overview of this book

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark. You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. Also, you will get a thorough overview of the machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

The spark-submit command


The entry point for submitting jobs to Spark (be it locally or on a cluster) is the spark-submit script. Although submitting jobs is its main purpose, the script also allows you to kill jobs or check their status.
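For instance, killing a running application or checking its status (supported for applications deployed in standalone and Mesos cluster modes) might look like the following; the submission ID and master URL here are just placeholders:

spark-submit --master spark://207.184.161.138:7077 --kill driver-20170101000000-0001
spark-submit --master spark://207.184.161.138:7077 --status driver-20170101000000-0001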

Note

Under the hood, the spark-submit command passes the call to the spark-class script that, in turn, starts a launcher Java application. For those interested, you can check the script in the Spark GitHub repository: https://github.com/apache/spark/blob/master/bin/spark-submit.

The spark-submit command provides a unified API for deploying apps on a variety of Spark-supported cluster managers (such as Mesos or YARN), relieving you of the need to configure your application for each of them separately.
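To illustrate, the same application can be pointed at different cluster managers simply by changing the --master option; the script name and the Mesos master URL in this sketch are hypothetical:

spark-submit --master local[4] my_script.py
spark-submit --master yarn my_script.py
spark-submit --master mesos://mesos-master:5050 my_script.py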

At the most general level, the syntax looks as follows:

spark-submit [options] <python file> [app arguments]

We will go through the full list of options shortly. The app arguments are the parameters you want to pass to your application.
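To make this concrete, here is a minimal sketch of such an application (the file name my_script.py and its arguments are hypothetical) that reads its parameters from sys.argv; it could be submitted as spark-submit my_script.py input.csv 10:

    import sys
    from pyspark.sql import SparkSession

    if __name__ == '__main__':
        # App arguments arrive via sys.argv, just as in any Python script
        input_path = sys.argv[1]       # for example, input.csv
        row_count = int(sys.argv[2])   # for example, 10

        spark = SparkSession.builder.appName('argv-example').getOrCreate()

        # Read the file passed as the first argument and show the first rows
        df = spark.read.csv(input_path, header=True)
        df.show(row_count)

        spark.stop()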

Note

You...