PySpark Cookbook

By: Denny Lee, Tomasz Drabas

Overview of this book

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. The PySpark Cookbook presents effective, time-saving recipes for using Python within the Spark ecosystem. You’ll start by learning the Apache Spark architecture and how to set up a Python environment for Spark. You’ll then get familiar with the modules available in PySpark and start using them. You’ll also discover how to abstract data with RDDs and DataFrames, and understand the streaming capabilities of PySpark. You’ll then move on to using ML and MLlib to solve machine learning problems in PySpark, and use GraphFrames to solve graph-processing problems. Finally, you will explore how to deploy your applications to the cloud using the spark-submit command. By the end of this book, you will be able to use the Python API for Apache Spark to build data-intensive applications.

Index

A

  • actions, DataFrame
    • about / Overview of DataFrame actions
    • .show(...) action / The .show(...) action
    • .collect() action / The .collect() action
    • .take(...) action / The .take(...) action
    • .toPandas() action / The .toPandas() action
    • reference / See also
  • Advanced Packaging Tool (APT) / How it works...
  • Anaconda
    • reference / Installing Python
  • analysis of variance (ANOVA) / How to do it...
  • apt-get tool / How it works...

B

  • benchmark logistic regression model
    • forest coverage types, predicting / Predicting forest coverage types, How it works..., There's more...
  • binaries
    • Spark, installing / Installing Spark from binaries, How to do it..., How it works...
  • Bokeh
    • references / There's more...
  • breadth first search (BFS) algorithm / Finding the fewest number of connections

C

  • .collect() action / The .collect() action
  • Catalyst Optimizer / Introduction
  • categorical features
    • exploring / Categorical features
    • reference / See also...
  • census respondents
    • hours of work, predicting / Predicting hours of work for census respondents, How it works...
    • income levels, forecasting / Forecasting the income levels of census respondents, How it works..., There's more...
  • chi-square test
    • reference / How it works...
  • classification
    • about / How to do it...
    • Linear discriminant analysis (LDA) / How to do it...
    • LinearSVC / Introducing Estimators
    • LogisticRegression / Introducing Estimators
    • DecisionTreeClassifier / Introducing Estimators
    • GBTClassifier / Introducing Estimators
    • RandomForestClassifier / Introducing Estimators
    • NaiveBayes / Introducing Estimators
    • MultilayerPerceptronClassifier / Introducing Estimators
    • OneVsRest / Introducing Estimators
  • classification metrics
    • reference / How it works...
  • Cloudera
    • about / Working with Cloudera Spark images
    • virtual image, using / Working with Cloudera Spark images
  • Cloudera QuickStart VM
    • reference / How to do it...
    • downloading / How to do it...
    • configuring / How it works...
  • clustering
    • BisectingKMeans / Introducing Estimators
    • KMeans / Introducing Estimators
    • GaussianMixture / Introducing Estimators
    • Latent Dirichlet Allocation (LDA) / Introducing Estimators
    • about / Clustering forest cover types
    • reference / See also
  • clustering models
    • building / Building a clustering models, There's more...
  • collinearity / Selecting the most predictable features
  • configuration options, Spark
    • reference / See also
  • continuous aggregation
    • with structured streaming / Continuous aggregation with structured streaming, How to do it..., How it works...
  • continuous variables
    • discretizing / Discretizing continuous variables, How it works...
    • standardizing / Standardizing continuous variables, How it works...
  • correlation methods
    • reference / There's more...
  • correlations
    • computing / Computing correlations, There's more...
  • CSV
    • data, reading from / From CSV

D

  • .describe() transformation / The .summary() and .describe() transformations
  • .distinct(...) transformation / The .distinct(...) transformation
  • .dropDuplicates(...) transformation / The .dropDuplicates(...) transformation
  • .dropna(...) transformation / The .dropna(...) transformation
  • D3.js visualization
    • reference / Visualizing the graph
  • data
    • reading, from files / Reading data from files, How to do it...
    • loading / Loading the data, How it works..., There's more...
    • exploring / Exploring the data, How it works..., There's more...
    • testing / Testing the data, Getting ready, How it works...
    • transforming / Transforming the data, How it works..., There's more...
    • standardizing / Standardizing the data, How it works...
  • Databricks
    • reference / Introduction, How to do it...
  • data formats
    • reference / See also
  • DataFrame
    • creating / Creating DataFrames, How it works...
    • data, reading from JSON / From JSON
    • data, reading from CSV / From CSV
    • used, for performance optimization / Performance optimizations, Getting ready, How it works..., There's more...
    • interacting, with SQL / Using SQL to interact with DataFrames, How it works..., There's more...
    • transformations / Overview of DataFrame transformations, How to do it...
    • actions / Overview of DataFrame actions
  • data splitting
    • reference / See also
  • descriptive statistics
    • exploring / Exploring descriptive statistics, How it works...
    • exploring, for aggregated columns / Descriptive statistics for aggregated columns
    • reference link / See also
  • Discrete Cosine Transform (DCT) / Introducing Transformers
  • Discretized Streams (DStreams) / Understanding Spark Streaming, Understanding DStreams
  • duplicated records, in full_removed DataFrame
    • different IDs / Only IDs differ
    • ID collisions / ID collisions
  • duplicates
    • handling / Handling duplicates, How it works...

E

  • estimators
    • about / Introducing Estimators
    • linear SVM model / Introducing Estimators
    • linear regression model / Introducing Estimators
    • classification / Introducing Estimators
    • regression / Introducing Estimators
    • clustering / Introducing Estimators
  • evaluation of clustering
    • reference / See also

F

  • .fillna(...) transformation / The .fillna(...) transformation
  • .filter(...) transformation / The .filter(...) transformation
  • .freqItems(...) transformation / The .freqItems(...) transformation
  • features
    • interactions, visualizing / Visualizing interactions between features, How it works..., There's more...
    • extracting, from text / Extracting features from text, How it works..., There's more...
  • features, regression
    • Pearson's correlation / How to do it...
    • analysis of variance (ANOVA) / How to do it...
  • feature selection
    • reference / See also
  • files
    • data, reading / Reading data from files, How to do it...
    • .textFile(...) method / .textFile(...) method
    • .map(...) method / .map(...) method
    • performance / Partitions and performance
    • partitions / Partitions and performance
  • forest coverage types
    • predicting, with benchmark logistic regression model / Predicting forest coverage types, How it works..., There's more...
    • predicting, with random forest classifier / Predicting forest coverage types, How it works..., There's more...
    • clustering / Clustering forest cover types, How it works...
  • forest elevation
    • predicting, with linear SVM model / Getting ready, How it works..., There's more...
    • predicting, with linear regression model / Getting ready, How it works..., There's more...
    • estimating, with gradient-boosted trees regressor / Estimating forest elevation, How it works..., There's more...
    • estimating, with random forest regression model / Estimating forest elevation, How it works..., There's more...

G

  • .groupBy(...) transformation / The .groupBy(...) transformation
  • Generalized Linear Model (GLM) / Introducing Estimators
  • global aggregations
    • about / Understanding global aggregations
    • implementing / Understanding global aggregations, How it works...
    • Netcat window / Terminal 1 – Netcat window
    • Spark Streaming window / Terminal 2 – Spark Streaming window
  • Gradient-Boosted Trees (GBT) / Introducing Estimators
  • gradient-boosted trees regressor
    • forest elevation, estimating / Estimating forest elevation, How it works..., There's more...
  • gradient descent
    • reference / How it works...
  • graph
    • about / Introduction
    • usage / Introduction
    • data, preparing / Preparing the data, How to do it..., How it works..., There's more...
    • building / Building the graph, How to do it..., How it works...
    • queries, executing / Running queries against the graph, How to do it..., How it works...
    • patterns, understanding with motifs / Understanding the graph, How it works...
    • fewest number of connections, searching / Finding the fewest number of connections, How it works..., There's more...
    • visualizing / Visualizing the graph, How to do it..., How it works...
  • GraphFrames
    • installing / Installing GraphFrames, Getting ready, How it works...
    • improvements / Installing GraphFrames
    • reference / Installing GraphFrames, How to do it...
  • GraphFrames Spark package
    • reference / How it works...

H

  • histograms
    • drawing / Drawing histograms, How it works..., There's more...
    • reference / See also
  • hyperparameters
    • tuning / Tuning hyperparameters, How it works..., There's more...

I

  • internal field separator (IFS) / How it works...
  • interquartile range / How to do it...
  • Inverse Document Frequency (IDF) / Introducing Transformers

J

  • .join(...) transformation / The .join(...) transformation
  • Java
    • installing / Installing Java
    • reference / Installing Java
  • JSON
    • data, reading from / From JSON
  • Jupyter
    • installing / Installing Jupyter, How it works..., There's more..., See also
    • reference / How it works...
    • session, configuring / Configuring a session in Jupyter, Getting ready, How it works..., There's more...
  • Jupyter kernel
    • reference / How it works...
  • Jupyter Notebook / How it works...

K

  • kernel
    • about / How it works...
    • reference / How it works...

L

  • Latent Dirichlet Allocation (LDA) / Introducing Estimators, How it works...
  • Linear discriminant analysis (LDA) / How to do it...
  • linear regression model
    • for predicting forest elevation / Getting ready, How it works..., There's more...
  • linear SVM model
    • for predicting forest elevation / Getting ready, How it works..., There's more...
  • Livy REST API
    • reference / See also
  • local instance, Spark
    • configuring / Configuring a local instance of Spark, How it works...
    • spark.app.name parameter / How it works...
    • spark.driver.cores parameter / How it works...
    • spark.driver.memory parameter / How it works...
    • spark.executor.cores parameter / How it works...
    • spark.executor.memory parameter / How it works...
    • spark.submit.pyFiles parameter / How it works...
    • spark.submit.deployMode parameter / How it works...
    • spark.pyspark.python parameter / How it works...

M

  • machine learning (ML)
    • about / Transforming the data
    • reference / How it works...
  • machine learning algorithms
    • evaluation metrics, reference / See also
  • Matplotlib
    • reference link / How it works...
  • Maven
    • reference / Installing Maven
    • installing / Installing Maven
  • missing observations
    • handling / Handling missing observations, How to do it..., There's more...
  • missing observations per column
    • handling / Missing observations per column
  • missing observations per row
    • handling / Missing observations per row
  • motifs
    • used, for understanding patterns in graph / Understanding the graph, How it works...
  • multi-node instance, Spark
    • configuring / Configuring a multi-node instance of Spark, Getting ready, How to do it..., How it works...
  • multicollinearity
    • reference / There's more...

N

  • Netcat
    • starting / Terminal 1 – Netcat window
  • numerical features
    • exploring / Numerical features

O

  • .orderBy(...) transformation / The .orderBy(...) transformation
  • on-time flight performance data
    • reference / Preparing the data
  • outliers
    • handling / How it works...
    • reference link / See also

P

  • PageRank
    • about / Using PageRank to determine airport ranking
    • reference / Using PageRank to determine airport ranking, How it works...
    • used, for determining airport ranking / Using PageRank to determine airport ranking, How it works...
  • PATH
    • updating / Updating PATH
  • Pearson's correlation / How to do it...
  • performance statistics
    • computing / Computing performance statistics
    • computing, with regression metrics / Regression metrics
    • computing, with classification metrics / Classification metrics
  • pip
    • reference / There's more...
    • about / There's more..., How it works...
  • Pipelines
    • about / Introducing Pipelines
    • using / How to do it..., How it works...
    • reference / See also
  • Precision-Recall (PR) / Classification metrics
  • predictable features
    • selecting / Selecting the most predictable features, How it works..., There's more...
    • correlations, checking / There's more...
  • problems, dataset
    • duplicated observations / Introduction
    • missing observations / Introduction
    • anomalous observations / Introduction
    • encoding / Introduction
    • untrustworthy answers / Introduction
  • PySpark
    • installing, with pip / There's more...
  • pyspark.sql module
    • reference / How it works...
  • Python
    • installing / Installing Python
  • Python Package Index (PyPI)
    • about / How it works...
    • reference / See also

R

  • .repartition(...) transformation / The .repartition(...) transformation
  • R
    • reference / Installing R
    • installing / Installing R
  • random forest classifier
    • forest coverage types, predicting / Predicting forest coverage types, How it works..., There's more...
  • random forest regression model
    • forest elevation, estimating / Estimating forest elevation, How it works..., There's more...
  • RDD actions
    • overview / Overview of RDD actions, How to do it...
    • .take(...) action / .take(...) action
    • .collect() action / .collect() action
    • .reduce(...) action / .reduce(...) action
    • .count() action / .count() action
    • .saveAsTextFile(...) action / .saveAsTextFile(...) action
    • implementing / How it works...
  • RDD transformations
    • overview / Overview of RDD transformations, How to do it...
    • reference / How to do it...
    • .map(...) transformation / .map(...) transformation
    • .filter(...) transformation / .filter(...) transformation
    • .flatMap(...) transformation / .flatMap(...) transformation
    • .distinct() transformation / .distinct() transformation
    • .sample(...) transformation / .sample(...) transformation
    • .join(...) transformation / .join(...) transformation
    • .repartition(...) transformation / .repartition(...) transformation
    • .zipWithIndex() transformation / .zipWithIndex() transformation
    • .reduceByKey(...) transformation / .reduceByKey(...) transformation
    • .sortByKey(...) transformation / .sortByKey(...) transformation
    • .union(...) transformation / .union(...) transformation
    • .mapPartitionsWithIndex(...) transformation / .mapPartitionsWithIndex(...) transformation
    • implementing / How it works...
  • Receiver Operating Characteristics (ROC) / Classification metrics
  • redirection pipes
    • reference / How it works...
  • reflection
    • schema, inferring / Inferring the schema using reflection, How it works..., See also
  • regression
    • about / How to do it...
    • AFTSurvivalRegression / Introducing Estimators
    • DecisionTreeRegressor / Introducing Estimators
    • GBTRegressor / Introducing Estimators
    • GeneralizedLinearRegression / Introducing Estimators
    • IsotonicRegression / Introducing Estimators
    • LinearRegression / Introducing Estimators
    • RandomForestRegressor / Introducing Estimators
  • regression metrics
    • reference / How it works...
  • Resilient Distributed Datasets (RDDs)
    • about / Introduction
    • creating / Creating RDDs, How it works...
    • Spark context parallelize method / Spark context parallelize method
    • .take(...) method / .take(...) method
    • pitfalls / Pitfalls of using RDDs, How to do it..., How it works...
    • accessing / Accessing underlying RDDs, How it works...
    • creating, for training / Creating an RDD for training, How it works..., There's more...
    • for classification / Classification
    • for regression / Regression

S

  • .select(...) transformation / The .select(...) transformation
  • .show(...) action / The .show(...) action
  • .summary() transformation / The .summary() and .describe() transformations
  • Scala
    • installing / Installing Scala
  • schema
    • inferring, with reflection / Inferring the schema using reflection, How it works..., See also
    • reference / See also, See also
    • specifying, programmatically / Specifying the schema programmatically, How it works..., See also
  • silhouette metrics
    • reference / How it works...
  • skip-gram model
    • reference / There's more...
  • Spark
    • about / Introduction
    • features / Introduction
    • requisites, installing / Installing Spark requirements, Getting ready, How it works...
    • reference / How it works..., See also
    • Java, installing / Installing Java
    • Python, installing / Installing Python
    • R, installing / Installing R
    • Scala, installing / Installing Scala
    • Maven, installing / Installing Maven
    • PATH, updating / Updating PATH
    • installing, from sources / Installing Spark from sources, How to do it..., How it works..., See also
    • installing, from binaries / Installing Spark from binaries, How to do it..., How it works...
    • local instance, configuring / Configuring a local instance of Spark, How it works...
    • multi-node instance, configuring / Configuring a multi-node instance of Spark, Getting ready, How to do it..., How it works...
  • Spark DataFrame / Creating DataFrames
  • sparkmagic package
    • installing / How it works...
    • reference / How it works..., See also
  • Spark Streaming
    • about / Understanding Spark Streaming
    • reference / How to do it...
    • Netcat window, using / Terminal 1 – Netcat window
    • PySpark Streaming application, creating / Terminal 2 – Spark Streaming window
    • console application, creating / How it works..., There's more...
  • Spark Streaming Context (SSC) / Understanding Spark Streaming
  • SQL
    • used, for interacting with DataFrame / Using SQL to interact with DataFrames, How it works..., There's more...
  • statistical test
    • reference / See also...
  • stop words
    • reference / How it works...
  • structured streaming
    • continuous aggregation / Continuous aggregation with structured streaming, How to do it..., How it works...
    • Netcat window / Terminal 1 – Netcat window
    • Spark Streaming window / Terminal 2 – Spark Streaming window
  • SVM (Support Vector Machine) / Forecasting the income levels of census respondents

T

  • .take(...) action / The .take(...) action
  • .toPandas() action / The .toPandas() action
  • temporary table
    • creating / Creating a temporary table, How it works..., There's more...
  • term frequency-inverse document frequency (TF-IDF) / How it works...
  • text
    • features, extracting / Extracting features from text, How it works..., There's more...
  • topic
    • assigning, to set of short paragraphs / Topic mining, How it works...
  • train-validation split / There's more...
  • transformations, DataFrame
    • about / Overview of DataFrame transformations, How to do it...
    • .select(...) transformation / The .select(...) transformation
    • .filter(...) transformation / The .filter(...) transformation
    • .groupBy(...) transformation / The .groupBy(...) transformation
    • .orderBy(...) transformation / The .orderBy(...) transformation
    • .withColumn(...) transformation / The .withColumn(...) transformation
    • .join(...) transformation / The .join(...) transformation
    • .unionAll(...) transformation / The .unionAll(...) transformation
    • .distinct(...) transformation / The .distinct(...) transformation
    • .repartition(...) transformation / The .repartition(...) transformation
    • .fillna(...) transformation / The .fillna(...) transformation
    • .dropna(...) transformation / The .dropna(...) transformation
    • .dropDuplicates(...) transformation / The .dropDuplicates(...) transformation
    • .describe() transformation / The .summary() and .describe() transformations
    • .summary() transformation / The .summary() and .describe() transformations
    • .freqItems(...) transformation / The .freqItems(...) transformation
    • reference / See also
  • transformers
    • about / Introducing Transformers
    • Binarizer / Introducing Transformers
    • Bucketizer / Introducing Transformers
    • ChiSqSelector / Introducing Transformers
    • CountVectorizer / Introducing Transformers
    • DCT / Introducing Transformers
    • ElementwiseProduct / Introducing Transformers
    • HashingTF / Introducing Transformers
    • IDF / Introducing Transformers
    • IndexToString / Introducing Transformers
    • MaxAbsScaler / Introducing Transformers
    • MinMaxScaler / Introducing Transformers
    • NGram / Introducing Transformers
    • Normalizer / Introducing Transformers
    • OneHotEncoder / Introducing Transformers
    • PCA / Introducing Transformers
    • PolynomialExpansion / Introducing Transformers
    • QuantileDiscretizer / Introducing Transformers
    • RegexTokenizer / Introducing Transformers
    • RFormula / Introducing Transformers
    • SQLTransformer / Introducing Transformers
    • StandardScaler / Introducing Transformers
    • StopWordsRemover / Introducing Transformers
    • StringIndexer / Introducing Transformers
    • Tokenizer / Introducing Transformers
    • VectorAssembler / Introducing Transformers
    • VectorIndexer / Introducing Transformers
    • VectorSlicer / Introducing Transformers
    • Word2Vec / Introducing Transformers
    • using / Getting ready, How it works...
    • .VectorAssembler(...) method / There's more...
    • reference / See also

U

  • .unionAll(...) transformation / The .unionAll(...) transformation
  • User Defined Functions (UDFs) / Performance optimizations

V

  • vectorized UDFs
    • reference / See also
  • VirtualBox
    • reference / Getting ready
    • installation / Getting ready

W

  • .withColumn(...) transformation / The .withColumn(...) transformation
  • Word2Vec
    • reference / There's more...