Structured API - Spark DataFrame | Apache Spark 3 for Data Engineering and Analytics with Python

Apache Spark 3 for Data Engineering and Analytics with Python

By : David Mngadi

Overview of this book

Apache Spark 3 is an open-source distributed engine for querying and processing data. This course gives you a detailed understanding of PySpark and its stack, and guides you through the process of data analytics with PySpark. The author takes an interactive approach to explaining key PySpark concepts such as the Spark architecture, Spark execution, and transformations and actions using the structured API. You will be able to leverage Python, Java, and SQL within the Spark ecosystem. You will start by gaining a firm understanding of the Apache Spark architecture and of how to set up a Python environment for Spark. You will then learn techniques for collecting, cleaning, and visualizing data by creating dashboards in Databricks, and how to use SQL to interact with DataFrames. The author provides an in-depth review of RDDs and contrasts them with DataFrames. Problem challenges are provided at intervals throughout the course so that you get a firm grasp of the concepts taught. The code bundle for this course is available here: https://github.com/PacktPublishing/Apache-Spark-3-for-Data-Engineering-and-Analytics-with-Python-

Introduction to Spark and Installation

Introduction

The Spark Architecture

The Spark Unified Stack

Java Installation

Hadoop Installation

Python Installation

PySpark Installation

Install Microsoft Build Tools

MacOS - Java Installation

MacOS - Python Installation

MacOS - PySpark Installation

MacOS - Testing the Spark Installation

Install Jupyter Notebooks

The Spark Web UI

Section Summary

Spark Execution Concepts

Section Introduction

Spark Application and Session

Spark Transformations and Actions Part 1

Spark Transformations and Actions Part 2

DAG Visualisation

RDD Crash Course

Introduction to RDDs

Data Preparation

Distinct and Filter Transformations

Map and Flat Map Transformations

SortByKey Transformations

RDD Actions

Challenge - Convert Fahrenheit to Centigrade

Challenge - XYZ Research

Challenge - XYZ Research Part 1

Challenge - XYZ Research Part 2

Structured API - Spark DataFrame

Structured APIs Introduction

Preparing the Project Folder

PySpark DataFrame, Schema, and DataTypes

DataFrame Reader and Writer

Challenge Part 1 - Brief

Challenge Part 1 - Data Preparation

Working with Structured Operations

Managing Performance Errors

Reading a JSON File

Columns and Expressions

Filter and Where Conditions

Distinct Drop Duplicates Order By

Rows and Union

Adding, Renaming, and Dropping Columns

Working with Missing or Bad Data

Working with User-Defined Functions

Challenge Part 2 - Brief

Challenge Part 2 - Remove Null Row and Bad Records

Challenge Part 2 - Get the City and State

Challenge Part 2 - Rearrange the Schema

Challenge Part 2 - Write Partitioned DataFrame to Parquet

Aggregations

Aggregations - Setting Up Flight Summary Data

Aggregations - Count and Count Distinct

Aggregations - Min Max Sum SumDistinct AVG

Aggregations with Grouping

Challenge Part 3 - Brief

Challenge Part 3 - Prepare 2019 Data

Challenge Part 3 - Q1 Get the Best Sales Month

Challenge Part 3 - Q2 Get the City that Sold the Most Products

Challenge Part 3 - Q3 When to Advertise

Challenge Part 3 - Q4 Products Bought Together

Introduction to Spark SQL and Databricks

Introduction to Databricks

Spark SQL Introduction

Create a Databricks Cluster

Creating our First 2 Databricks Notebooks

Reading CSV Files into DataFrame

Creating a Database and Table

Inserting Records into a Table

Exposing Bad Records

Figuring out How to Remove Bad Records

Extract the City and State

Inserting Records to Final Sales Table

What was the Best Month in Sales?

Get the City that Sold the Most Products

Get the Right Time to Advertise

Get the Most Products Sold Together

Create a Dashboard

Summary
