
Exercise: Creating and running jobs on a Dataproc cluster

In this exercise, we will try two different methods of submitting Dataproc jobs. In the previous exercise, we used the Spark shell to run our Spark code, which is fine for practice but uncommon in real development; the Spark shell is usually reserved for initial checks or for testing simple things. This time, we will write our Spark jobs in a code editor and submit them to the cluster as jobs.
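
To give a sense of what submitting a job looks like, here is a minimal sketch using the gcloud CLI; the bucket, script name, cluster name, and region are placeholders rather than the values used later in this exercise:

    # Submit a PySpark script stored in GCS as a Dataproc job
    # (your-bucket, your-dataproc-cluster, and us-central1 are placeholders)
    gcloud dataproc jobs submit pyspark gs://your-bucket/scripts/spark_etl.py \
        --cluster=your-dataproc-cluster \
        --region=us-central1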

Here are the scenarios that we want to try:

  • Preparing log data in GCS and HDFS
  • Developing Spark ETL from HDFS to HDFS
  • Developing Spark ETL from GCS to GCS
  • Developing Spark ETL from GCS to BigQuery

Let's look at each of these scenarios in detail.

Preparing log data in GCS and HDFS

The log data is in our GitHub repository, located here:

https://github.com/PacktPublishing/Data-Engineering-with-Google-Cloud-Platform/tree/main/chapter-5/dataset/logs_example

If you haven't cloned the repository...
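
Once the repository is cloned locally, this preparation step amounts to uploading the log files to a GCS bucket and then copying them into HDFS from the cluster. The following is a rough sketch; the bucket name and HDFS paths are placeholders, and the exact paths used in the exercise may differ:

    # Upload the example logs from the cloned repository to GCS
    # (your-bucket is a placeholder for your own bucket name)
    gsutil -m cp -r Data-Engineering-with-Google-Cloud-Platform/chapter-5/dataset/logs_example \
        gs://your-bucket/chapter-5/dataset/

    # From an SSH session on the Dataproc master node, copy the same files into HDFS;
    # Dataproc's built-in GCS connector lets Hadoop read gs:// paths directly
    hadoop fs -mkdir -p /data
    hadoop fs -cp gs://your-bucket/chapter-5/dataset/logs_example /data/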