Data Engineering with Google Cloud Platform

By : Adi Wijaya

Overview of this book

With this book, you'll understand how the highly scalable Google Cloud Platform (GCP) enables data engineers to create end-to-end data pipelines, from storing and processing data, through workflow orchestration, to presenting data through visualization dashboards. Starting with a quick overview of the fundamental concepts of data engineering, you'll learn the various responsibilities of a data engineer and how GCP plays a vital role in fulfilling those responsibilities. As you progress through the chapters, you'll be able to leverage GCP products to build a sample data warehouse using Cloud Storage and BigQuery and a data lake using Dataproc. The book gradually takes you through operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. You'll learn how to design IAM for data governance, deploy ML pipelines with Vertex AI, leverage pre-built GCP models as a service, and visualize data with Google Data Studio to build compelling reports. Finally, you'll find tips on how to boost your career as a data engineer, take the Professional Data Engineer certification exam, and get ready to become an expert in data engineering with GCP. By the end of this data engineering book, you'll have developed the skills to perform core data engineering tasks and build efficient ETL data pipelines with GCP.
Table of Contents (17 chapters)

  • Section 1: Getting Started with Data Engineering with GCP
  • Section 2: Building Solutions with GCP Components
  • Section 3: Key Strategies for Architecting Top-Notch Data Pipelines

Knowing the roles of a data engineer before starting

In the later chapters, we will spend much of our time doing practical exercises to understand the data engineering concepts. But before that, let's quickly take a look at the data engineer role. 

The role is becoming more and more popular, but the term itself is relatively new compared to well-established job roles such as accountant, lawyer, and doctor. As a result, there is still sometimes debate about what a data engineer should and shouldn't do.

For example, if you went to a hospital and met a doctor, you would know for sure that the doctor would do the following:

  1. Examine your condition.
  2. Make a diagnosis of your health issues.
  3. Prescribe medicine. 

The doctor wouldn't do the following:

  1. Clean the hospital.
  2. Make the medicine.
  3. Manage hospital administration.

It's clear, and it applies to most well-established job roles. But how about data engineers?

This is just a very short list of examples of what data engineers should or shouldn't be responsible for:

  • Handle all big data infrastructures and software installation.
  • Handle application databases.
  • Design the data warehouse data model.
  • Analyze big data to transform raw data into meaningful information.
  • Create a data pipeline for machine learning.

This lack of clarity is unavoidable since the role is new, and I believe it will become more established as data science matures. In this section, let's try to understand what a data engineer is and, despite the many combinations of responsibilities, what you should focus on as a data engineer.

Data engineer versus data scientist

A data engineer is someone who designs and builds data pipelines. 

The definition is that simple, but I have found that the difference between a data engineer and a data scientist is still one of the most frequently asked questions when someone wants to start their data career. The hype around data scientists on the internet is one of the drivers; for example, to this day people still like to quote the following:

"Data scientist: the sexiest job of the 21st Century"

– Harvard Business Review

The data scientist role was originally invented back in 2008 to refer to groups of people who are highly curious and able to utilize big data technologies for business purposes. But as the technologies matured and became more complex, people started to realize that this is too much for a single role. It's very rare for a company to hire someone who knows how to do all of the following:

  • Handle big data infrastructure
  • Properly design and build ETL pipelines
  • Train machine learning models
  • Deeply understand the company's business

Not that it's impossible; some people do have all of this knowledge. But from a company's point of view, depending on finding such people is not practical.

These days, for better focus and scalability, the data scientist role is often split into several different roles, for example, data analyst, machine learning engineer, and business analyst. But one of the most popular of these roles, and one now recognized as very important, is the data engineer.

The focus of data engineers

Let's map the data engineer role to our data life cycle diagram (Figure 1.5) from the previous section.

In the diagram, I added two underlying components:

  • Job Orchestrator: Design and build a job dependency and scheduler that runs data movement from upstream to downstream.
  • Infrastructure: Provision the required data infrastructure to run the data pipelines.
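To make the Job Orchestrator idea concrete, here is a minimal sketch of ordering jobs by their dependencies so that upstream jobs always run before downstream ones. The job names and the `run_order` and `run_all` helpers are hypothetical, chosen only for illustration. The sketch uses Python's standard `graphlib` module (Python 3.9+) and is not a full scheduler like the orchestration tools you would use in practice.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each job maps to the set of upstream jobs it depends on.
JOB_DEPENDENCIES = {
    "export_app_db": set(),
    "load_to_data_lake": {"export_app_db"},
    "load_to_data_warehouse": {"load_to_data_lake"},
    "build_data_mart": {"load_to_data_warehouse"},
}


def run_order(dependencies):
    """Return job names ordered so every upstream job precedes its downstream jobs."""
    return list(TopologicalSorter(dependencies).static_order())


def run_all(dependencies, runner):
    """Call runner(job_name) once per job, upstream first."""
    for job in run_order(dependencies):
        runner(job)


if __name__ == "__main__":
    # The runner here only prints; a real one would trigger the actual data movement.
    run_all(JOB_DEPENDENCIES, lambda job: print(f"running {job}"))
```

A real orchestrator adds scheduling, retries, and monitoring on top of this, but the core idea is still a dependency graph that is walked from upstream to downstream.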

On each step, I added a number from 1 to 3. The numbers will help you identify which components are the data engineer's main responsibility. This diagram works together with Figure 1.7, a data engineer-focused diagram that maps the numbering to focus areas. First, let's check the data life cycle diagram that we discussed before, now with the numbering on it:

Figure 1.6 – Data life cycle flows with focus numbering


After seeing the numbering on the data life cycle, check this diagram that illustrates the focus points of a data engineer:

Figure 1.7 – Data engineer-focused diagram


The diagram shows how the knowledge areas of the end-to-end data life cycle are distributed. At the center of the diagram (number 3) are the jobs that are the key focus of data engineers; I will call this area the core.

Those numbered 2 are the good to have area. For example, it's still common in small organizations that data engineers need to build a data mart for business users. 

Important Note

Designing and building a data mart is not as simple as creating tables in a database. Someone who builds a data mart needs to be able to talk to business people and gather requirements to serve tables from a business perspective, which is one of the reasons it's not part of the core.

While collecting data into a data lake is part of the data engineer's responsibility, exporting data from operational application databases is often done by the application development team, for example, by dumping MySQL tables as CSV files into staging storage.
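As a rough sketch of what such an export might look like, the following dumps one table to a CSV file in a staging directory. It makes several assumptions: SQLite from Python's standard library stands in for MySQL so the example is self-contained, a local directory stands in for staging storage, and the `dump_table_to_csv` helper and the table name passed to it are hypothetical.

```python
import csv
import sqlite3
from pathlib import Path


def dump_table_to_csv(conn, table, staging_dir):
    """Write every row of `table` to <staging_dir>/<table>.csv with a header row."""
    # Table names cannot be bound as query parameters, so accept plain identifiers only.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")

    staging_dir = Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    out_path = staging_dir / f"{table}.csv"

    cursor = conn.execute(f"SELECT * FROM {table}")
    header = [column[0] for column in cursor.description]

    with out_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)   # column names first
        writer.writerows(cursor)  # then one CSV line per table row
    return out_path
```

With a real MySQL database, the connection would come from a MySQL client library instead of `sqlite3`, and the file would typically be uploaded to a storage bucket rather than left on local disk.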

Those numbered 1 are the good to know area. For example, it's rare that a data engineer needs to be responsible for building application databases, developing machine learning models, maintaining infrastructure, or creating dashboards. It is possible, but less likely. These disciplines need knowledge that is a little too far from the core.

After learning about the three focus areas, let's now reflect on our understanding and vision of data engineers. Study the diagram carefully and answer these questions:

  • What are your current focus areas as an individual?
  • What are the focus areas of your current job role (or, if you are a student, your study areas)?
  • What is your future goal in the data science world?

Depending on your individual answers, check with the diagram: do you have all the necessary skills at the core? Does your current job give you experience in the core? Would you be excited to master all the subjects at the core in the near future?

From my experience, what is important to data engineers is the core. Even though there is a wide variety of expectations, responsibilities, and job descriptions for data engineers in the market, if you are new to the role, then the most important thing is to understand what the core of a data engineer is.

The diagram gives you guidance on what type of data engineer you are or will be. The closer you are to the core, the more of a data engineer you are: you are on the right track and in the right environment to be a good data engineer.

If you are at the core and also cover other areas beside it, then you are closer to a full-stack data expert. As long as you have a strong core, expanding your expertise to the good to have and good to know areas will give you a good advantage in your data engineering career. But if you currently focus on non-core areas, I suggest you find a way to master the core first.

In this section, we learned about the role of a data engineer. If you are not familiar with the core, the next section will guide you through the fundamental concepts in data engineering.