Book Image

Data Engineering with AWS

By : Gareth Eagar
Book Image

Data Engineering with AWS

By: Gareth Eagar

Overview of this book

Written by a Senior Data Architect with over twenty-five years of experience in the business, Data Engineering for AWS is a book whose sole aim is to make you proficient in using the AWS ecosystem. Using a thorough and hands-on approach to data, this book will give aspiring and new data engineers a solid theoretical and practical foundation to succeed with AWS. As you progress, you’ll be taken through the services and the skills you need to architect and implement data pipelines on AWS. You'll begin by reviewing important data engineering concepts and some of the core AWS services that form a part of the data engineer's toolkit. You'll then architect a data pipeline, review raw data sources, transform the data, and learn how the transformed data is used by various data consumers. You’ll also learn about populating data marts and data warehouses along with how a data lakehouse fits into the picture. Later, you'll be introduced to AWS tools for analyzing data, including those for ad-hoc SQL queries and creating visualizations. In the final chapters, you'll understand how the power of machine learning and artificial intelligence can be used to draw new insights from data. By the end of this AWS book, you'll be able to carry out data engineering tasks and implement a data pipeline on AWS independently.
Table of Contents (19 chapters)
1
Section 1: AWS Data Engineering Concepts and Trends
6
Section 2: Architecting and Implementing Data Lakes and Data Lake Houses
13
Section 3: The Bigger Picture: Data Analytics, Data Visualization, and Machine Learning

The rise of big data as a corporate asset

You don't need to look too far or too hard these days to hear about how big data and data analytics are transforming organizations and having an impact on society as a whole. We hear about how companies such as TikTok analyze large quantities of data to make personalized recommendations about which clip to show a user next. Also, we know how Amazon recommends products a customer may be interested in based on their purchase history. We read headlines about how big data could revolutionize the healthcare industry, or how stock pickers turn to big data to find the next breakout stock performer when the markets are down.

The most valuable companies in the US today are companies that are masters of managing huge data assets effectively, with the top five most valuable companies in Q4 2021 being the following:

  • Microsoft
  • Apple
  • Alphabet (Google)
  • Amazon
  • Tesla

For a long time, it was companies that managed natural gas and oil resources, such as ExxonMobil, that were high on the list of the most valuable companies on the US stock exchange. Today, ExxonMobil will often not even make the list of the top 30 companies. It is no wonder that the number of job listings for people with skillsets related to big data is on the rise.

There is also no doubt that data, when harnessed correctly and optimized for maximum analytic value, can be a game-changer for an organization. At the same time, those companies that are unable to effectively utilize their data assets risk losing a competitive advantage to others that do have a comprehensive data strategy and effective analytic and machine learning programs.

Organizations today tend to be in one of the following three states:

  • They have an effective data analytics and machine learning program that differentiates them from their competitors.
  • They are conducting proof of concept projects to evaluate how analytic and machine learning programs can help them achieve a competitive advantage.
  • Their leaders are having sleepless nights worrying about how their competitors are using analytics and machine learning programs to achieve a competitive advantage over them.

No matter where an organization currently is in their data journey, if they have been in existence for a while, they have likely faced a number of common data-related challenges. Let's look at how organizations have typically handled the challenge of ever-growing datasets.