Simplify Big Data Analytics with Amazon EMR

By: Sakti Mishra

Overview of this book

Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS. This book is a practical guide to building data pipelines with Amazon EMR. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters will show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT operations in an S3 data lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and plan the migration of on-premises Hadoop clusters to EMR. You'll also explore best practices and cost optimization techniques for implementing data analytics pipelines in EMR. By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and migrate your existing on-premises Hadoop workloads to AWS.

Table of Contents (19 chapters)

Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR
Section 2: Configuration, Scaling, Data Security, and Governance
Section 3: Implementing Common Use Cases and Best Practices

Preface

As the usage of internet-related services, computers, and smart products has increased, the amount of data they produce has grown exponentially. This data is extremely valuable for addressing business problems: by analyzing it, you can derive insights that support faster decision making and help forecast business growth.

These datasets are so large and complex that traditional data processing technologies can't handle them efficiently, which is why distributed processing frameworks such as Hadoop and Spark emerged. Amazon Elastic MapReduce (EMR) provides a managed offering for Hadoop ecosystem services, so that businesses can focus on building analytics pipelines instead of spending time managing infrastructure. This makes Amazon EMR a leading choice for Hadoop, Spark, and big data workloads.

As the amount of data continues to grow, big data analytics will become a common skill that everybody needs in order to succeed in their career or business. Before EMR, it was expensive to try out Hadoop or Spark workloads because they required setting up clusters of servers. With Amazon EMR's pay-as-you-go model, you can spin up small clusters quickly, scale them as needed, and terminate them when the job finishes.

Organizations that want to get started with Amazon EMR or are planning to migrate existing Hadoop workloads to EMR, as well as recent college graduates who want to upskill in EMR, will find this book very useful for diving deep into different EMR features and architecture patterns.

While writing this book, I kept in mind that it should be useful both to beginners and to technologists who want to learn advanced EMR concepts. I also expect you to have some basic knowledge of AWS and Hadoop, so that you can follow the material and move on to the advanced concepts more easily.

By the end of this book, you will be able to comfortably architect and implement Hadoop-/Spark-based solutions with transient (job-based) or persistent (multi-tenant/long-running) EMR clusters. In addition, you will be able to understand how a complete end-to-end data analytics solution can be implemented with Amazon EMR for batch, real-time streaming, or interactive workloads. You will also gain knowledge about migration approaches, best practices, and cost optimization techniques that you can follow while implementing big data analytics workloads with EMR.