Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark

Simplify Big Data Analytics with Amazon EMR

By: Sakti Mishra

Overview of this book

Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS. This book is a practical guide to building data pipelines with Amazon EMR. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, deployment options, and pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, and logging, as well as the different SDKs and APIs that EMR provides. Later chapters show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT operations in an S3 data lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and plan the migration of on-premises Hadoop clusters to EMR. Along the way, you'll explore best practices and cost-optimization techniques for implementing data analytics pipelines in EMR. By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based applications on Amazon EMR and migrate your existing on-premises Hadoop workloads to AWS.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Code in Action

Download the color images

Conventions used

Get in touch

Share Your Thoughts

Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR

Chapter 1: An Overview of Amazon EMR

What is Amazon EMR?

Benefits of Amazon EMR

Decoupling compute and storage

Integration with other AWS services

EMR release history

Comparing Amazon EMR with AWS Glue and AWS Glue DataBrew

Summary

Test your knowledge

Further reading

Chapter 2: Exploring the Architecture and Deployment Options

EMR architecture deep dive

Understanding clusters and nodes

Using S3 versus HDFS for cluster storage

Understanding the cluster life cycle

Building Hadoop jobs with dependencies in a specific EMR release version

EMR deployment options

Summary

Test your knowledge

Further reading

Chapter 3: Common Use Cases and Architecture Patterns

Reference architecture for batch ETL workloads

Reference architecture for clickstream analytics

Reference architecture for interactive analytics and ML

Reference architecture for real-time streaming analytics

Reference architecture for genomics data analytics

Reference architecture for log analytics

Summary

Test your knowledge

Further reading

Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR

Technical requirements

Understanding popular big data applications in EMR

Machine learning frameworks available in EMR

Notebook options available in EMR

Summary

Test your knowledge

Further reading

Section 2: Configuration, Scaling, Data Security, and Governance

Chapter 5: Setting Up and Configuring EMR Clusters

Technical requirements

Setting up and configuring clusters with the EMR console's quick create option

Advanced configuration for cluster hardware and software

Working with AMIs and controlling cluster termination

Troubleshooting and logging in your EMR cluster

Summary

Test your knowledge

Further reading

Chapter 6: Monitoring, Scaling, and High Availability

Technical requirements

Monitoring your EMR cluster

Scaling cluster resources

Cluster cloning and high availability with multiple master nodes

Summary

Test your knowledge

Further reading

Chapter 7: Understanding Security in Amazon EMR

Technical requirements

Understanding the basics of security

AWS IAM integration with Amazon EMR

Understanding data protection in EMR

Role of security groups and interface VPC endpoints

Summary

Test your knowledge

Further reading

Chapter 8: Understanding Data Governance in Amazon EMR

Technical requirements

Understanding data catalog and access management options

Understanding Amazon EMR integration with AWS Lake Formation

Understanding Amazon EMR integration with Apache Ranger

Summary

Test your knowledge

Further reading

Section 3: Implementing Common Use Cases and Best Practices

Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark

Technical requirements

Use case and architecture overview

Implementation steps

Validating the output using Amazon Athena

Spark ETL and Lambda function code walk-through

Summary

Test your knowledge

Further reading

Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming

Technical requirements

Use case and architecture overview

Implementation steps

Validating output using Amazon Athena

Spark Streaming code walk-through

Summary

Test your knowledge

Further reading

Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi

Technical requirements

Apache Hudi overview

Creating an EMR cluster and an EMR notebook

Interactive development with Spark and Hudi

Summary

Test your knowledge

Further reading

Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA

Technical requirements

Overview of AWS Step Functions

Integrating AWS Step Functions to orchestrate EMR jobs

Overview of Apache Airflow and MWAA

Integrating Airflow to trigger EMR jobs

Summary

Test your knowledge

Further reading

Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR

Understanding migration approaches

Migrating data and metadata catalogs

Migrating ETL jobs and Oozie workflows

Testing and validation

Best practices for migration

Summary

Test your knowledge

Further reading

Chapter 14: Best Practices and Cost-Optimization Techniques

Best practices around EMR cluster configurations

Optimization techniques for data processing and storage

Security best practices

Cost-optimization techniques

Limitations of Amazon EMR and possible workarounds

Summary

Test your knowledge

Further reading

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts
