Simplify Big Data Analytics with Amazon EMR

By: Sakti Mishra

Overview of this book

Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS. This book is a practical guide to Amazon EMR for building data pipelines. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters will show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT in an S3 data lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and plan the migration of on-premises Hadoop clusters to EMR. You'll also explore best practices and cost optimization techniques for implementing your data analytics pipeline in EMR. By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and migrate your existing on-premises Hadoop workloads to AWS.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Share Your Thoughts

Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR

Chapter 1: An Overview of Amazon EMR

What is Amazon EMR?

Benefits of Amazon EMR

Decoupling compute and storage

Integration with other AWS services

EMR release history

Comparing Amazon EMR with AWS Glue and AWS Glue DataBrew

Test your knowledge

Further reading

Chapter 2: Exploring the Architecture and Deployment Options

EMR architecture deep dive

Understanding clusters and nodes

Using S3 versus HDFS for cluster storage

Understanding the cluster life cycle

Building Hadoop jobs with dependencies in a specific EMR release version

EMR deployment options

Test your knowledge

Further reading

Chapter 3: Common Use Cases and Architecture Patterns

Reference architecture for batch ETL workloads

Reference architecture for clickstream analytics

Reference architecture for interactive analytics and ML

Reference architecture for real-time streaming analytics

Reference architecture for genomics data analytics

Reference architecture for log analytics

Test your knowledge

Further reading

Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR

Technical requirements

Understanding popular big data applications in EMR

Machine learning frameworks available in EMR

Notebook options available in EMR

Test your knowledge

Further reading

Section 2: Configuration, Scaling, Data Security, and Governance

Chapter 5: Setting Up and Configuring EMR Clusters

Technical requirements

Setting up and configuring clusters with the EMR console's quick create option

Advanced configuration for cluster hardware and software

Working with AMIs and controlling cluster termination

Troubleshooting and logging in your EMR cluster

Test your knowledge

Further reading

Chapter 6: Monitoring, Scaling, and High Availability

Technical requirements

Monitoring your EMR cluster

Scaling cluster resources

Cluster cloning and high availability with multiple master nodes

Test your knowledge

Further reading

Chapter 7: Understanding Security in Amazon EMR

Technical requirements

Understanding the basics of security

AWS IAM integration with Amazon EMR

Understanding data protection in EMR

Role of security groups and interface VPC endpoints

Test your knowledge

Further reading

Chapter 8: Understanding Data Governance in Amazon EMR

Technical requirements

Understanding data catalog and access management options

Understanding Amazon EMR integration with AWS Lake Formation

Understanding Amazon EMR integration with Apache Ranger

Test your knowledge

Further reading

Section 3: Implementing Common Use Cases and Best Practices

Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark

Technical requirements

Use case and architecture overview

Implementation steps

Validating the output using Amazon Athena

Spark ETL and Lambda function code walk-through

Test your knowledge

Further reading

Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming

Technical requirements

Use case and architecture overview

Implementation steps

Validating output using Amazon Athena

Spark Streaming code walk-through

Test your knowledge

Further reading

Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi

Technical requirements

Apache Hudi overview

Creating an EMR cluster and an EMR notebook

Interactive development with Spark and Hudi

Test your knowledge

Further reading

Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA

Technical requirements

Overview of AWS Step Functions

Integrating AWS Step Functions to orchestrate EMR jobs

Overview of Apache Airflow and MWAA

Integrating Airflow to trigger EMR jobs

Test your knowledge

Further reading

Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR

Understanding migration approaches

Migrating data and metadata catalogs

Migrating ETL jobs and Oozie workflows

Testing and validation

Best practices for migration

Test your knowledge

Further reading

Chapter 14: Best Practices and Cost-Optimization Techniques

Best practices around EMR cluster configurations

Optimization techniques for data processing and storage

Security best practices

Cost-optimization techniques

Limitations of Amazon EMR and possible workarounds

Test your knowledge

Further reading

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Test your knowledge

Before moving on to the next chapter, test your knowledge with the following questions:

Assume that your EMR cluster runs some custom applications that interact with AWS services directly instead of executing Hadoop or Spark jobs. Each custom application needs to authenticate itself with AWS IAM to interact with those services and must also have the required privileges. How would you enable your application to authenticate itself with AWS IAM to get temporary credentials for access?
Assume that you are using Amazon S3 as your persistent data store in EMR and your organization has strict security rules to encrypt all the data you store. You have your own custom encryption keys that need to be used to encrypt your data. How would you ensure that EMR uses your custom key to encrypt data at rest?
Assume that you have an EMR notebook that needs to push or pull code from the GitHub repository and you have required IAM...