Simplify Big Data Analytics with Amazon EMR

By: Sakti Mishra

Overview of this book

Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS. This book is a practical guide to Amazon EMR for building data pipelines. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERTs in an S3 data lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and plan the migration of on-premises Hadoop clusters to EMR. In addition to this, you'll explore best practices and cost optimization techniques for implementing your data analytics pipeline on EMR. By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS.
Table of Contents (19 chapters)

Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR
Section 2: Configuration, Scaling, Data Security, and Governance
Section 3: Implementing Common Use Cases and Best Practices

Decoupling compute and storage

When you set up an EMR cluster for your batch or streaming workloads, you have the option to use the core nodes' HDFS as your primary distributed storage or Amazon S3 as your distributed storage layer. As you know, Amazon S3 provides a highly durable and scalable storage service, and Amazon EMR natively integrates with it.
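To make this concrete, here is a minimal PySpark sketch showing that the same DataFrame code can target either the core nodes' HDFS or Amazon S3 through its s3:// URIs. The bucket name and paths are placeholders for this illustration, not taken from the book:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-choice-demo").getOrCreate()

# Read input from the cluster's HDFS (lives only as long as the cluster) ...
df_hdfs = spark.read.parquet("hdfs:///data/input/orders/")

# ... or from Amazon S3, which survives cluster termination.
df_s3 = spark.read.parquet("s3://my-analytics-bucket/input/orders/")

# Write results to S3 so downstream jobs and other clusters can consume them.
df_s3.write.mode("overwrite").parquet("s3://my-analytics-bucket/curated/orders/")
```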

With Amazon S3 as the cluster's distributed storage, you can decouple compute and storage, which gives you additional flexibility. It enables you to run job-based transient clusters, where S3 acts as the permanent store and the core nodes' HDFS is used only for temporary storage. This way, each job can have its own cluster with the required amount of resources and scaling in place, and you avoid the cost of an always-on cluster.
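As an illustration of such a job-based transient cluster, the following boto3 sketch launches a cluster that runs a single Spark step against data in S3 and terminates itself once the step finishes. The bucket names, script path, release label, region, and IAM role names are assumptions for the example, not prescribed by the book:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="nightly-etl-transient",
    ReleaseLabel="emr-6.5.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-analytics-bucket/emr-logs/",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False => the cluster auto-terminates after its steps finish (transient).
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "spark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-analytics-bucket/scripts/etl_job.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Launched transient cluster:", response["JobFlowId"])
```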

The following diagram represents how multiple transient EMR clusters, each containing various steps, can use S3 as their common persistent storage layer. This can also help with disaster recovery implementations:

Figure 1.3 – Multiple EMR clusters using Amazon S3 as their distributed storage

Now that you understand how EMR provides flexibility to decouple compute and storage, in the next section, you will learn how you can use this feature to create persistent or transient clusters depending on your use case.

Persistent versus transient clusters

Persistent clusters are always active to support multi-tenant workloads or interactive analytics. These clusters can have a constant node capacity or a minimal set of nodes with autoscaling enabled. Autoscaling is an EMR feature where EMR automatically scales the cluster up (adds nodes) or down (removes nodes) based on cluster utilization metrics. In later chapters, we will dive deep into EMR's scaling features and options.
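As a preview of one of those options, here is a hedged boto3 sketch that attaches an EMR managed scaling policy to an existing persistent cluster; the cluster ID and capacity limits are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",        # scale in units of EC2 instances
            "MinimumCapacityUnits": 3,      # keep a minimal baseline capacity
            "MaximumCapacityUnits": 20,     # cap the total scale-out
            "MaximumCoreCapacityUnits": 5,  # growth beyond this goes to task nodes
        }
    },
)
```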

Transient clusters are short-lived, job-based clusters. They are created when data arrives or through scheduled events, perform the data processing, write the output back to the target storage, and then terminate. They also start with a constant set of nodes and then scale to support additional workloads. With transient cluster workloads, Amazon S3 is ideally used as the persistent data store so that, after the cluster terminates, you still have access to the data for additional ETL or business intelligence reporting.

Here is a diagram that represents different kinds of cluster use cases you may have:

Figure 1.4 – EMR architecture representing cluster nodes

As you can see, all three clusters use Amazon S3 as their persistent storage layer, which decouples compute and storage. This lets you scale compute and storage independently: Amazon S3 provides virtually unlimited storage designed for 99.999999999% (11 9s) durability, while the cluster's compute capacity can scale horizontally by adding more core or task nodes.
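For example, you can resize compute without touching the data in S3. The following boto3 sketch grows a cluster's task instance group; the cluster ID and target count are placeholders, and it assumes the cluster actually has a task instance group:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Look up the task instance group of a running cluster.
groups = emr.list_instance_groups(ClusterId="j-XXXXXXXXXXXXX")["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Grow the task fleet; the data stored in S3 is unaffected by this change.
emr.modify_instance_groups(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroups=[{"InstanceGroupId": task_group["Id"], "InstanceCount": 10}],
)
```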

As represented in the diagram, transient clusters can be scheduled jobs or multiple workload-specific clusters running in parallel, each performing ETL on its own dataset with workload-specific cluster capacity.

When you implement transient clusters, a common best practice is to externalize your Hive Metastore, which means that when your cluster is terminated and a new one becomes active, it does not need to recreate the metastore or catalog tables. When externalizing the Hive Metastore of your EMR cluster, you have the option to use an Amazon RDS database or the AWS Glue Data Catalog as your metastore.
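As a hedged illustration of the Glue option, the following configuration classifications (passed at cluster creation, for example through the Configurations parameter of run_job_flow) point Hive and Spark SQL at the AWS Glue Data Catalog instead of a cluster-local metastore; the variable name is only for the example:

```python
# Configuration classifications that tell Hive and Spark SQL on EMR to use
# the AWS Glue Data Catalog as their external Hive Metastore.
glue_catalog_configurations = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
]
```

Because the catalog lives outside the cluster, any new transient cluster launched with these classifications immediately sees the same table definitions that previous clusters created.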