-
Book Overview & Buying
-
Table Of Contents
Simplify Big Data Analytics with Amazon EMR
By :
We have recommended using Amazon S3 as the EMR cluster's persistent storage as it provides better reliability, support for transient clusters, and it is cost-effective. But there are several best practices we can follow while storing the data in Amazon S3 or an HDFS cluster.
Let's understand some of the general best practices that you can follow to get better performance and save costs from a storage and processing perspective.
As part of your cluster storage, there are some general best practices that apply to both Amazon S3 and HDFS cluster storage. The following are a few of the most important ones.
You might be receiving files in CSV, JSON, or as TXT files, but after processing through the ETL process, when you write to a data lake based on S3 or HDFS, you should choose the right file format to get the best performance...