Before launching an EMR cluster, you need to decide which AWS region the cluster will run in and configure your credentials.json file accordingly. As discussed in the initial chapters, the choice of region depends on factors such as your business location and the latency requirements of connecting your existing data center or office to AWS, for example over a virtual private network for secure data transfer.
Another important consideration is choosing the right instance type for your analysis requirements. You also need to size the EMR cluster according to the amount of data to be analyzed and stored in HDFS for processing. One m1.xlarge instance provides 1,680 GB of disk storage, so with an HDFS replication factor of 3, you need at least three core nodes along with one master node to process 1 TB of data. However, your cluster size also depends on the MapReduce job...
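The storage side of this sizing estimate can be sketched in a few lines of Python. This is only a rough illustration, not an official sizing formula: the function name min_core_nodes and its parameters are hypothetical, 1 TB is treated as 1,000 GB, and the sketch assumes that HDFS places each replica of a block on a different node, so the number of core nodes should not be lower than the replication factor. It ignores disk space needed for intermediate job output, logs, and the operating system.

```python
import math


def min_core_nodes(data_gb, disk_per_node_gb=1680, replication=3):
    """Estimate the minimum number of core nodes from storage needs alone.

    Hypothetical helper for illustration; the defaults mirror the
    m1.xlarge figure (1,680 GB) and the replication factor of 3
    used in the text.
    """
    # Each block is stored `replication` times, so the raw disk
    # footprint is the data size multiplied by the replication factor.
    raw_gb = data_gb * replication

    # Nodes needed to hold that raw footprint, rounded up.
    nodes_for_capacity = math.ceil(raw_gb / disk_per_node_gb)

    # Assumption: replicas go on distinct nodes, so we need at least
    # as many core nodes as the replication factor.
    return max(nodes_for_capacity, replication)


if __name__ == "__main__":
    # 1 TB (taken as 1,000 GB) of data on m1.xlarge core nodes.
    print(min_core_nodes(1000))
```

For 1 TB of data, capacity alone would call for two nodes (3,000 GB of raw data across 1,680 GB disks), but the replication factor raises the minimum to three core nodes, which matches the figure given above. The master node is counted separately.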