HDFS stores data as data blocks distributed on multiple machines. So, when a large file is put onto HDFS, it will first be split into a number of data blocks. These data blocks are then distributed by the NameNode to the DataNodes in the cluster. The granularity of the data blocks can affect the distribution and parallel execution of the tasks.
Based on the property of the jobs being executed, one block size might result in better performance than others. We will guide you through steps to configure a proper block size for the Hadoop cluster.
We assume that the Hadoop cluster has been properly configured and all the daemons are running without any issues.
Log in from the Hadoop cluster administrator machine to the master node using the following command:
ssh hduser@master