
Deploying Hadoop on Microsoft Windows


In this section, we will look in detail at how to build and install Hadoop natively on a Windows system. We will be using Windows 8 to install Hadoop, but the same steps can be employed on Windows Server 2008 or Windows 7. We will be installing Hadoop on a 64-bit Windows OS running on 64-bit hardware.

Prerequisites

Installing Hadoop on Windows requires the following platforms, software, and tools installed:

  • Java JDK: Java is the soul of Hadoop, and it must be installed on the machine. Java from Oracle comes with the Java Runtime Environment (JRE) and the Java Development Kit (JDK); the Hadoop installation requires the JDK, which can be obtained from Oracle's website. It is important to choose a JDK version higher than 1.6; here we choose the latest, JDK 1.8. The following screenshot shows the page from where the JDK can be downloaded. In this example, we choose the Windows x64 product; a 32-bit user can choose the Windows x86 product. The download is about 170 MB in size. Once downloaded, it can be installed using the installer in the package. It is very important to choose the JDK matching your processor architecture and OS; otherwise, there could be undesirable results.

  • Setting the Path variable: Next, the Windows Path environment variable has to be set so that command-line tools can pick up the Java executable directly from the path. In the Windows Control Panel, under System Properties, there is an Environment Variables button; clicking on it opens the Environment Variables dialog. An existing variable can be chosen for editing, or a new one can be created. We edit the existing Path variable and add the bin folder of the Java installation; in this example, it is C:\Program Files\Java\jdk1.8.0_20\bin\. It is important to separate the paths using semicolons. The setting can be tested by opening a command prompt and typing java -version, which should print the version of the installed Java software. The following screenshot shows the setting of the Path environment variable:
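    The same change can also be made from the command line; the following is a minimal sketch, assuming the JDK path used in this example (editing the variable through the dialog remains the safer route, since setx rewrites the user-level variable with whatever the current prompt has expanded):

    @rem Append the JDK bin folder to the user-level Path variable
    setx Path "%Path%;C:\Program Files\Java\jdk1.8.0_20\bin"

    @rem Open a new command prompt and verify that Java is picked up
    java -version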

  • Setting the JAVA_HOME environment variable: All Hadoop scripts locate Java by looking up the JAVA_HOME directory, so it is important to set this variable before starting off with Hadoop deployment and usage. Again, we use the Environment Variables dialog, but this time we click on the New button to add a variable. In the following example, we set JAVA_HOME to C:\Progra~1\Java\jdk1.8.0_20. Note that the Program Files folder has been shortened to its 8-character short name, Progra~1, because Hadoop does not handle spaces in paths. Windows understands this short-name scheme, as it is a legacy (8.3 naming) feature.

    The following screenshot shows the actual setting of the JAVA_HOME environment variable:
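    The short name of a folder can be discovered, and JAVA_HOME set, from the command line as well; the following is a minimal sketch, assuming the JDK location used in this example:

    @rem List entries of C:\ with their short (8.3) names; Program Files typically shows as PROGRA~1
    dir /x C:\

    @rem Set JAVA_HOME for the current user (visible in newly opened command prompts)
    setx JAVA_HOME "C:\Progra~1\Java\jdk1.8.0_20"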

  • Downloading Hadoop sources: We download the Hadoop sources from the nearest mirror site. It is important to download the sources, compile them, and then deploy Hadoop for native Windows support; using the prebuilt binaries as-is throws an error when deploying Hadoop on Windows. Support for installing Hadoop directly from binaries may arrive at a later point in time. We choose the latest version of Hadoop to install, Hadoop 2.5.0. As shown in the following screenshot, we download only the source tar file, hadoop-2.5.0-src.tar.gz, which can then be extracted into a local folder. The download is about 15 MB in size.

    In this example, we download and extract the sources to the C:\hdp\hdp directory. Keeping directory names short will save you a lot of pain, as Windows restricts the maximum length of a file path (traditionally 260 characters).
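    The .tar.gz archive can be extracted with any archiver; the following sketch assumes the 7-Zip command-line tool (7z.exe) is installed and on the Path:

    @rem Strip the gzip layer first, then unpack the tar into the short C:\hdp\hdp directory
    7z x hadoop-2.5.0-src.tar.gz
    7z x hadoop-2.5.0-src.tar -oC:\hdp\hdp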

  • Protobuf compiler: Protocol Buffers (protobuf) is a serialization format, and the Hadoop build requires its compiler, protoc, to be available during the build process. The Windows build of the compiler needs to be downloaded; in this example, we choose protoc-2.5.0-win32.zip and download it as shown in the following screenshot. Once it is downloaded and extracted, we use the Environment Variables dialog to add the location of the extracted protoc compiler to the Path.
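    A quick way to confirm that the compiler is reachable from the Path is to query its version, as in the following sketch:

    @rem Should print something like "libprotoc 2.5.0"
    protoc --version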

  • Maven build system: Hadoop is built using the Maven build system, which works off the specifications in the pom.xml file found in the root directory of the Hadoop sources. To install Maven on the Windows machine, go to the Apache Maven project page and download the latest Maven binaries; we choose version 3.2.3. Once downloaded, the ZIP file is extracted and its bin folder is again appended to the Path environment variable for ease of use.

    Download page for Apache Maven 3.2.3
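    Once the Maven bin folder is on the Path, the installation can be verified as shown in the following sketch; mvn -version also reports the Java installation Maven has picked up, which should match JAVA_HOME:

    @rem Prints the Maven version and the Java home in use
    mvn -version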

  • Windows SDK: The next important prerequisite is the Windows SDK. If you are using an x86 machine, it is important to get the x86 build tools; otherwise, you need the x64 build tools. Installing a higher-end SKU of Visual Studio may include all the necessary tools. In this example, we will install Visual Studio Express for C++ 2010, which is a free Visual Studio download. This Visual Studio SKU does not come with the Windows SDK, so we must install the Windows SDK separately; the version installed here is 7.1. To verify that the SDK is installed, you can navigate to C:\Program Files\Microsoft SDKs\Windows on your computer (on x86 machines, it will be present in C:\Program Files (x86)\Microsoft SDKs\Windows). It is also important to include the SDK's bin folder in the Path environment variable so that the build process can pick it up automatically. An alternative to using the Windows build tools is installing CMake, but this requires the user to change a few configurations within the pom.xml files in the Hadoop sources.
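    A simple check of the SDK installation, assuming the default installation path of Windows SDK 7.1, is sketched below:

    @rem Confirm that the SDK folder exists (use "Program Files (x86)" on x86 machines)
    dir "C:\Program Files\Microsoft SDKs\Windows\v7.1"

    @rem Append the SDK tools to the Path for the current session
    set Path=%Path%;C:\Program Files\Microsoft SDKs\Windows\v7.1\Bin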

Building Hadoop

Once all the prerequisites are in place, Hadoop can be built and packaged. To do the build, open the Microsoft Visual Studio command prompt, which sets up some of the necessary environment. Then perform the following steps:

  1. It is important to set the Platform environment variable to x64 or Win32 depending on the Hadoop deployment desired. This can be done using the following command:

    set Platform=x64 
    

    For Win32, use the following command:

    set Platform=Win32
    
  2. It is very important to ensure that the environment variable has the right name. This variable is case sensitive and instructs the Visual Studio project files to use the appropriate build configuration.

  3. The next step is to actually issue the Maven build command. The command mvn package -Pdist,native-win -DskipTests -Dtar is used to start the build. Using a newer JDK can cause some parse issues when generating Javadocs; this can be solved either by using an older JDK, such as 1.7, or by skipping Javadoc generation. The latter is done by adding the -Dmaven.javadoc.skip=true option to the Maven package command (the complete command line is shown after this step).

    The following screenshot shows the end of the build process. The summary of the standard output shows the status of each build step. Once a failure is encountered, the rest of the steps are skipped. It is also important to have the computer connected to the Internet during the build process. Maven automatically downloads dependencies from configured binary repositories during the build process:
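    Putting the preceding settings together, the complete build command used in this example looks like the following sketch; the Javadoc-skipping property is only needed if the build fails at the Javadoc step:

    @rem Run from the root of the extracted Hadoop sources in the Visual Studio command prompt
    set Platform=x64
    mvn package -Pdist,native-win -DskipTests -Dtar -Dmaven.javadoc.skip=true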

  4. The build yields a target directory. Inside the target directory, the Hadoop binaries, samples, and configuration files are bundled into a gzipped TAR file; in this example, a hadoop-2.5.0.tar.gz file is generated. We extract the contents of this file to the C:\hdp\hdp path.
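    A quick sanity check that the native Windows components were built, as sketched below, is to look for winutils.exe and hadoop.dll in the bin folder of the extracted package:

    @rem Both files should be listed if the native-win profile built successfully
    dir C:\hdp\hdp\bin\winutils.exe C:\hdp\hdp\bin\hadoop.dll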

Configuring Hadoop

In this section, we will see the different configuration settings for single-node deployment of Hadoop on Windows:

  1. The hadoop-env.cmd file is present in the etc\hadoop directory at the root of the Hadoop installation; this is the configuration directory for Hadoop. The hadoop-env.cmd script needs to be modified to set the right environment for the Hadoop daemons to execute correctly. The most important configuration is setting the JAVA_HOME environment variable. We also set HADOOP_HOME to the root of the Hadoop installation, that is, the path where we extracted the Hadoop binaries and configuration files. The HADOOP_CONF_DIR and YARN_CONF_DIR environment variables are set to the configuration directories of Hadoop and YARN, respectively; in our example, the YARN configuration directory is the same as the Hadoop configuration directory. We also add the Hadoop bin directory to the Path variable. The following is a sample hadoop-env.cmd script file:

    @rem The java implementation to use.  Required.
    set JAVA_HOME=%JAVA_HOME%
    
    set HADOOP_HOME=c:\hdp\hdp
    
    @rem The jsvc implementation to use. Jsvc is required to run secure datanodes.
    @rem set JSVC_HOME=%JSVC_HOME%
    
    set HADOOP_CONF_DIR=%HADOOP_HOME%\etc\hadoop
    
    set YARN_CONF_DIR=%HADOOP_CONF_DIR%
    set PATH=%PATH%;%HADOOP_HOME%\bin
    
    @rem Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
    if exist %HADOOP_HOME%\contrib\capacity-scheduler (
      if not defined HADOOP_CLASSPATH (
        set HADOOP_CLASSPATH=%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
      ) else (
        set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%;%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
      )
    )
    
    @rem The maximum amount of heap to use, in MB. Default is 1000.
    @rem set HADOOP_HEAPSIZE=
    @rem set HADOOP_NAMENODE_INIT_HEAPSIZE=""
    
    @rem Extra Java runtime options.  Empty by default.
    @rem set HADOOP_OPTS=%HADOOP_OPTS% -Djava.net.preferIPv4Stack=true
    
    @rem Command specific options appended to HADOOP_OPTS when specified
    if not defined HADOOP_SECURITY_LOGGER (
      set HADOOP_SECURITY_LOGGER=INFO,RFAS
    )
    if not defined HDFS_AUDIT_LOGGER (
      set HDFS_AUDIT_LOGGER=INFO,NullAppender
    )
    
    set HADOOP_NAMENODE_OPTS=-Dhadoop.security.logger=%HADOOP_SECURITY_LOGGER% -Dhdfs.audit.logger=%HDFS_AUDIT_LOGGER% %HADOOP_NAMENODE_OPTS%
    set HADOOP_DATANODE_OPTS=-Dhadoop.security.logger=ERROR,RFAS %HADOOP_DATANODE_OPTS%
    set HADOOP_SECONDARYNAMENODE_OPTS=-Dhadoop.security.logger=%HADOOP_SECURITY_LOGGER% -Dhdfs.audit.logger=%HDFS_AUDIT_LOGGER% %HADOOP_SECONDARYNAMENODE_OPTS%
    
    @rem The following applies to multiple commands (fs, dfs, fsck, distcp etc)
    set HADOOP_CLIENT_OPTS=-Xmx512m %HADOOP_CLIENT_OPTS%
    @rem set HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData %HADOOP_JAVA_PLATFORM_OPTS%"
    
    @rem On secure datanodes, user to run the datanode as after dropping privileges
    set HADOOP_SECURE_DN_USER=%HADOOP_SECURE_DN_USER%
    
    @rem Where log files are stored.  %HADOOP_HOME%/logs by default.
    @rem set HADOOP_LOG_DIR=%HADOOP_LOG_DIR%\%USERNAME%
    
    @rem Where log files are stored in the secure data environment.
    set HADOOP_SECURE_DN_LOG_DIR=%HADOOP_LOG_DIR%\%HADOOP_HDFS_USER%
    
    @rem The directory where pid files are stored. /tmp by default.
    @rem NOTE: this should be set to a directory that can only be written to by 
    @rem       the user that will run the hadoop daemons. Otherwise there is the
    @rem       potential for a symlink attack.
    set HADOOP_PID_DIR=%HADOOP_PID_DIR%
    set HADOOP_SECURE_DN_PID_DIR=%HADOOP_PID_DIR%
    
    @rem A string representing this instance of hadoop. %USERNAME% by default.
    set HADOOP_IDENT_STRING=%USERNAME%
    
  2. Next, we configure the core-site.xml file. The most important configuration is setting the fs.default.name property to the HDFS NameNode host and port. Since ours is a single-node deployment, the NameNode runs on the local machine and listens on port 19000; the value hdfs://0.0.0.0:19000 binds it to all interfaces on that port. The following configuration snippet illustrates this setting:

    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://0.0.0.0:19000</value>
        </property>
    </configuration>
  3. We then configure the hdfs-site.xml file. Here we set the replication factor to 1 as we are doing a single-node deployment of Hadoop. The following configuration snippet illustrates this setting:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>   
        </property> 
    </configuration>
  4. The mapred-site.xml file needs to be configured and pointed to YARN in Hadoop 2.X. The %USERNAME% element can be replaced by the username of the entity submitting the jobs. The following configuration snippet illustrates a sample mapred-site.xml file. If the file is not present, it can be copied from the mapred-site.xml.template file present in the configuration directory:

    <configuration>
        <property>
            <name>mapreduce.job.user.name</name>
            <value>%USERNAME%</value>
        </property>     
        <property>      
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>yarn.apps.stagingDir</name>
            <value>/user/%USERNAME%/staging</value>
        </property>
        <property>
            <name>mapreduce.jobtracker.address</name>
            <value>local</value>
        </property>  
    </configuration>
  5. The yarn-site.xml file is configured with the settings for the ResourceManager and NodeManager daemons. The configurations include setting the daemon endpoints and the log directories, and specifying the shuffle handlers. The following configuration snippet illustrates a sample configuration for the YARN daemons (a quick verification sketch follows the snippet):

    <configuration>
        <property>
            <name>yarn.server.resourcemanager.address</name>
            <value>0.0.0.0:8020</value>
        </property>
        <property>
            <name>yarn.server.resourcemanager.application.expiry.interval</name>
            <value>60000</value>
        </property>
        <property>
            <name>yarn.server.nodemanager.address</name>
            <value>0.0.0.0:45454</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
            <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        <property>
            <name>yarn.server.nodemanager.remote-app-log-dir</name>
            <value>/app-logs</value>
        </property>
        <property>
            <name>yarn.nodemanager.log-dirs</name>
            <value>/dep/logs/userlogs</value>
        </property>
        <property>
            <name>yarn.server.mapreduce-appmanager.attempt-listener.bindAddress</name>
            <value>0.0.0.0</value>
        </property>
        <property>
            <name>yarn.server.mapreduce-appmanager.client-service.bindAddress</name>
            <value>0.0.0.0</value>
        </property>
        <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
        </property>
        <property>
            <name>yarn.log-aggregation.retain-seconds</name>
            <value>-1</value>
        </property>
        <property>
            <name>yarn.application.classpath</name>
            <value>%HADOOP_CONF_DIR%,%HADOOP_COMMON_HOME%/share/hadoop/common/*,%HADOOP_COMMON_HOME%/share/hadoop/common/lib/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/*,%HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/*,%HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/lib/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/*,%HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*</value>
        </property>
    </configuration>
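    With hadoop-env.cmd and the XML files in place, a quick way to check that the environment and the configuration directory resolve correctly is sketched below; the hadoop classpath command prints the classpath that the daemons will use:

    @rem Run from a fresh command prompt so that the environment variables take effect
    cd /d C:\hdp\hdp
    bin\hadoop classpath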

Deploying Hadoop

Once the configurations are complete, it is time to start the Hadoop daemons. This is done by performing the following steps:

  1. Before starting the daemons, we can format the NameNode by issuing the following command:

    hdfs namenode -format
    

    The following screenshot shows the output of the format command. HDFS is now formatted and ready to use. Since we have not specified a particular directory, the NameNode uses the C:\tmp directory to store all of its metadata.

  2. We then start the HDFS daemons, the NameNode and the DataNode, by issuing start-dfs.cmd. This command script is present in the %HADOOP_HOME%\sbin folder. The Windows Firewall may pop up a notification asking the user to allow the daemons to open a listening port. It is important to allow the firewall to be reconfigured so that the DataNode and NameNode can communicate with each other. The following screenshot shows the Windows Firewall screen offering to allow access:

    Once access has been granted, the NameNode and DataNode start in two separate command windows, as shown in the following screenshot. The standard output of each HDFS operation can be examined in these two windows. Once both daemons are up and running, HDFS can be validated by issuing filesystem commands against it; you might have to create the user directories with the mkdir command first. A short sketch of these commands follows this step.

    The NameNode and DataNode command windows
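    The following is a minimal sketch of starting HDFS and validating it from the command line; the user directory name is only an example:

    @rem Start the NameNode and DataNode (run from the Hadoop installation root)
    sbin\start-dfs.cmd

    @rem Create a home directory for the current user and list the HDFS root
    bin\hdfs dfs -mkdir -p /user/%USERNAME%
    bin\hdfs dfs -ls /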

  3. Next, we start YARN so that MapReduce jobs can be run. This is done by running the start-yarn.cmd script present in the sbin folder. Again, the ResourceManager and the NodeManager start in two separate command windows, as illustrated in the following screenshot, and their standard output can be examined to trace the ResourceManager and NodeManager activity. A sketch of starting YARN and submitting a test job follows this step.

    The ResourceManager and NodeManager command windows
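    The following sketch starts YARN and, assuming the bundled examples JAR produced by the 2.5.0 build is present under the share directory, submits a small sample job to confirm that MapReduce on YARN works end to end:

    @rem Start the ResourceManager and NodeManager (run from the Hadoop installation root)
    sbin\start-yarn.cmd

    @rem Submit the pi estimator as a smoke test (JAR name assumed from the 2.5.0 build)
    bin\yarn jar share\hadoop\mapreduce\hadoop-mapreduce-examples-2.5.0.jar pi 2 10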

  4. By navigating to localhost:50070 in a browser, the user should now be able to see the web endpoint for HDFS. The home page, shown in the following screenshot, gives an overview of the health of HDFS and the different parameters that were used to configure it.

  5. Selecting the Datanodes link on the top bar gives the different DataNodes present in HDFS and the health of each DataNode, as shown in the following screenshot:

  6. The startup progress link in the top bar shows the health of HDFS during startup, including statistics about the fsimage and edits files processed by the NameNode. It also indicates whether HDFS went into safe mode.

  7. The utilities link gives two options: one to browse HDFS and the other to view the logfiles. The browse functionality is based on a search box that can be used to search the HDFS directory structure. Each listing for a file is similar to executing the hdfs dfs -ls command on the directory. It also gives statistics about the block size and a deep link to peek into the contents of the file.