Setting up a Hadoop cluster


In this case, assuming that you already have a single-node setup as explained in the previous sections, with SSH enabled, you just need to change all the slave configurations to point to the master. This can be achieved by first creating the slaves file in the $HADOOP_PREFIX/etc/hadoop folder on the master, listing the hostname of each slave. Similarly, on all slaves, you require the masters file in the $HADOOP_PREFIX/etc/hadoop folder to point to your master server's hostname.
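For example (the hostnames used here are only placeholders; substitute the names of your own machines), the slaves file on the master could simply list one slave per line:

slave-node-1
slave-node-2

On each slave, the masters file would then contain the single line master-server.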

Note

While adding new entries for hostnames, ensure that the firewall is disabled so that remote nodes can access the required ports. Alternatively, specific ports can be opened or changed by modifying the Hadoop configuration files. Similarly, all the nodes participating in the cluster should be resolvable through DNS (Domain Name System) or through /etc/hosts entries on Linux.
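For instance, if DNS is not available, the /etc/hosts file on every node could carry entries such as the following (the IP addresses and hostnames are purely illustrative):

192.168.1.10    master-server
192.168.1.11    slave-node-1
192.168.1.12    slave-node-2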

Once this is ready, let us change the configuration files. Open core-site.xml and add the following entry to it:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-server:9000</value>
  </property>
</configuration>

All other configuration is optional. Now, bring up the daemons in the following order. First, you need to format the storage for the cluster; use the following command to do so:

$ $HADOOP_PREFIX/bin/hdfs namenode -format <Name of Cluster>

This formats the NameNode for a new cluster. Once the NameNode is formatted, the next step is to ensure that HDFS is up and connected to each node. Start the NameNode, followed by the DataNodes:

$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode

Similarly, the DataNode can be started on all the slaves:

$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode

Keep track of the log files in the $HADOOP_PREFIX/logs folder to make sure that there are no exceptions. Once HDFS is available, the NameNode can be accessed through its web interface (by default on port 50070 of the master).
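For a quick check from the command line, you can also tail the NameNode log and request an HDFS cluster report; the exact log file name depends on the user and hostname running the daemon:

$ tail -f $HADOOP_PREFIX/logs/hadoop-*-namenode-*.log
$ $HADOOP_PREFIX/bin/hdfs dfsadmin -report

The report should list each DataNode that has successfully registered with the NameNode.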

The next step is to start YARN and its associated daemons. First, start the ResourceManager (RM):

$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh start resourcemanager

Each slave node must run an instance of the NodeManager. To run the NodeManager, use the following command:

$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh start nodemanager

Optionally, you can also run the Job History Server on the Hadoop cluster by using the following command:

$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver
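At this point, you can verify that the expected daemons are running on each node with the jps utility that ships with the JDK:

$ jps

On the master you should typically see NameNode, ResourceManager, and (if started) JobHistoryServer, while each slave should show DataNode and NodeManager; the exact list depends on how you have distributed the roles.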

Once all the instances are up, you can see the status of the cluster through the RM web UI (by default on port 8088 of the master). The complete setup can be tested by running the simple wordcount example.
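As a minimal smoke test (the HDFS paths are illustrative, and the examples JAR name depends on your Hadoop version), you can copy a few files into HDFS and run the bundled wordcount job:

$ $HADOOP_PREFIX/bin/hdfs dfs -mkdir -p /user/hduser/input
$ $HADOOP_PREFIX/bin/hdfs dfs -put $HADOOP_PREFIX/etc/hadoop/*.xml /user/hduser/input
$ $HADOOP_PREFIX/bin/yarn jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hduser/input /user/hduser/output
$ $HADOOP_PREFIX/bin/hdfs dfs -cat /user/hduser/output/part-r-00000

If the job completes and the word counts are printed, HDFS and YARN are both working end to end.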

This way, your cluster is set up and ready to run with multiple nodes. For advanced setup instructions, visit the Apache Hadoop website at http://hadoop.apache.org.