HADOOP MULTI NODE CLUSTER

We will build a multi-node cluster using two Ubuntu boxes. First set up two single-node clusters, then merge them into one multi-node cluster in which one Ubuntu box becomes the designated master and the other becomes only a slave.

First, edit ‘/etc/hosts’ on both the master and the slave machine. Enter the IP address and hostname of both machines in each file. This is how the machines in the Hadoop cluster identify one another.

Ø sudo nano /etc/hosts (on both master and slave machines)
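As a rough sketch of what the entries might look like, the snippet below writes a two-line fragment to a scratch file and prints it. The addresses 192.168.0.1 and 192.168.0.2, the hostname slave and the file name hosts.fragment are assumptions for illustration; substitute the values of your own machines.

```shell
# Hypothetical IP addresses and hostnames; replace them with your own.
cat > hosts.fragment <<'EOF'
192.168.0.1    master
192.168.0.2    slave
EOF

# Review the fragment before copying its lines into /etc/hosts on each machine.
cat hosts.fragment
```

The same two lines should appear in ‘/etc/hosts’ on every machine, so that each node resolves the others by name.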

On the master machine (the machine on which ‘bin/start-dfs.sh’ is run becomes the primary NameNode)

On the master machine, update ‘conf/masters’

In Hadoop 1.x, ‘conf/masters’ lists the host on which ‘bin/start-all.sh’ (or ‘bin/start-dfs.sh’) starts the SecondaryNameNode; the NameNode and JobTracker run on whichever machine the start scripts are invoked from. In this two-node setup, enter the hostname of the master machine.

nano /home/hduser/utilities/hadoop-1.0.3/conf/masters

On the master machine, update ‘conf/slaves’. Enter the hostname of every slave machine in this file, one per line. If the master node should also act as a slave, add the master hostname to ‘conf/slaves’ as well.

nano /home/hduser/utilities/hadoop-1.0.3/conf/slaves
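For illustration, the two files could contain something like the following. This sketch writes scratch copies into a directory named conf.example (a hypothetical name) instead of the real ‘conf’ directory, and it assumes the hostnames master and slave from ‘/etc/hosts’.

```shell
# Scratch copies of the two files; the hostnames "master" and "slave" are
# assumed to match the entries in /etc/hosts.
mkdir -p conf.example

# conf/masters: one hostname per line.
printf 'master\n' > conf.example/masters

# conf/slaves: every machine that should run a DataNode and a TaskTracker.
# Listing the master here is optional; it makes the master act as a slave too.
printf 'master\nslave\n' > conf.example/slaves

cat conf.example/masters conf.example/slaves
```

Leave the master line out of the slaves file if the master should not store blocks or run tasks.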

Editing the Configuration files

Edit core-site.xml (ALL machines)

‘hadoop.tmp.dir’: the directory specified by this property is used by the NameNode to store file system metadata and by the DataNode to store block data. By default, two directories named ‘name’ and ‘data’ are created under it, inside a ‘dfs’ subdirectory.

We need to ensure that ‘hduser’ has sufficient permissions on the new ‘hadoop.tmp.dir’. We are configuring it to ‘/home/hduser/app/hadoop/tmp’.

‘fs.default.name’: this required property provides the hostname and port of the NameNode.

Create the directory and give ‘hduser’ ownership and permissions:

Ø sudo mkdir -p /home/hduser/app/hadoop/tmp

Ø sudo chown hduser:hadoop /home/hduser/app/hadoop/tmp

Ø sudo chmod 755 /home/hduser/app/hadoop/tmp

Setting the ownership and permissions is very important. If you forget this step, you will run into exceptions while formatting the NameNode.
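A quick way to catch a permission problem before formatting is to test the directory from the ‘hduser’ account. The helper name check_tmp_dir below is hypothetical, and the path is the ‘hadoop.tmp.dir’ value chosen in this post; treat it as a minimal sketch.

```shell
# Return 0 if the given path is a directory the current user can write to.
check_tmp_dir() {
    [ -d "$1" ] && [ -w "$1" ]
}

# hadoop.tmp.dir as configured in this post; run this as hduser.
TMP_DIR="/home/hduser/app/hadoop/tmp"
if check_tmp_dir "$TMP_DIR"; then
    echo "$TMP_DIR is writable"
else
    echo "$TMP_DIR is missing or not writable by $(id -un)"
fi
```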

Open the core-site.xml file; it contains an empty pair of configuration tags. Add the ‘hadoop.tmp.dir’ and ‘fs.default.name’ properties between them.

Ø nano /home/hduser/utilities/hadoop-1.0.3/conf/core-site.xml (ALL machines)
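A minimal sketch of the resulting file is shown below, written to a scratch directory named conf.example (a hypothetical name) so it can be reviewed before being copied into place. The ‘hadoop.tmp.dir’ value is the one chosen above; the hostname master and the port 54310 in ‘fs.default.name’ are assumptions, so use whatever hostname and free port suit your cluster.

```shell
mkdir -p conf.example

# hadoop.tmp.dir is taken from the text above; the hostname "master" and
# port 54310 in fs.default.name are assumptions.
cat > conf.example/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
</configuration>
EOF

cat conf.example/core-site.xml
```

The file must be identical on all machines, since the slaves use ‘fs.default.name’ to locate the NameNode.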

Edit mapred-site.xml (ALL machines)

In mapred-site.xml, we need to provide the hostname and port of the JobTracker, because the TaskTrackers use this address to communicate with it.

Ø sudo nano /home/hduser/utilities/hadoop-1.0.3/conf/mapred-site.xml
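In Hadoop 1.x the JobTracker address is set through the ‘mapred.job.tracker’ property. The sketch below writes an example file to the scratch directory conf.example (a hypothetical name); the hostname master and the port 54311 are assumptions.

```shell
mkdir -p conf.example

# The hostname "master" and port 54311 are assumptions; pick any free port.
cat > conf.example/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
  </property>
</configuration>
EOF

cat conf.example/mapred-site.xml
```

Note that the value is a plain host:port pair, without the ‘hdfs://’ scheme used in core-site.xml.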

Edit hdfs-site.xml

In the hdfs-site.xml, add the following property between the configuration tags

Ø sudo nano /home/hduser/utilities/hadoop-1.0.3/conf/hdfs-site.xml
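The post does not name the property at this point. A common choice for a small cluster is ‘dfs.replication’, which sets how many copies of each block HDFS keeps, so the sketch below assumes that property and a value of 2 (one copy per node in a two-node cluster). Both the property choice and the value are assumptions, as is the scratch directory conf.example.

```shell
mkdir -p conf.example

# Assumed property and value: dfs.replication = 2, one replica per node
# when both machines run a DataNode.
cat > conf.example/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF

cat conf.example/hdfs-site.xml
```

The replication value should not exceed the number of DataNodes, otherwise blocks are reported as under-replicated.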

Starting the multi-node cluster

1. Starting the HDFS daemons:

The NameNode daemon is started on the ‘master’ machine, and DataNode daemons are started on all slaves. Run the command ‘bin/start-dfs.sh’ on the machine you want the NameNode to run on.

Ø /home/hduser/utilities/hadoop-1.0.3/bin/start-dfs.sh

2. Starting the MapReduce daemons:

The JobTracker daemon is started on the ‘master’ machine, and TaskTracker daemons are started on all slaves. Run the command ‘bin/start-mapred.sh’ on the machine you want the JobTracker to run on.

Ø /home/hduser/utilities/hadoop-1.0.3/bin/start-mapred.sh
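Once both scripts have run, the JDK tool ‘jps’ lists the Java processes on a machine, which gives a quick check that the daemons came up. The helper missing_daemons below is a hypothetical name; it reads jps-style output and prints every expected daemon that is absent. This is a sketch for the master; on a slave, check for DataNode and TaskTracker instead.

```shell
# Print each expected daemon name that does not appear in the jps-style
# output read from stdin (lines of the form "<pid> <name>").
missing_daemons() {
    jps_output="$(cat)"
    for name in "$@"; do
        if ! printf '%s\n' "$jps_output" |
            awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
            echo "$name"
        fi
    done
}

# On the master, these three daemons should be running.
if command -v jps >/dev/null 2>&1; then
    jps | missing_daemons NameNode SecondaryNameNode JobTracker
else
    echo "jps not found; is the JDK on the PATH?"
fi
```

No output from the check means all expected daemons were found.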

Stopping the multi-node cluster

Stopping the cluster follows the reverse order of starting it: first the MapReduce daemons, then the HDFS daemons. To stop the MapReduce daemons, run the command ‘bin/stop-mapred.sh’ on the machine where the JobTracker is running. This stops the JobTracker on the master and the TaskTracker daemons on all slaves.

Ø /home/hduser/utilities/hadoop-1.0.3/bin/stop-mapred.sh

To stop the HDFS daemons, run the command ‘bin/stop-dfs.sh’ on the machine where the NameNode is running.

Ø /home/hduser/utilities/hadoop-1.0.3/bin/stop-dfs.sh
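The shutdown order can be captured in a small wrapper so the two scripts always run in sequence. The function name stop_cluster and the variable HADOOP_DIR are hypothetical; the default path is the install location used in this post. This is only a sketch and is meant to be run on the master.

```shell
# Install location used throughout this post; override HADOOP_DIR if yours differs.
HADOOP_DIR="${HADOOP_DIR:-/home/hduser/utilities/hadoop-1.0.3}"

# Stop MapReduce first, then HDFS (the reverse of the start order).
# HDFS is only stopped if stopping MapReduce succeeded.
stop_cluster() {
    "$HADOOP_DIR/bin/stop-mapred.sh" && "$HADOOP_DIR/bin/stop-dfs.sh"
}

# Only call the wrapper when the scripts are present on this machine.
if [ -x "$HADOOP_DIR/bin/stop-mapred.sh" ] && [ -x "$HADOOP_DIR/bin/stop-dfs.sh" ]; then
    stop_cluster
else
    echo "Hadoop stop scripts not found under $HADOOP_DIR/bin"
fi
```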

The web UI of the HDFS daemons is available at

http://<IP address or hostname of master>:50070

The web UI of the MapReduce daemons is available at

http://<IP address or hostname of master>:50030

Big Data Handling

Wednesday, July 10, 2013