Hadoop Single Node Installation
Basic Prerequisites
This section lists the required software and the system configuration needed before installation.
Required software
· Java JDK
As of now, the recommended and tested Java versions for Hadoop and HBase installation are Oracle JDK 1.6 (u20, u21, u26, u28, u31). Hadoop requires Java 1.6 or later; it is built and tested on Oracle Java, which is the only “supported” JVM.
· The latest stable version of Hadoop 1.x.x (here we are using the current stable release, Hadoop-1.0.3)
· The latest stable version of HBase 0.9x.x (here we are using the current stable release, HBase-0.94.x)
Notes:
For HBase development, selecting the Hadoop version is critical: the Hadoop version must be compatible with the HBase version. The following table shows which versions of Hadoop are supported by the various HBase versions. Based on the HBase version you intend to use, select the most appropriate version of Hadoop.
                   | HBase-0.92.x | HBase-0.94.x | HBase-0.95
Hadoop-0.20.205    |      S       |      X       |     X
Hadoop-0.22.x      |      S       |      X       |     X
Hadoop-1.0.0-1.0.2 |      S       |      S       |     X
Hadoop-1.0.3+      |      S       |      S       |     S
Hadoop-1.1.x       |      NT      |      S       |     S
Hadoop-0.23.x      |      X       |      S       |     NT
Hadoop-2.x         |      X       |      S       |     S
Note that HBase 0.95 requires Hadoop 1.0.3 at a minimum.
Where:
S = supported and tested
X = not supported
NT = should run, but not tested enough
Installing and configuring the Java JDK
Step 1:
Before installing Hadoop, we have to install Java; it is recommended to use Oracle Java 1.6. To check whether Java is already available, use the following Linux command:
java -version
This will show the installed Java version, if Java is already installed. If it is OpenJDK, remove it and install the Oracle JDK.
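If OpenJDK is installed, one way to remove it (a minimal sketch assuming an Ubuntu/Debian system with OpenJDK 6; package names will differ on other distributions and versions) is:
sudo apt-get remove openjdk-6-jdk openjdk-6-jre
On Red Hat/CentOS/Fedora systems the equivalent would be something like:
sudo yum remove java-1.6.0-openjdk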
Step 2:
Download a stable version of Java from the versions listed above. The downloaded file can be a .bin or a .tar file.
1. For installing the .bin file
Go to the directory containing the binary file:
sudo chmod u+x <filename>.bin
./<filename>.bin
2. For installing the .tar file
Go to the directory containing the archive and extract it:
sudo tar -xvf <filename>.tar
(for a gzip-compressed .tar.gz archive, use tar -xzvf instead)
Step 3:
Set JAVA_HOME in the /etc/bash.bashrc file. We can use the nano or vi editor to edit the file:
nano /etc/bash.bashrc
Add the following lines towards the end of the file. If JAVA_HOME is already set for OpenJDK, replace it with the following lines:
#set the JAVA_HOME
export JAVA_HOME=<path from root to that java directory>
export PATH=$JAVA_HOME/bin:$PATH
In nano, press Ctrl+O to save the changes and Ctrl+X to exit.
Note: You can instead set JAVA_HOME in the user’s home directory ($HOME/.bashrc file); the disadvantage of doing so is that JAVA_HOME will be available only for that user.
To reload the ‘bash.bashrc’ file, use the source command:
source /etc/bash.bashrc
Note:
Normally these changes take effect only after the shell is restarted or the system is rebooted. On a running cluster, rebooting a virtual machine is disruptive and unsaved state would be lost, so to avoid this we reload the environment in place using the ‘source’ command.
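To confirm the variables were picked up, you can check them (these commands only read the environment and print the version):
echo $JAVA_HOME
java -version
The echo should print the path you set; java -version reports whichever JVM /usr/bin/java currently points to, and Step 4 below switches it to the Oracle JDK if it still shows OpenJDK.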
Step 4: Switching from OpenJDK to the Oracle JDK
Now close the terminal, re-open it and check whether the Java installation and path work as desired.
Alternatively, register the Oracle JDK with the alternatives system:
sudo update-alternatives --install /usr/bin/java java <path from root to that java directory>/bin/java 2
sudo update-alternatives --config java
Then select the number corresponding to the installed Oracle JDK (here the number is 2).
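For example, if the Oracle JDK were unpacked to /usr/lib/jvm/jdk1.6.0_31 (a hypothetical path used only for illustration; substitute your own), the install line would read:
sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.6.0_31/bin/java 2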
Adding a dedicated user for running Hadoop services
For running the Hadoop daemons, we create a dedicated user rather than executing Hadoop as root. This is recommended because it isolates the other software, services and users on the same machine from the Hadoop installation.
We are creating a user ‘hduser’ in
group ‘hadoop’.
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
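To confirm the user and group were created as intended (just a quick check, not part of the original steps), run:
id hduser
The output should list ‘hadoop’ as the user’s group.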
Adding the newly created user to the sudoers list
For the Hadoop installation, the newly created user needs some extra privileges beyond those of a normal user. To give it these root privileges, we add the newly created user to the sudoers configuration.
To add ‘hduser’, open the /etc/sudoers file using the nano text editor:
sudo nano /etc/sudoers
Add the following line to the file:
hduser ALL=(ALL) ALL
Save with Ctrl+O and exit with Ctrl+X. This gives ‘hduser’ root privileges via sudo.
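You can verify the new privileges by switching to the user and running a harmless command through sudo:
su - hduser
sudo whoami
The second command should print ‘root’ after hduser’s password is entered.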
Configuring passwordless SSH
Hadoop uses SSH to communicate with its nodes, and we do not want to enter a password every time it does so. A passwordless key pair therefore needs to be created and installed so that the SSH communication works without user intervention. As the ‘hduser’ user, run:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
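Assuming an SSH server is running locally, you can verify that the key works (the very first connection may still ask you to confirm the host key):
chmod 600 ~/.ssh/authorized_keys
ssh localhost
exit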
Changing the Hostname of a Linux machine Without Rebooting
When we use virtual machines, Hadoop can be addressed by either the IP address or the hostname of each node. When a virtual machine is rebooted, its IP address may change, and that would badly affect the cluster. To avoid this, we use only hostnames, which is why we may need to change the hostname of a Linux machine, preferably without rebooting it.
Step 1: Change the hostname
To change the hostname of a Linux system, first edit the configuration file that controls it (see the appropriate file for your distribution below), then apply the change with the hostname command shown after the list.
· In Red Hat/CentOS/Fedora systems, edit the hostname in /etc/sysconfig/network:
nano /etc/sysconfig/network
HOSTNAME=<hostname of the system>
· In Ubuntu/Debian systems, edit the hostname in /etc/hostname:
nano /etc/hostname
Delete the old name and add the new one.
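Editing the file alone does not change the hostname of the running system; to apply it immediately without a reboot, also run the hostname command (substitute your new name):
sudo hostname <new hostname>
hostname
The second command prints the hostname currently in effect.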
Step 2: Update /etc/hosts
Now you need to edit the /etc/hosts file:
nano /etc/hosts
In the /etc/hosts file, map the machine’s IP address to its hostname, so that the name resolves correctly whether the hostname or the IP address is used.
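A minimal example entry, using a purely hypothetical IP address and hostname (replace both with your machine’s actual values):
192.168.1.100   hadoopnode1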
Single Node Hadoop Installation
Step 1: Extracting Hadoop tarball
We are creating a directory to hold the Hadoop installation. Here we are using ‘/home/hduser/utilities’. You need to extract the tarball into this location and change the ownership recursively on the extracted directory.
Here we are using hadoop-1.0.3.tar.gz.
mkdir -p /home/hduser/utilities
cd /home/hduser/utilities
sudo tar -xzvf hadoop-1.0.3.tar.gz
sudo chown -R hduser:hadoop hadoop-1.0.3
Step 2: Configuring the Hadoop environment variables
We are adding HADOOP_HOME as an environment variable in the /etc/bash.bashrc file. By doing this, the Hadoop commands become accessible to every user.
sudo nano /etc/bash.bashrc
Append the following lines to set HADOOP_HOME and add it to PATH:
#set HADOOP_HOME
export HADOOP_HOME=/home/hduser/utilities/hadoop-1.0.3
export PATH=$HADOOP_HOME/bin:$PATH
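After saving the file, you can reload it and confirm that the hadoop command is found on the PATH:
source /etc/bash.bashrc
hadoop version
The second command should print the installed Hadoop version (1.0.3 here).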
Step 3: Configuring Java for Hadoop
sudo nano /home/hduser/utilities/hadoop-1.0.3/conf/hadoop-env.sh
JAVA_HOME is commented out by default. Uncomment the line and set its value to your Java installation path. The path should point to the JDK directory itself, not to its bin subdirectory.
#The Java implementation to use
export JAVA_HOME=<absolute path to java directory>
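For instance, if the JDK were installed under /usr/lib/jvm/jdk1.6.0_31 (a hypothetical path; use the directory from your own Java installation), the line would become:
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_31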
Step 4: Configuring Hadoop Properties
In Hadoop, we have three configuration files, core-site.xml, mapred-site.xml and hdfs-site.xml, in the HADOOP_HOME/conf directory.
Editing the Configuration files
1. core-site.xml
The directory specified by the ‘hadoop.tmp.dir’ property is used to store the file system metadata of the NameNode and the block data of the DataNode. By default, directories for the NameNode (‘name’) and DataNode (‘data’) data are created under this directory.
We need to ensure that ‘hduser’ has sufficient permissions on the directory given as ‘hadoop.tmp.dir’. We are configuring it to ‘/home/hduser/utilities/app/hadoop/tmp’.
The property ‘fs.default.name’ provides the hostname and port of the NameNode.
Create the directory and change its ownership and permissions so that ‘hduser’ owns it:
cd /home/hduser/utilities
sudo mkdir -p app/hadoop/tmp
sudo chown hduser:hadoop app/hadoop/tmp
sudo chmod 755 app/hadoop/tmp
Setting the ownership and permissions is very important. If you forget this, you will run into exceptions while formatting the NameNode.
Open the core-site.xml file; you will see empty configuration tags. Add the following lines between the configuration tags:
sudo nano /home/hduser/utilities/hadoop-1.0.3/conf/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/utilities/app/hadoop/tmp</value>
<description>
A base for other temporary directories.
</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://<hostname or IP address of the system where the namenode is installed>:54310</value>
<description>The name of the default file system</description>
</property>
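As an illustration, if the machine’s hostname were hadoopnode1 (a hypothetical name; use the hostname you set earlier), the value element would read:
<value>hdfs://hadoopnode1:54310</value>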
2. hdfs-site.xml
This file holds the HDFS (storage) settings. In hdfs-site.xml, add the following property between the configuration tags:
sudo nano /home/hduser/utilities/hadoop-1.0.3/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication</description>
</property>
3. mapred-site.xml
This file holds the MapReduce (processing) settings. In mapred-site.xml, we need to provide the hostname and port of the JobTracker, since the TaskTrackers use it for their communication.
sudo nano /home/hduser/utilities/hadoop-1.0.3/conf/mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value><hostname or IP address of the system where the jobtracker is installed>:54311</value>
<description>
The host and port that the MapReduce job tracker runs at
</description>
</property>
Step 5: Formatting the NameNode
Before starting the HDFS daemons (such as the NameNode) for the first time, it is mandatory to format the NameNode/HDFS. This is needed only for the first run; formatting the NameNode again on subsequent runs will destroy all data. Be careful not to format an already running cluster, even if you need to restart the NameNode daemon.
The NameNode can be formatted with:
/home/hduser/utilities/hadoop-1.0.3/bin/hadoop namenode -format
Step 6: Starting Hadoop Daemons
/home/hduser/utilities/hadoop-1.0.3/bin/start-all.sh
This will start all the Hadoop daemons: NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker.
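To confirm that the daemons are up, the jps utility that ships with the JDK lists the running Java processes; on a healthy single-node setup it should show entries for NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker and Jps itself:
jps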
To stop the Hadoop daemons, we use the command:
/home/hduser/utilities/hadoop-1.0.3/bin/stop-all.sh
This will stop all the Hadoop daemons. Running jps afterwards should list only the Jps process itself.