Wednesday, July 10, 2013

Hadoop Single Node Installation





Basic Prerequisites

This section lists the required software and system configuration.

Software required

·         Java JDK
As of now, the recommended and tested versions of Java for Hadoop and HBase installation include
Ø  Oracle JDK 1.6 (u20, 21, 26, 28, 31)
Hadoop requires Java 1.6+. It is built and tested on Oracle Java, which is the only "supported" JVM.
·         The latest stable version of Hadoop 1.x.x (here we are using the current stable version, Hadoop-1.0.3)
·         The latest stable version of HBase-0.9x.x (here we are using the current stable version, HBase-0.94.x)
      Notes:
 For HBase development, selecting the Hadoop version is very critical. The Hadoop version must be compatible with the HBase version. The following table shows which versions of Hadoop are supported by the various HBase versions. Based on the version of HBase, select the most appropriate version of Hadoop.


                     HBase-0.92.x    HBase-0.94.x    HBase-0.95
Hadoop-0.20.205      S               X               X
Hadoop-0.22.x        S               X               X
Hadoop-1.0.0-1.0.2   S               S               X
Hadoop-1.0.3+        S               S               S
Hadoop-1.1.x         NT              S               S
Hadoop-0.23.x        X               S               NT
Hadoop-2.x           X               S               S
HBase requires Hadoop-1.0.3 at a minimum.
Where
 S = Supported and tested
 X = Not supported
 NT = Should run, but not tested enough

Installing and configuring the Java JDK

   Step 1:

Before installing Hadoop, we have to install Java. It is recommended to use Oracle Java 1.6. To check whether Java is already available, use the Linux command
    java -version
This will show the installed Java version, if Java is already installed. If it is OpenJDK, remove it and install the Oracle JDK.
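On Ubuntu/Debian systems, for example, OpenJDK can usually be removed through the package manager before installing the Oracle JDK (assuming the stock openjdk packages are what is installed):
 sudo apt-get purge openjdk-\*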

  Step 2:

 Download the stable version of Java from the list. The downloaded file can be a .bin or a .tar file.
1.    For installing the .bin file
       Go to the directory containing the binary file

 sudo chmod u+x <filename>.bin
 ./<filename>.bin

2.    For installing the tar file
 sudo chmod u+x <filename>.tar
 sudo tar -xvf <filename>.tar   (use -xzvf if the archive is gzip-compressed, i.e. a .tar.gz file)
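For example, assuming the downloaded archive is named jdk-6u31-linux-x64.tar.gz (an illustrative name; use the actual file name you downloaded), the extraction would look like:
 sudo tar -xzvf jdk-6u31-linux-x64.tar.gz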

Step 3:

          Set JAVA_HOME in the /etc/bash.bashrc file. We can use the nano or vi editor to edit the file

 sudo nano /etc/bash.bashrc
Add the following lines towards the end of the file. If JAVA_HOME is already set for OpenJDK, replace it with the following lines
#set the JAVA_HOME
export JAVA_HOME=<path from root to that java directory>
export PATH=$JAVA_HOME/bin:$PATH
In the nano editor, use Ctrl+O to save the changes and Ctrl+X to exit
Note: You can also set JAVA_HOME in the user's home directory ($HOME/.bashrc file); the disadvantage of doing so is that JAVA_HOME will be available only for that user
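For example, if the Oracle JDK was extracted to /usr/lib/jvm/jdk1.6.0_31 (a path used here only for illustration; use your own extraction path), the lines would look like:
#set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_31
export PATH=$JAVA_HOME/bin:$PATH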



To refresh the 'bash.bashrc' file, use the source command
source /etc/bash.bashrc    

Note:
  Changes to bash.bashrc normally take effect only after the terminal is closed and reopened or the system is rebooted. On a running cluster we do not want to shut down or reboot the virtual machines just to pick up an environment change, so instead we refresh the current session using the 'source' command.

    Step 4: Switching the default Java from OpenJDK to the Oracle JDK

Now close the terminal, re-open it, and check whether the Java installation and PATH are working as desired
Or
update-alternatives --install /usr/bin/java java <path from root to that java directory>/bin/java 2
update-alternatives --config java
Select the number of the installed Oracle JDK (here the number is 2)
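To confirm that the Oracle JDK is now the default, check the version again:
 java -version
The output should now report the Java HotSpot(TM) VM instead of OpenJDK.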

Adding a dedicated user for running Hadoop services

We create a dedicated user for running the Hadoop daemons rather than running Hadoop as root. This is recommended because it isolates the Hadoop installation from other software, services, and users on the same machine.
We are creating a user 'hduser' in the group 'hadoop'.
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
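You can verify that the user and group were created correctly with the standard id command, and switch to the new user when needed:
 id hduser
 su - hduser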

Adding the newly created user to the sudoers file

         For the Hadoop installation, the newly created user needs a few extra privileges beyond those of a normal user. To grant these root privileges, we add the newly created user to the sudoers file.
 To add 'hduser' to the sudoers file, open the /etc/sudoers file using the nano text editor
sudo nano /etc/sudoers
Add the following line to the file
hduser ALL=(ALL) ALL
Save with Ctrl+O and exit with Ctrl+X. This gives 'hduser' root privileges through sudo

Configuring password-less SSH

We configure password-less SSH so that we do not have to enter a password every time Hadoop interacts with its nodes. Password-less key pairs need to be created and installed so that the SSH communication works without asking for user intervention
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
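To verify the setup, connect to the local machine once; the first connection also adds the host to the known_hosts file:
 ssh localhost
It should log you in without prompting for a password.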

Changing the Hostname of a Linux machine Without Rebooting

 When we are using virtual machines, a node can be referred to by either its IP address or its hostname. When a virtual machine is rebooted, its IP address may change, and this can badly affect the cluster. To avoid this, we refer to the nodes by hostname only, which is why we change the hostname of the Linux machine, and do so without rebooting.

Step 1: Change the hostname

We can change the hostname of a Linux system. First you must change the configuration file that controls this.
·         In Red Hat/CentOS/Fedora systems,
          we can edit the hostname in /etc/sysconfig/network
sudo nano /etc/sysconfig/network
HOSTNAME=<hostname of the system>

·         In Ubuntu/Debian systems,
 we can edit the hostname in /etc/hostname
sudo nano /etc/hostname
Delete the old name and add the new name
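Editing the configuration file alone takes effect only after a reboot. To change the name of the running system immediately, also set it with the hostname command; for example, assuming the new name is hadoop-master (a name used here only for illustration):
 sudo hostname hadoop-master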
 

 

Step 2:  Update /etc/hosts
Now, you need to edit the /etc/hosts file

sudo nano /etc/hosts
In the /etc/hosts file, add an entry mapping the machine's IP address to its hostname, so that the machine can be resolved by hostname as well as by IP address.
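For example, assuming the machine's IP address is 192.168.1.100 and its hostname is hadoop-master (both values are only illustrative), the entry would look like:
 192.168.1.100   hadoop-master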
 
         


Single Node Hadoop Installation


Step 1:  Extracting Hadoop tarball

We are creating a user home for the Hadoop installation. Here we are using '/home/hduser/utilities' as the user home. You need to extract the tarball in this location and change the permissions recursively on the extracted directory.
         Here we are using hadoop-1.0.3.tar.gz
mkdir -p /home/hduser/utilities
cd /home/hduser/utilities
sudo tar -xzvf hadoop-1.0.3.tar.gz
sudo chown -R hduser:hadoop hadoop-1.0.3
Step 2: Configuring Hadoop environment variables
We are adding HADOOP_HOME as an environment variable in the /etc/bash.bashrc file. By doing this, the Hadoop commands are accessible to every user.
sudo nano /etc/bash.bashrc
Append the following lines to add HADOOP_HOME to PATH
  #set HADOOP_HOME
export HADOOP_HOME=/home/hduser/utilities/hadoop-1.0.3
export PATH=$HADOOP_HOME/bin:$PATH
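After refreshing the environment, the hadoop command should be available on the PATH and report its version:
 source /etc/bash.bashrc
 hadoop version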
Step 3: Configuring Java for Hadoop
sudo nano /home/hduser/utilities/hadoop-1.0.3/conf/hadoop-env.sh
JAVA_HOME will be commented out by default. Edit the value of JAVA_HOME with your installation path and uncomment the line. The JAVA_HOME path should point to the JDK installation directory itself and should not include the bin folder
#The Java Implementation to use
export JAVA_HOME=<absolute path to java directory>

Step 4: Configuring Hadoop Properties

In Hadoop, we have three configuration files, core-site.xml, mapred-site.xml, and hdfs-site.xml, present in the HADOOP_HOME/conf directory.

                             Editing the Configuration files

1. core-site.xml
'hadoop.tmp.dir': the directory specified by this property is used to store the file system metadata of the namenode and the block data of the datanode. By default, the name and data directories are created under this tmp directory (as dfs/name and dfs/data).
We need to ensure that 'hduser' has sufficient permissions on the newly provided 'hadoop.tmp.dir'. We are configuring it to '/home/hduser/utilities/app/hadoop/tmp'.

The property 'fs.default.name' is required to provide the hostname and port of the namenode.
Create the directory and change its ownership and permissions so that it belongs to 'hduser':

cd /home/hduser/utilities
sudo mkdir -p app/hadoop/tmp
sudo chown -R hduser:hadoop app/hadoop/tmp
sudo chmod 755 app/hadoop/tmp
Setting the ownership and permissions is very important. If you forget this, you will run into exceptions while formatting the namenode
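You can confirm the ownership and permissions on the directory with:
 ls -ld /home/hduser/utilities/app/hadoop/tmp
The output should show hduser and hadoop as the owner and group.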
Open the core-site.xml file; you will see empty configuration tags. Add the following lines between the configuration tags
sudo nano /home/hduser/utilities/hadoop-1.0.3/conf/core-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/utilities/app/hadoop/tmp</value>  
      <description>
      A base for other temporary directories.
      </description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://<hostname/IP address of the system where namenode is installed>:54310</value>
  <description>the name of the default file system</description>
</property>
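For example, on a single-node setup where the namenode runs on a machine whose hostname is hadoop-master (an illustrative name), the value would be:
  <value>hdfs://hadoop-master:54310</value>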

2.    hdfs-site.xml
This file configures HDFS, the file system and storage layer. In hdfs-site.xml, add the following property between the configuration tags

sudo nano /home/hduser/utilities/hadoop-1.0.3/conf/hdfs-site.xml

     <property>
       <name>dfs.replication</name>
       <value>1</value>
       <description>Default block replication</description>
     </property>

3.    mapred-site.xml
 This file is used for MapReduce processing. In mapred-site.xml, we need to provide the hostname and port of the JobTracker, as the TaskTrackers will use this for their communication

sudo nano /home/hduser/utilities/hadoop-1.0.3/conf/mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value><hostname/IP address of the system where jobtracker is installed>:54311</value>
      <description>
      The host and port that the MapReduce job tracker runs
      </description>
</property>
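For example, using the same illustrative hostname as above, the value would be:
  <value>hadoop-master:54311</value>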
 

Step 5: Formatting NameNode

Before starting the HDFS daemons like the Namenode for the first time, it is mandatory that you format the Namenode/HDFS. This is only for the first run; formatting the namenode again on subsequent runs will lose all data. Be careful not to format an already running cluster, even if you need to restart the Namenode daemon.

Namenode can be formatted as

  /home/hduser/utilities/hadoop-1.0.3/bin/hadoop namenode -format
The console output should end with a message indicating that the storage directory has been successfully formatted.
     

Step 6:  Starting Hadoop Daemons


/home/hduser/utilities/hadoop-1.0.3/bin/start-all.sh

This will run all the Hadoop daemons: Namenode, Datanode, SecondaryNamenode, Jobtracker, and Tasktracker
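You can confirm that the daemons are running with the jps command that ships with the JDK:
 jps
It should list NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker, and Jps itself.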




For Stopping Hadoop daemons

                       We are using the command
/home/hduser/utilities/hadoop-1.0.3/bin/stop-all.sh
This will stop all the Hadoop daemons. After they are stopped, the jps command should show only the Jps process itself
