Hadoop: Setup a cluster test system

From Define Wiki
Revision as of 08:59, 25 July 2012 by David

Tested with the stable 1.0.3 release on Ubuntu 12.04.

Installation

Ensure the following:

  - Install Hadoop on all systems (tar zxvf the release tarball into /opt).
  - Set up password-less ssh between all hosts.
  - It is useful to have pdsh installed (to run commands across all hosts) and csync2 to keep the Hadoop configuration files synced.
  - Ensure Java is installed on all nodes (apt-get install openjdk-7-jre on 12.04).
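As a quick sanity check before going further, the Java prerequisite can be verified on each node with a short snippet like this (a sketch; run it through pdsh to cover every host at once):

```shell
# Pre-flight check: is Java available on this node?
if command -v java >/dev/null 2>&1; then
    java -version 2>&1 | head -n 1
else
    echo "Java not found; on Ubuntu 12.04: apt-get install openjdk-7-jre"
fi
```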

Default Settings in Hadoop

The following files contain all the default settings for Hadoop. These can be changed in the site-specific configuration files.

  - src/core/core-default.xml
  - src/hdfs/hdfs-default.xml 
  - src/mapred/mapred-default.xml

The following files (the site configuration files) can be used to override any of the default parameters above:

  - conf/core-site.xml 
  - conf/hdfs-site.xml 
  - conf/mapred-site.xml

Setup Hadoop Cluster

Set up the local environment and control Hadoop startup variables (at the very least, set $JAVA_HOME!)

conf/hadoop-env.sh

  ..
  export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf/jre
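A quick way to confirm the exported path is valid (the path below is the one used in this article; substitute your own):

```shell
# Does the JAVA_HOME we exported actually contain a java binary?
JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf/jre
if [ -x "$JAVA_HOME/bin/java" ]; then
    echo "JAVA_HOME ok: $JAVA_HOME"
else
    echo "no java binary under $JAVA_HOME"
fi
```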

conf/core-site.xml

<xml>

  <configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hostname:9000</value>
    </property>
  </configuration>

</xml>

conf/hdfs-site.xml

Note: In this configuration file we are only using one disk per host; /data/hadoop/dfs/name is where the namenode keeps its metadata.

<xml>

  <configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/data/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/data/hadoop/dfs/data</value>
    </property>
    <property>
        <name>fs.checkpoint.dir</name>
        <value>/data/hadoop/dfs/namesecondary</value>
    </property>
  </configuration>

</xml>

conf/mapred-site.xml

<xml>

  <configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>hostname:9001</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/data/hadoop/mapred/local</value>
    </property>
    <property>
        <name>mapred.system.dir</name>
        <value>/data/hadoop/mapred/system</value>
    </property>
    <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>512</value>
        <description>The maximum number of map tasks that will be run
        simultaneously by a task tracker.
        </description>
    </property>
    <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>512</value>
        <description>The maximum number of reduce tasks that will be run
        simultaneously by a task tracker.
        </description>
    </property>
  </configuration>

</xml>

conf/slaves

This is simply a newline-separated file listing the hosts that will be data nodes, one host per line. Ensure password-less ssh access between all hosts before setup.

host1
host2
host3
...
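Password-less access can be verified against every host in conf/slaves with a loop like the following (a sketch; BatchMode makes ssh fail immediately instead of prompting for a password):

```shell
# Check password-less ssh to every host listed in conf/slaves.
SLAVES=conf/slaves
if [ -f "$SLAVES" ]; then
    while read -r h; do
        [ -z "$h" ] && continue                      # skip blank lines
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true; then
            echo "ok: $h"
        else
            echo "WARNING: no password-less ssh to $h"
        fi
    done < "$SLAVES"
else
    echo "slaves file not found: $SLAVES"
fi
```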

Start up Hadoop

At each of the steps below, check the output for errors.

Format the namenode

  ./bin/hadoop namenode -format

Start DFS

  ./bin/start-dfs.sh

Start MapReduce daemons

  ./bin/start-mapred.sh
  • Note: The two start-* commands could be replaced with start-all.sh.
  • To stop the daemons, run stop-*.sh.
  • If you run into problems launching the daemons, start them up manually and check the output for errors using:
  # namenode
  nohup nice -n 0 /opt/hadoop-1.0.3/libexec/../bin/hadoop --config /opt/hadoop-1.0.3/libexec/../conf namenode 

  # datanode
  nohup nice -n 0 /opt/hadoop-1.0.3/libexec/../bin/hadoop --config /opt/hadoop-1.0.3/libexec/../conf datanode

  # tasktracker
  nohup nice -n 0 /opt/hadoop-1.0.3/libexec/../bin/hadoop --config /opt/hadoop-1.0.3/libexec/../conf tasktracker 

  # secondary namenode
  nohup nice -n 0 /opt/hadoop-1.0.3/libexec/../bin/hadoop --config /opt/hadoop-1.0.3/libexec/../conf secondarynamenode
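Whichever way the daemons were started, jps gives a quick per-host view of what is running: on the master you should see NameNode and JobTracker, and on each slave DataNode and TaskTracker. A guarded check:

```shell
# List running JVM processes; the Hadoop daemons show up by class name.
if command -v jps >/dev/null 2>&1; then
    jps
else
    echo "jps not found (it ships with the JDK, not the JRE)"
fi
```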

Verify Hadoop is working

Check that HDFS is working as expected

root@calx2:~# cd /opt/hadoop-1.0.3/
root@calx2:/opt/hadoop-1.0.3# ./bin/hadoop dfsadmin -report 
Configured Capacity: 5533761699840 (5.03 TB)
Present Capacity: 5144346378240 (4.68 TB)
DFS Remaining: 5144345747456 (4.68 TB)
DFS Used: 630784 (616 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 22 (22 total, 0 dead)

Name: 172.28.0.190:50010
Decommission Status : Normal
Configured Capacity: 251534622720 (234.26 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 17711915008 (16.5 GB)
DFS Remaining: 233822679040(217.76 GB)
DFS Used%: 0%
DFS Remaining%: 92.96%
Last contact: Tue Jul 24 04:11:55 CDT 2012
..
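Beyond dfsadmin -report, a simple round trip through HDFS confirms that reads and writes work end to end. A sketch, assuming the /opt install path used in this article:

```shell
# Write a small file into HDFS, read it back, then remove it.
HADOOP=/opt/hadoop-1.0.3/bin/hadoop
if [ -x "$HADOOP" ]; then
    echo "hello hdfs" > /tmp/smoke.txt
    "$HADOOP" fs -put /tmp/smoke.txt /smoke.txt
    "$HADOOP" fs -cat /smoke.txt        # should print: hello hdfs
    "$HADOOP" fs -rm /smoke.txt
else
    echo "hadoop not found at $HADOOP"
fi
```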