Hadoop: Setup a single host test system


Tests performed on a single Calxeda SoC running Ubuntu 12.10.

Prerequisites

Install Java/JRE

  apt-get update
  apt-get install default-jre openjdk-7-jre
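
A quick check that the JRE installed correctly and is on the PATH:

  java -version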

Setup Passwordless Access

Set up passwordless SSH for the user that will run Hadoop (root is used in this example; a separate hadoop user should really be set up!)

  ssh-keygen -t rsa
  # don't enter a passphrase; just hit Enter twice to leave it blank
  cd .ssh
  cat id_rsa.pub >> authorized_keys
  chmod 600 authorized_keys
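
To confirm the key is picked up, ssh to localhost; it should log in without asking for a password (the very first connection will still prompt to accept the host key):

  ssh localhost
  exit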

Install Hadoop

Get latest stable release

The latest release is available from: http://ftp.heanet.ie/mirrors/www.apache.org/dist/hadoop/common/stable/

  wget http://ftp.heanet.ie/mirrors/www.apache.org/dist/hadoop/common/stable/hadoop-1.0.3.tar.gz
  cd /opt
  tar zxvf /path/to/download/hadoop-1.0.3.tar.gz
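
All the remaining commands are run from the install directory, so change into it now. Optionally (not part of the original steps) the bin directory can be added to the PATH of the current shell:

  cd /opt/hadoop-1.0.3
  # optional: lets you type "hadoop ..." instead of "./bin/hadoop ..."
  export PATH=$PATH:/opt/hadoop-1.0.3/bin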

Setup Config Files

All files in question here are found in /opt/hadoop-1.0.3

conf/core-site.xml:

  <configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
  </configuration>

conf/hdfs-site.xml:

  <configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
  </configuration>

conf/mapred-site.xml:

  <configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
  </configuration>
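
Optionally (not part of the original configuration), hadoop.tmp.dir can also be set so that HDFS data and namenode metadata do not end up under /tmp, which is cleared on reboot. The property below goes inside the <configuration> element of conf/core-site.xml; the path is only an example:

    <property>
        <!-- example path, adjust as needed -->
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop-data</value>
    </property>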

conf/hadoop-env.sh:

  export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf/jre
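
The JAVA_HOME path above matches the openjdk-7-jre package on an armhf system; on other installs the directory name differs, so if the Hadoop scripts later complain about JAVA_HOME, check what is actually present:

  ls -d /usr/lib/jvm/*
  /usr/lib/jvm/java-7-openjdk-armhf/jre/bin/java -version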

Format the namenode

  ./bin/hadoop namenode -format
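
With the minimal configuration above the namenode stores its metadata under hadoop.tmp.dir, which defaults to /tmp/hadoop-${user.name} in Hadoop 1.x. Assuming those defaults and the root user, a quick sanity check that the format succeeded is:

  ls /tmp/hadoop-root/dfs/name/current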

Start Hadoop

  ./bin/start-all.sh
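
Once start-all.sh returns, five daemons should be running on the host: NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. jps is the quickest way to check (it ships with the JDK rather than the JRE, so it may require installing openjdk-7-jdk), and the NameNode and JobTracker web UIs listen on their default ports:

  jps
  # expect: NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker (plus Jps itself)
  # web UIs on the default ports:
  #   http://localhost:50070   NameNode / HDFS status
  #   http://localhost:50030   JobTracker / MapReduce status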

Verify Hadoop

Check available tests

root@cal4:/opt/hadoop-1.0.3$ ./bin/hadoop jar hadoop-test-1.0.3.jar 
An example program must be given as the first argument.
Valid program names are:
  DFSCIOTest: Distributed i/o benchmark of libhdfs.
  DistributedFSCheck: Distributed checkup of the file system consistency.
  MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
  TestDFSIO: Distributed i/o benchmark.
  dfsthroughput: measure hdfs throughput
  filebench: Benchmark SequenceFile(Input|Output)Format (block,record compressed and uncompressed), Text(Input|Output)Format (compressed and uncompressed)
  loadgen: Generic map/reduce load generator
  mapredtest: A map/reduce test check.
  mrbench: A map/reduce benchmark that can create many small jobs
  nnbench: A benchmark that stresses the namenode.
  testarrayfile: A test for flat files of binary key/value pairs.
  testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
  testfilesystem: A test for FileSystem read/write.
  testipc: A test for ipc.
  testmapredsort: A map/reduce program that validates the map-reduce framework's sort.
  testrpc: A test for rpc.
  testsequencefile: A test for flat files of binary key value pairs.
  testsequencefileinputformat: A test for sequence file input format.
  testsetfile: A test for flat files of binary key/value pairs.
  testtextinputformat: A test for text input format.
  threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
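
Any of the programs above is run the same way, with the program name as the first argument. For example, mrbench with its default settings submits a single small MapReduce job, which makes a quick end-to-end smoke test before the longer DFSIO run below:

  ./bin/hadoop jar hadoop-test-1.0.3.jar mrbench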

Run the DFSIO Test

root@cal4:/opt/hadoop-1.0.3$ ./bin/hadoop jar hadoop-test*.jar TestDFSIO -write  -nrFiles 10 -fileSize 100 TestDFSIO.0.0.4
12/07/18 12:18:43 INFO fs.TestDFSIO: nrFiles = 10
12/07/18 12:18:43 INFO fs.TestDFSIO: fileSize (MB) = 100
12/07/18 12:18:43 INFO fs.TestDFSIO: bufferSize = 1000000
12/07/18 12:18:45 INFO fs.TestDFSIO: creating control file: 100 mega bytes, 10 files
12/07/18 12:18:45 INFO fs.TestDFSIO: created control files for: 10 files
12/07/18 12:18:46 INFO mapred.FileInputFormat: Total input paths to process : 10
12/07/18 12:18:46 INFO mapred.JobClient: Running job: job_201207171641_0004
12/07/18 12:18:48 INFO mapred.JobClient:  map 0% reduce 0%
12/07/18 12:19:10 INFO mapred.JobClient:  map 20% reduce 0%
12/07/18 12:19:22 INFO mapred.JobClient:  map 40% reduce 6%
12/07/18 12:19:31 INFO mapred.JobClient:  map 40% reduce 13%
12/07/18 12:19:34 INFO mapred.JobClient:  map 60% reduce 13%
12/07/18 12:19:46 INFO mapred.JobClient:  map 80% reduce 20%
12/07/18 12:19:52 INFO mapred.JobClient:  map 80% reduce 26%
12/07/18 12:19:58 INFO mapred.JobClient:  map 100% reduce 26%
12/07/18 12:20:07 INFO mapred.JobClient:  map 100% reduce 100%
12/07/18 12:20:15 INFO mapred.JobClient: Job complete: job_201207171641_0004
12/07/18 12:20:16 INFO mapred.JobClient: Counters: 30
12/07/18 12:20:16 INFO mapred.JobClient:   Job Counters 
12/07/18 12:20:16 INFO mapred.JobClient:     Launched reduce tasks=1
12/07/18 12:20:16 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=119264
12/07/18 12:20:16 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/07/18 12:20:16 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/07/18 12:20:16 INFO mapred.JobClient:     Launched map tasks=10
12/07/18 12:20:16 INFO mapred.JobClient:     Data-local map tasks=10
12/07/18 12:20:16 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=56575
12/07/18 12:20:16 INFO mapred.JobClient:   File Input Format Counters 
12/07/18 12:20:16 INFO mapred.JobClient:     Bytes Read=1120
12/07/18 12:20:16 INFO mapred.JobClient:   File Output Format Counters 
12/07/18 12:20:16 INFO mapred.JobClient:     Bytes Written=78
12/07/18 12:20:16 INFO mapred.JobClient:   FileSystemCounters
12/07/18 12:20:16 INFO mapred.JobClient:     FILE_BYTES_READ=851
12/07/18 12:20:16 INFO mapred.JobClient:     HDFS_BYTES_READ=2360
12/07/18 12:20:16 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=238588
12/07/18 12:20:16 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1048576078
12/07/18 12:20:16 INFO mapred.JobClient:   Map-Reduce Framework
12/07/18 12:20:16 INFO mapred.JobClient:     Map output materialized bytes=905
12/07/18 12:20:16 INFO mapred.JobClient:     Map input records=10
12/07/18 12:20:16 INFO mapred.JobClient:     Reduce shuffle bytes=815
12/07/18 12:20:16 INFO mapred.JobClient:     Spilled Records=100
12/07/18 12:20:16 INFO mapred.JobClient:     Map output bytes=745
12/07/18 12:20:16 INFO mapred.JobClient:     Total committed heap usage (bytes)=1626120192
12/07/18 12:20:16 INFO mapred.JobClient:     CPU time spent (ms)=58680
12/07/18 12:20:16 INFO mapred.JobClient:     Map input bytes=260
12/07/18 12:20:16 INFO mapred.JobClient:     SPLIT_RAW_BYTES=1240
12/07/18 12:20:16 INFO mapred.JobClient:     Combine input records=0
12/07/18 12:20:16 INFO mapred.JobClient:     Reduce input records=50
12/07/18 12:20:16 INFO mapred.JobClient:     Reduce input groups=5
12/07/18 12:20:16 INFO mapred.JobClient:     Combine output records=0
12/07/18 12:20:16 INFO mapred.JobClient:     Physical memory (bytes) snapshot=1853804544
12/07/18 12:20:16 INFO mapred.JobClient:     Reduce output records=5
12/07/18 12:20:16 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4109959168
12/07/18 12:20:16 INFO mapred.JobClient:     Map output records=50
12/07/18 12:20:16 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
12/07/18 12:20:16 INFO fs.TestDFSIO:            Date & time: Wed Jul 18 12:20:16 BST 2012
12/07/18 12:20:16 INFO fs.TestDFSIO:        Number of files: 10
12/07/18 12:20:16 INFO fs.TestDFSIO: Total MBytes processed: 1000
12/07/18 12:20:16 INFO fs.TestDFSIO:      Throughput mb/sec: 21.530379365284418
12/07/18 12:20:16 INFO fs.TestDFSIO: Average IO rate mb/sec: 21.541706085205078
12/07/18 12:20:16 INFO fs.TestDFSIO:  IO rate std deviation: 0.4955591491226172
12/07/18 12:20:16 INFO fs.TestDFSIO:     Test exec time sec: 90.213
12/07/18 12:20:16 INFO fs.TestDFSIO:
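
The write phase leaves its data files in HDFS (under /benchmarks/TestDFSIO by default in Hadoop 1.x), so the matching read benchmark can be run against the same files, and -clean removes them once finished:

  ./bin/hadoop jar hadoop-test*.jar TestDFSIO -read -nrFiles 10 -fileSize 100
  ./bin/hadoop jar hadoop-test*.jar TestDFSIO -clean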