HiBench Testing
HiBench is a Hadoop benchmark suite. It is useful for verifying the correctness and health of a Hadoop installation.
- Download HiBench from github:
git clone https://github.com/intel-hadoop/HiBench.git
Getting Started
Prerequisites
- Setup HiBench
- Make sure Maven is installed. Then change into HiBench/common/hibench and run mvn process-sources to fetch the dependencies.
- Setup Hadoop
- Before you run any workload in the package, please verify that the Hadoop framework is running correctly. All the workloads have been tested with Cloudera Distribution of Hadoop 5 (CDH 5.1.0) and with Hadoop versions 1.0.4 and 2.2.0.
- Setup Hive (for hivebench)
- Please make sure you have properly set up Hive in your cluster if you want to test hivebench; otherwise the benchmark will use the default release fetched by Maven.
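The HiBench setup step above, expressed as a short command sequence (a sketch; it assumes the repository was cloned into the current directory):

```shell
# Fetch HiBench's build dependencies via Maven (see the setup step above).
cd HiBench/common/hibench
mvn process-sources
```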
Configure all workloads
You need to set some global environment variables in the bin/hibench-config.sh file located in the root directory.
export JAVA_HOME=/usr/lib/jvm/java
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.5.1/
export HADOOP_EXECUTABLE=hadoop
export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/
export HADOOP_EXAMPLES_JAR=/opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1-mapr-1501.jar
export MAPRED_EXECUTABLE=mapred
#Set the variable below only in YARN mode
export HADOOP_JOBCLIENT_TESTS_JAR=/opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.5.1-mapr-1501-tests.jar
These variables are for use with MapR (see MapR: Landing Page).
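As a hedged sanity check after editing bin/hibench-config.sh, you can confirm that the path-valued variables actually exist before launching any workload. The variable names below match the example above; the check itself is an illustration, not part of HiBench:

```shell
# Warn about config variables that point at missing paths.
[ -f bin/hibench-config.sh ] && source bin/hibench-config.sh
for dir in "$JAVA_HOME" "$HADOOP_HOME" "$HADOOP_CONF_DIR"; do
  [ -d "$dir" ] || echo "WARNING: missing or unset: $dir"
done
[ -f "$HADOOP_EXAMPLES_JAR" ] || echo "WARNING: jar not found: $HADOOP_EXAMPLES_JAR"
```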
Configure each workload
You can modify the conf/configure.sh file under each workload folder if it exists. All the data size and options related to the workload are defined in this file.
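For example, a workload's conf/configure.sh typically looks like the sketch below. The exact variable names differ between workloads, so treat these as illustrative rather than exact:

```shell
# Illustrative conf/configure.sh excerpt (names vary per workload)
DATASIZE=1000000000   # size of the generated input data
NUM_MAPS=12           # map tasks used when preparing the input
NUM_REDS=6            # reduce tasks used by the job
```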
Synchronize the time on all nodes
This is required for dfsioe and optional for the other workloads.
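One hedged way to do this from a driver host, assuming passwordless ssh and that ntpdate is installed on each node; the host names and NTP server below are placeholders for your own cluster:

```shell
# Sync every node's clock against a common NTP server (placeholder hosts).
for node in worker1 worker2 worker3; do
  ssh "$node" sudo ntpdate -u pool.ntp.org
done
```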
Running
Run several workloads together
The conf/benchmarks.lst file under the package folder defines the workloads to run when you execute the bin/run-all.sh script under the package folder. Each line in the list file specifies one workload. Prefix a line with # to skip the corresponding benchmark if necessary.
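A conf/benchmarks.lst might look like the following; the workload names are examples and must match workload folders in your HiBench checkout:

```
# one workload per line; a leading "#" skips the entry
wordcount
sort
#terasort
dfsioe
```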
Run workload in throughput test mode
The conf/benchmarks-concurrent.lst file under the package folder defines the workloads to run when you execute the bin/run-concurrent.sh script under the package folder. The number at the end of each line indicates how many instances of that workload to submit simultaneously. Before running, execute the script bin/prepare-concurrent.sh.
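A conf/benchmarks-concurrent.lst might look like the following, where the trailing number is how many instances of that workload are submitted at once (workload names are examples):

```
wordcount 3
sort 2
#hivebench 4
```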
For the workload hivebench, please run the metastore as a service first to support concurrency: hive --service metastore.
Run each workload separately
You can also run each workload separately. In general, there are 3 different files under one workload folder.
conf/configure.sh Configuration file contains all parameters such as data size and test options.
bin/prepare*.sh Generate or copy the job input data into HDFS.
bin/run*.sh Execute the workload.
Follow the steps below to run a workload:
- Configure the benchmark:
- set your own configuration by modifying conf/configure.sh if necessary
- Prepare data:
- bin/prepare.sh (bin/prepare-read.sh for dfsioe) to prepare input data in HDFS for running the benchmark
- Run the benchmark:
- bin/run*.sh to execute the corresponding benchmark
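Put together, running a single workload end to end looks roughly like this; wordcount is used as an example folder, and the exact script names may vary per workload:

```shell
cd HiBench/wordcount            # example workload folder
# 1. adjust conf/configure.sh if the defaults do not fit
bin/prepare.sh                  # 2. generate the input data in HDFS
bin/run.sh                      # 3. execute the benchmark
```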