HiBench Testing
HiBench is a Hadoop benchmark suite. It is useful for verifying the correctness and health of a Hadoop installation.
- Download HiBench from github:
git clone https://github.com/intel-hadoop/HiBench.git
Getting Started
Prerequisites
- Setup HiBench
- Make sure Maven is installed. Then change into HiBench/common/hibench and run mvn process-sources to fetch the dependencies.
- Setup Hadoop
- Before you run any workload in the package, please verify that the Hadoop framework is running correctly. All the workloads have been tested with Cloudera Distribution of Hadoop 5 (CDH 5.1.0) and with Hadoop versions 1.0.4 and 2.2.0.
- Setup Hive (for hivebench)
- Please make sure you have properly set up Hive in your cluster if you want to test hivebench; otherwise the benchmark will use the default release fetched by Maven.
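The HiBench setup step above, expressed as a short command sequence (a sketch; it assumes the repository was cloned into the current directory):

```shell
# Fetch HiBench's build dependencies via Maven (see the setup step above).
cd HiBench/common/hibench
mvn process-sources
```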
Configure all workloads
You need to set some global environment variables in the bin/hibench-config.sh file located in the root directory.
export JAVA_HOME=/usr/lib/jvm/java
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.5.1/
export HADOOP_EXECUTABLE=hadoop
export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/
export HADOOP_EXAMPLES_JAR=/opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1-mapr-1501.jar
export MAPRED_EXECUTABLE=mapred
#Set the variable below only in YARN mode
export HADOOP_JOBCLIENT_TESTS_JAR=/opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.5.1-mapr-1501-tests.jar
These variables are for use with MapR (see MapR: Landing Page).
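As a hedged sanity check after editing bin/hibench-config.sh, you can confirm that the path-valued variables actually exist before launching any workload. The variable names below match the example above; the check itself is an illustration, not part of HiBench:

```shell
# Warn about config variables that point at missing paths.
[ -f bin/hibench-config.sh ] && source bin/hibench-config.sh
for dir in "$JAVA_HOME" "$HADOOP_HOME" "$HADOOP_CONF_DIR"; do
  [ -d "$dir" ] || echo "WARNING: missing or unset: $dir"
done
[ -f "$HADOOP_EXAMPLES_JAR" ] || echo "WARNING: jar not found: $HADOOP_EXAMPLES_JAR"
```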
Configure each workload
You can modify the conf/configure.sh file under each workload folder if it exists. All the data size and options related to the workload are defined in this file.
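For example, a workload's conf/configure.sh typically looks like the sketch below. The exact variable names differ between workloads, so treat these as illustrative rather than exact:

```shell
# Illustrative conf/configure.sh excerpt (names vary per workload)
DATASIZE=1000000000   # size of the generated input data
NUM_MAPS=12           # map tasks used when preparing the input
NUM_REDS=6            # reduce tasks used by the job
```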
Synchronize the time on all nodes
This is required for dfsioe and optional for the other workloads.
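One hedged way to do this from a driver host, assuming passwordless ssh and that ntpdate is installed on each node; the host names and NTP server below are placeholders for your own cluster:

```shell
# Sync every node's clock against a common NTP server (placeholder hosts).
for node in worker1 worker2 worker3; do
  ssh "$node" sudo ntpdate -u pool.ntp.org
done
```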
Running
Run several workloads together
The conf/benchmarks.lst file under the package folder defines the workloads to run when you execute the bin/run-all.sh script under the package folder. Each line in the list file specifies one workload. Prefix a line with # to skip the corresponding benchmark if necessary.
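A conf/benchmarks.lst might look like the following; the workload names are examples and must match workload folders in your HiBench checkout:

```
# one workload per line; a leading "#" skips the entry
wordcount
sort
#terasort
dfsioe
```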
Run workload in throughput test mode
The conf/benchmarks-concurrent.lst file under the package folder defines the workloads to run when you execute the bin/run-concurrent.sh script under the package folder. The number at the end of each line indicates how many instances of that workload to submit simultaneously. Before running, execute the script bin/prepare-concurrent.sh.
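A conf/benchmarks-concurrent.lst might look like the following, where the trailing number is how many instances of that workload are submitted at once (workload names are examples):

```
wordcount 3
sort 2
#hivebench 4
```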
For the workload hivebench, please run the metastore as a service first to support concurrency: hive --service metastore.
Run each workload separately
You can also run each workload separately. In general, there are 3 different files under one workload folder.
conf/configure.sh Configuration file contains all parameters such as data size and test options.
bin/prepare*.sh Generate or copy the job input data into HDFS.
bin/run*.sh Execute the workload.
Follow the steps below to run a workload:
- Configure the benchmark:
- set your own configuration by modifying conf/configure.sh if necessary
- Prepare data:
- bin/prepare.sh (bin/prepare-read.sh for dfsioe) to prepare input data in HDFS for running the benchmark
- Run the benchmark:
- bin/run*.sh to execute the corresponding benchmark
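Put together, running a single workload end to end looks roughly like this; wordcount is used as an example folder, and the exact script names may vary per workload:

```shell
cd HiBench/wordcount            # example workload folder
# 1. adjust conf/configure.sh if the defaults do not fit
bin/prepare.sh                  # 2. generate the input data in HDFS
bin/run.sh                      # 3. execute the benchmark
```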