(can be generated through TeraGen in Hadoop MapReduce)
Wikipedia (for benchmarks Word-Count, Grep, Inverted-index, Term-vector, Multi-wordcount)
Self-Join
Adjacency-List
Movies-database (for benchmarks Kmeans, Classification, Histogram-Movies, Histogram-Ratings)
Starting centroids (for kmeans, classification)
For kmeans, the starting centroids file contains 6 cluster centroids, chosen as randomly selected movies. The file is currently assumed to be on the local disk of every node at /localhome/hadoop1/work/kmeans/centroids_file (the path can be changed in the benchmark file).
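As a rough illustration of how such a file might be produced, the sketch below picks 6 random lines from a movies input file. It assumes one movie record per line and that the centroids file uses the same record format as the input; neither is stated here. The function name pick_centroids and the file names in the usage line are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of the benchmark suite.
# pick_centroids INPUT K: print K distinct lines chosen at random from INPUT.
# Assumes one movie record per line (an assumption, not confirmed by the text).
pick_centroids() {
    shuf -n "$2" "$1"
}

# Example use (file names are placeholders):
#   pick_centroids movies.txt 6 > centroids_file
```

shuf is part of GNU coreutils and samples without replacement, so the 6 selected lines are distinct input lines.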
A single-iteration run can be performed using the command:
$ bin/hadoop jar hadoop-*-examples.jar kmeans -m <num_maps> -r <num_reduces> <input_dir> <output_dir>
A multiple-iteration run can be performed using the script kmeansdriver.sh. The above path (for the centroids file) also needs to be updated in the script file. The script currently assumes a default threshold value of 0.01 for the iteration termination condition. The command to execute the script is:
$ $hadoop/src/examples/org/apache/hadoop/examples/kmeans/kmeansdriver.sh <num_maps> <num_reduces> <input_dir> <output_dir> <num_iterations>
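The general shape of such a driver loop can be sketched as follows. This is not the contents of kmeansdriver.sh. It assumes the threshold is compared against the largest per-coordinate movement of any centroid between two iterations, and that centroids are stored as whitespace-separated numeric columns; the text does not specify either. The names max_shift, below_threshold, drive, run_iteration, centroids.old and centroids.new are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of a multi-iteration driver; not the actual kmeansdriver.sh.

THRESHOLD=0.01   # default termination threshold mentioned in the text

# max_shift OLD NEW: print the largest absolute per-coordinate difference
# between two numeric files with the same number of rows and columns.
# (Assumes plain numeric columns; the real centroids format may differ.)
max_shift() {
    paste "$1" "$2" | awk '
        {
            half = NF / 2
            for (i = 1; i <= half; i++) {
                d = $i - $(i + half)
                if (d < 0) d = -d
                if (d > max) max = d
            }
        }
        END { printf "%g\n", max + 0 }'
}

# below_threshold VALUE: succeed (exit status 0) when VALUE < THRESHOLD.
below_threshold() {
    awk -v v="$1" -v t="$THRESHOLD" 'BEGIN { exit !(v < t) }'
}

# drive MAX_ITER: repeat the single-iteration step until the centroids move
# by less than THRESHOLD or MAX_ITER is reached. run_iteration is a
# placeholder that must be defined by the caller: it would wrap the
# single-iteration hadoop command and write centroids.new.
drive() {
    i=1
    while [ "$i" -le "$1" ]; do
        run_iteration "$i" || return 1
        shift_amount=$(max_shift centroids.old centroids.new)
        if below_threshold "$shift_amount"; then
            echo "converged after $i iteration(s)"
            return 0
        fi
        mv centroids.new centroids.old
        i=$((i + 1))
    done
    echo "stopped after $1 iteration(s) without converging"
}
```

Keeping the convergence test in its own function makes it possible to try a different threshold or distance measure without touching the loop.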
Running kmeans is a 3-step process:
1. Copy the centroids file to the local disk of every node at a suitable location (it can also be placed on HDFS as part of the configuration file and read from there).
2. Update the benchmark file and the script file with the path of the centroids file.
3. Run the benchmark using either of the commands above.
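Step 1 can be sketched as below, assuming passwordless SSH to every node and a hypothetical text file listing one hostname per line. The function distribute_cmds and the file name nodes.txt are illustrative only. The function prints the commands rather than running them, so the list can be checked first.

```shell
#!/bin/sh
# Hypothetical sketch of step 1: copy the centroids file to the same path
# on every node. Assumes passwordless SSH; not part of the benchmark suite.

CENTROIDS=/localhome/hadoop1/work/kmeans/centroids_file

# distribute_cmds NODES_FILE: print one mkdir and one scp command per node
# listed in NODES_FILE (one hostname per line, blank lines ignored).
distribute_cmds() {
    dir=$(dirname "$CENTROIDS")
    while IFS= read -r node; do
        [ -n "$node" ] || continue          # skip blank lines
        echo "ssh $node mkdir -p $dir"
        echo "scp $CENTROIDS $node:$CENTROIDS"
    done < "$1"
}

# To run the printed commands after reviewing them:
#   distribute_cmds nodes.txt | sh
```

If the path is changed here, the same path must be set in the benchmark file and in the script file, as described in step 2.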
Ranked-Inverted-Index (uses the output of Multi-word-count)