PUMA Benchmarks and dataset downloads

The extracted benchmarks can replace the existing $HADOOP/src/examples/org/apache/hadoop/examples directory in Hadoop-0.20.X and Hadoop-1.0.0 compatible releases. The directory contains all of the existing Hadoop example benchmarks as well as the benchmarks from our suite.
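As a rough sketch, installing the suite into a Hadoop source tree could look like the following; the tarball name, the extracted directory name, and the ant target are assumptions and may differ by release:

```shell
# Sketch: drop the extracted benchmarks into the Hadoop source tree and
# rebuild the examples jar. The tarball name, extracted directory name,
# and ant target are assumptions; adjust them to your release.
cd "$HADOOP"
tar xzf /path/to/puma-benchmarks.tar.gz
cp -r benchmarks/* src/examples/org/apache/hadoop/examples/
ant examples   # rebuilds hadoop-*-examples.jar
```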

Benchmarks (tar.gz)

(can be generated through TeraGen in Hadoop MapReduce)


Dataset1 30GB

Dataset2 150GB

Dataset3 300GB

(for benchmarks Word-Count, Grep, Inverted-index, Term-vector, Multi-wordcount)


Dataset1 50GB

Dataset2 140GB

Dataset3 150GB

Dataset4 300GB

Data generation script

Dataset1 30GB

Dataset2 80GB

Dataset3 150GB

Dataset4 300GB

Data generation script

Dataset1 30GB

Dataset2 150GB

(for benchmarks Kmeans, Classification, Histogram-Movies, Histogram-Ratings)


Dataset1 30GB

Dataset2 100GB

Dataset3 300GB

starting centroids (for kmeans, classification)


For kmeans, the starting-centroids file contains 6 cluster centroids, chosen as randomly selected movies. The file is currently assumed to be present on the local disk of every node at /localhome/hadoop1/work/kmeans/centroids_file (the path can be changed in the benchmark file).
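Pushing the file out to every node can be scripted; the following is a minimal sketch, assuming a plain-text slaves file with one worker hostname per line (the file location and the use of ssh/scp are assumptions about the cluster setup, not part of the benchmark itself):

```shell
#!/bin/sh
# Copy the local centroids file to the same path on every worker node.
# CENTROIDS matches the default path in the benchmark; SLAVES is an
# assumed hostname list (one host per line), as in a stock Hadoop setup.
CENTROIDS=/localhome/hadoop1/work/kmeans/centroids_file
SLAVES=conf/slaves

push_centroids() {
  while read -r host; do
    # create the target directory remotely, then copy the file into it
    ssh "$host" "mkdir -p $(dirname "$CENTROIDS")"
    scp "$CENTROIDS" "$host:$CENTROIDS"
  done < "$SLAVES"
}

# push_centroids   # uncomment to run the distribution
```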

A single-iteration run can be performed using the command:

$ bin/hadoop jar hadoop-*-examples.jar kmeans -m <num_maps> -r <num_reduces> <input_dir> <output_dir>


A multiple-iteration run can be performed using the script kmeansdriver.sh. The centroids-file path above also needs to be updated in the script file. The script currently assumes a default threshold of 0.01 for the iteration-termination condition. The command to execute the script is:

$ $HADOOP/src/examples/org/apache/hadoop/examples/kmeans/kmeansdriver.sh <num_maps> <num_reduces> <input_dir> <output_dir> <num_iterations>
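The driver's stopping rule can be pictured with the sketch below; this is a hypothetical illustration rather than the actual contents of kmeansdriver.sh, and it assumes centroid files holding one numeric value per line:

```shell
#!/bin/sh
# Illustrative iteration-termination test using the default threshold 0.01.
THRESHOLD=0.01

# Largest absolute per-line difference between two centroid files
# (one numeric value per line).
max_shift() {
  paste "$1" "$2" |
    awk '{ d = $1 - $2; if (d < 0) d = -d; if (d > m) m = d } END { printf "%.6f\n", m }'
}

# Succeeds (exit 0) when no centroid value moved by more than THRESHOLD,
# i.e. when the iteration loop should stop.
converged() {
  awk -v s="$(max_shift "$1" "$2")" -v t="$THRESHOLD" 'BEGIN { exit !(s <= t) }'
}
```

A driver loop would then run one kmeans job per iteration, compare the new centroids against the previous ones with converged, and stop on success or after the requested number of iterations.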


Running kmeans is a three-step process:

1. Copy the centroids file to the local disk of all nodes at a suitable location (it can also be placed on HDFS, pointed to from the configuration file, and read from there).

2. Update the benchmark file and the script file with the path of the centroids file.

3. Run the benchmark using either of the commands above.


(uses the output of Multi-word-count)


Dataset1 40GB

Dataset2 110GB

Dataset3 300GB