PUMA Benchmarks and dataset downloads

For kmeans, the starting centroids file contains 6 cluster centroids, chosen as randomly selected movies. The file is currently assumed to be on the local disk of every node at /localhome/hadoop1/work/kmeans/centroids_file (the path can be changed in the benchmark file).

A single-iteration run can be performed with the command:

$ bin/hadoop jar hadoop-*-examples.jar kmeans -m <num_maps> -r <num_reduces> <input_dir> <output_dir>

A multiple-iteration run can be performed with the script kmeansdriver.sh. The centroids file path given above must also be updated in the script. The script currently assumes a default threshold of 0.01 as the iteration termination condition. The command to run the script is:

$ $hadoop/src/examples/org/apache/hadoop/examples/kmeans/kmeansdriver.sh <num_maps> <num_reduces> <input_dir> <output_dir> <num_iterations>
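
How kmeansdriver.sh applies its threshold is not described here. As an illustration of the general idea only, the sketch below treats a run as converged when no centroid coordinate moves by more than the threshold between two iterations. The function name and the file format (one centroid per line, whitespace-separated numbers, same line order in both files) are assumptions and may not match the PUMA centroids file.

```shell
#!/bin/sh
# Hypothetical check, not taken from kmeansdriver.sh.
# Succeeds (exit 0) when every coordinate differs by at most <threshold>
# between the old and new centroid files.
# Assumed format: one centroid per line, whitespace-separated numbers.
# Usage: centroids_converged <old_file> <new_file> [threshold]
centroids_converged() {
    old=$1
    new=$2
    threshold=${3:-0.01}
    awk -v t="$threshold" '
        # First file: remember each line by its line number.
        NR == FNR { prev[FNR] = $0; next }
        # Second file: compare coordinate by coordinate.
        {
            split(prev[FNR], o, " ")
            for (i = 1; i <= NF; i++) {
                d = $i - o[i]
                if (d < 0) d = -d
                if (d > t) moved = 1
            }
        }
        END { exit moved + 0 }
    ' "$old" "$new"
}
```

A driver loop could call this after each iteration and stop early when it succeeds, or stop after <num_iterations> otherwise.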

Running kmeans is a 3-step process:

1. Copy the centroids file to a suitable location on the local disk of every node (it can also be placed on HDFS as part of the configuration file and read from there).

2. Update the benchmark file and the script file with the path of the centroids file.

3. Run the benchmark using either of the commands above.
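
Step 1 can be scripted. The following is a minimal sketch and not part of PUMA: it assumes passwordless scp to every node and that the destination directory already exists there. The function name, the DRY_RUN variable, and the node names are hypothetical. With DRY_RUN=1 it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: copy the centroids file to the same path on every node.
# Usage: distribute_centroids <centroids_file> <dest_path> <node>...
distribute_centroids() {
    src=$1
    dest=$2
    shift 2
    for node in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry run: print the command instead of executing it.
            echo "scp $src $node:$dest"
        else
            # Stop at the first node that fails.
            scp "$src" "$node:$dest" || return 1
        fi
    done
}
```

For example, to preview the copy for two nodes (node1 and node2 are placeholder host names):

$ DRY_RUN=1 distribute_centroids centroids_file /localhome/hadoop1/work/kmeans/centroids_file node1 node2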

Ranked-Inverted-Index

(uses the output of Multi-word-count)

Dataset1 40GB

Dataset2 110GB

Dataset3 300GB