MapReduce Benchmarks
by Faraz Ahmad, Seyong Lee, Mithuna Thottethodi, T. N. Vijaykumar
MapReduce is a well-known programming model, developed within Google, for processing large amounts of raw data, for example, crawled documents or web request logs. This data is usually so large that it must be distributed across thousands of machines in order to be processed in a reasonable time. The ease of programmability, automatic data management, and transparent fault tolerance have made MapReduce a favorable choice for batch processing in large-scale data centers. Map, written by a user of the MapReduce library, takes an input pair and produces a set of intermediate key/value pairs. The library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function through an all-map-to-all-reduce communication called Shuffle. Reduce, also written by the user, receives an intermediate key along with a set of values from Map and merges these values together to produce the final output.

Hadoop is an open-source implementation of MapReduce which is regularly improved and developed by software developers and researchers and is maintained by the Apache Software Foundation. Despite vast efforts on the development of Hadoop MapReduce, there has not been much rigorous work on the benchmark side.
During our work on MapReduce, we developed a benchmark suite which represents a broad range of MapReduce applications, exhibiting application characteristics with high/low computation and high/low shuffle volumes. There are a total of 13 benchmarks, of which Tera-Sort, Word-Count, and Grep are from the Hadoop distribution. The rest of the benchmarks were developed in-house and are currently not part of the Hadoop distribution. The three benchmarks from the Hadoop distribution are slightly modified to take the number of reduce tasks as input from the user and to generate final completion-time statistics of jobs, as sketched below. The benchmarks and data sets can be downloaded from here.
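For reference, the way a benchmark run is wired together can be sketched as a small Hadoop job driver (Java, mapreduce API). The class name, argument order, and configuration calls below are illustrative assumptions rather than the suite's actual code; the sketch only shows where the user-supplied reduce-task count and the completion-time measurement fit in.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver; assumed args: -r <num-reduces> <input-dir> <output-dir>
public class BenchmarkDriver {
  public static void main(String[] args) throws Exception {
    int numReduces = Integer.parseInt(args[1]);            // value following "-r"
    Job job = Job.getInstance(new Configuration(), "puma-benchmark");
    job.setJarByClass(BenchmarkDriver.class);
    // job.setMapperClass(...); job.setReducerClass(...);  // benchmark-specific classes
    job.setNumReduceTasks(numReduces);                     // user-chosen reduce parallelism
    FileInputFormat.addInputPath(job, new Path(args[2]));
    FileOutputFormat.setOutputPath(job, new Path(args[3]));
    long start = System.currentTimeMillis();
    boolean ok = job.waitForCompletion(true);
    // Completion-time statistic, in the spirit of the modified benchmarks
    System.out.println("Job took " + (System.currentTimeMillis() - start) + " ms");
    System.exit(ok ? 0 : 1);
  }
}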
1. Word-Count counts the occurrences of each word in a large collection of documents. Map emits <word, 1> tuples. Reduce adds up the counts for a given word from all map tasks and outputs the final count.
Input format: any document (usually a web document in text/xml format)
Output format: <word> <count>
Dataset: Downloaded from http://dumps.wikimedia.org/enwiki/. Due to HDFS (Hadoop Distributed File System) limitations, the datasets needed some preprocessing: (i) copying all files from multiple hierarchical directories into a single directory, (ii) merging multiple files to create a small number of large files rather than a large number of small files, and (iii) eliminating file names containing special characters.
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar wordcount -r <num-reduces> <input-dir> <output-dir>
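As a concrete illustration of the Map and Reduce functions just described, a minimal Word-Count might be written against Hadoop's mapreduce API as follows. This is a sketch, not the benchmark's actual source; the distributed example is organized differently but computes the same result.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {
  // Map: for every word in the input line, emit <word, 1>
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts shuffled to this key and emit <word, count>
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
}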
2. Inverted-Index takes a list of documents as input and generates word-to-document indexing. Map emits <word, docId> tuples, with each word emitted once per docId. Reduce combines all tuples on key <word> and emits <word, list(docId)> tuples after removing duplicates.
Input format: any document (usually a web document in text/xml format)
Output format: <word> <docId>
Dataset: Same as for Word-Count.
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar invertedindex -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
Note: To let Hadoop figure out the number of map tasks, supply -m 1 here.
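A possible Hadoop sketch of Inverted-Index is shown below. Using the input file name as the docId and deferring all duplicate removal to Reduce are simplifying assumptions of this sketch.

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexSketch {
  // Map: emit <word, docId>, where the input file name stands in for the docId
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      docId.set(((FileSplit) ctx.getInputSplit()).getPath().getName());
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, docId);
      }
    }
  }

  // Reduce: deduplicate the docIds for a word and emit <word, list(docId)>
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      TreeSet<String> docs = new TreeSet<>();
      for (Text v : values) docs.add(v.toString());
      ctx.write(key, new Text(String.join(",", docs)));
    }
  }
}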
3. Term-Vector determines the most frequent words in a set of documents and is useful in analyses of a host's relevance to a search. Map emits <host, termvector> tuples, where termvector is itself a tuple of the form <word, 1>. Reduce discards the words whose frequency is below some cut-off, sorts the rest of the list per key in descending order of count, and emits tuples of the form <host, list(termvector)>.
Input format: any document (usually a web document in text/xml format)
Output format: <host> <termvector>
Dataset: Same as for Word-Count.
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar termvectorperhost -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
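The following sketch captures the Term-Vector idea. The file name standing in for the host, the cut-off value of 5, and carrying plain words as map values (with the count of 1 left implicit) are assumptions of this sketch.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TermVectorSketch {
  private static final int CUT_OFF = 5;   // assumed frequency threshold

  // Map: emit <host, word>; the input file name stands in for the host here
  public static class TvMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text host = new Text();
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      host.set(((FileSplit) ctx.getInputSplit()).getPath().getName());
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(host, word);
      }
    }
  }

  // Reduce: count words per host, drop rare words, sort by descending count
  public static class TvReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text host, Iterable<Text> words, Context ctx)
        throws IOException, InterruptedException {
      Map<String, Integer> freq = new HashMap<>();
      for (Text w : words) freq.merge(w.toString(), 1, Integer::sum);
      List<Map.Entry<String, Integer>> tv = new ArrayList<>();
      for (Map.Entry<String, Integer> e : freq.entrySet())
        if (e.getValue() >= CUT_OFF) tv.add(e);
      tv.sort((a, b) -> b.getValue() - a.getValue());
      StringBuilder out = new StringBuilder();
      for (Map.Entry<String, Integer> e : tv)
        out.append('<').append(e.getKey()).append(',').append(e.getValue()).append('>');
      ctx.write(host, new Text(out.toString()));
    }
  }
}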
4. Self-Join is similar to the candidate-generation part of the a-priori data mining algorithm: it generates associations among k+1 fields given the set of k-field associations. Map receives candidate lists of the form {element1, element2, ..., elementk}, each list in alphanumerically sorted order, and breaks these lists into <{element1, element2, ..., elementk-1}, {elementk}> tuples. Reduce prepares a sorted list of all the Map values for a given key by building <{element1, element2, ..., elementk-1}, {val1, val2, ..., valj}> tuples. From these tuples, (k+1)-sized candidates can be obtained by appending consecutive pairs of values vali, vali+1 to the (k-1)-sized key. By avoiding repetition of the (k-1)-sized key for every pair of values in the list, the tuples are a compact representation of the (k+1)-sized candidate set.
Input format: {e1, e2, ..., ek}
Output format: <e1, e2, ..., ek-1><ek, ek+1>
Dataset: Synthetic data (details here)
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar selfjoin -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
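A compact sketch of the Self-Join Map and Reduce logic, assuming comma-separated candidate lists. Emitting the sorted value list, from which consecutive value pairs give the (k+1)-sized candidates, mirrors the compact representation described above.

import java.io.IOException;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SelfJoinSketch {
  // Map: split a sorted k-field candidate {e1,...,ek} into key {e1,...,ek-1} and value {ek}
  public static class SplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String line = value.toString().trim();
      int cut = line.lastIndexOf(',');
      if (cut < 0) return;                                 // need at least two fields
      ctx.write(new Text(line.substring(0, cut)),          // e1,...,ek-1
                new Text(line.substring(cut + 1).trim())); // ek
    }
  }

  // Reduce: sort the values for a (k-1)-field key; consecutive value pairs appended
  // to the key form the (k+1)-sized candidates
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      TreeSet<String> sorted = new TreeSet<>();
      for (Text v : values) sorted.add(v.toString());
      ctx.write(key, new Text(String.join(",", sorted)));
    }
  }
}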
5. Adjacency-List is similar to a search-engine computation: it generates the adjacency and reverse-adjacency lists of the nodes of a graph for use by PageRank-like algorithms. Map receives as input the edges <p, q> of a directed graph that follows the power law of the World Wide Web; for the input, we assume the probability that a node has an out-degree of i is proportional to 1/i^skew, with an average out-degree of 7.2. Map emits tuples of the form <q, from_list{p}:to_list{}> and <p, from_list{}:to_list{q}>. For a given key, Reduce takes the union of the individual from_list fields and the union of the individual to_list fields, sorts the items within each union, and emits <p (and q), from_list{sorted union of all individual from_lists}:to_list{sorted union of all individual to_lists}> tuples.
Input format: {p,q}
Output format: <p><from{list_of_in_degree}:to{list_of_out_degree}>, <q><from{list_of_in_degree}:to{list_of_out_degree}>
Dataset: Synthetic data (details here)
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar adjlist -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
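One way to sketch Adjacency-List in Hadoop is shown below; the comma-separated edge layout and the F:/T: tags used to distinguish from-list and to-list contributions are assumptions of the sketch.

import java.io.IOException;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AdjacencyListSketch {
  // Map: for an edge p -> q, emit <q, F:p> (q is reached from p) and <p, T:q> (p points to q)
  public static class EdgeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] pq = value.toString().split(",");           // assumed "p,q" per line
      if (pq.length != 2) return;
      String p = pq[0].trim(), q = pq[1].trim();
      ctx.write(new Text(q), new Text("F:" + p));
      ctx.write(new Text(p), new Text("T:" + q));
    }
  }

  // Reduce: union the from/to contributions for a node and emit both sorted lists
  public static class ListReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text node, Iterable<Text> edges, Context ctx)
        throws IOException, InterruptedException {
      TreeSet<String> from = new TreeSet<>(), to = new TreeSet<>();
      for (Text e : edges) {
        String s = e.toString();
        if (s.startsWith("F:")) from.add(s.substring(2)); else to.add(s.substring(2));
      }
      ctx.write(node, new Text("from{" + String.join(",", from) + "}:to{" + String.join(",", to) + "}"));
    }
  }
}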
6. K-Means is a popular data mining algorithm that clusters the input data into k clusters and iterates to successively improve the clustering. We classify movies based on their ratings, using anonymized movie-rating data of the form <movie_id: list{rater_id, rating}>, and we use random starting values for the cluster centroids. Map computes the cosine-vector similarity of a given movie with the centroids and determines the centroid to which the movie is closest (i.e., the cluster to which it belongs). Map emits <centroid_id, (similarity_value, movie_data)> tuples, where movie_data is (movie_id: list{rater_id, rating}). Reduce determines the new centroids by computing the average similarity of all the movies in a cluster. The movie closest to the average is the new centroid, and Reduce emits the movie data along with its current centroid as well as the new centroid data (model file) for use in the next iteration. The algorithm iterates until the change in the centroids is below a threshold.
Input format: {movie_id: userid1_rating1, userid2_rating2, ...}
Output format: K-Means produces two types of outputs: (a) <centroid_num><{movie_id: userid1_rating1, userid2_rating2, ...}> (the list of all movies associated with a particular centroid) and (b) <centroid_num><{similarity_value}{centroid_movie_id}{num_members}{userid1_rating1, userid2_rating2, ...}> (the new centroid).
Dataset: movie ratings dataset.
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar kmeans_itertxt_hr -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
Note: To run multiple iterations of k-means, a wrapper script should be executed. The details can be found here.
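The map side of K-Means can be sketched as follows. The record layout, the helper names, and the elided loading of the previous iteration's centroid model file are assumptions of the sketch; the reduce side (averaging similarities and picking the new centroid) is omitted for brevity.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapSketch extends Mapper<LongWritable, Text, Text, Text> {
  // Centroids for the current iteration: centroid_id -> (rater_id -> rating).
  // In a real job these would be read from the previous iteration's model file
  // in setup(); that loading is elided in this sketch.
  private final Map<String, Map<String, Double>> centroids = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // Assumed record layout: "movie_id: rater1_rating1, rater2_rating2, ..."
    String[] parts = value.toString().split(":", 2);
    if (parts.length != 2) return;
    Map<String, Double> ratings = parseRatings(parts[1]);

    String best = null;
    double bestSim = -1.0;
    for (Map.Entry<String, Map<String, Double>> c : centroids.entrySet()) {
      double sim = cosine(ratings, c.getValue());
      if (sim > bestSim) { bestSim = sim; best = c.getKey(); }
    }
    if (best != null)   // emit <centroid_id, (similarity_value, movie_data)>
      ctx.write(new Text(best), new Text(bestSim + "\t" + value.toString()));
  }

  private static Map<String, Double> parseRatings(String s) {
    Map<String, Double> m = new HashMap<>();
    for (String tok : s.split(",")) {
      String[] rr = tok.trim().split("_");
      if (rr.length == 2) m.put(rr[0], Double.parseDouble(rr[1]));
    }
    return m;
  }

  // Cosine-vector similarity between two sparse rating vectors
  private static double cosine(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0, na = 0, nb = 0;
    for (double v : a.values()) na += v * v;
    for (double v : b.values()) nb += v * v;
    for (Map.Entry<String, Double> e : a.entrySet())
      dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
    return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
  }
}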
7. Classification classifies the movies into one of k pre-determined clusters. Like k-means, Classification uses anonymized movie-rating data of the form <movie_id: list{rater_id, rating}>, and its Map computes the cosine-vector similarity of a given movie with the centroids and determines the centroid to which the movie is closest (i.e., the cluster to which it belongs). Map emits <centroid_id, movie_id> tuples. Unlike k-means, the details of the movie ratings are not emitted, because there are no further iterations that might need them. Reduce collects all the movies in a cluster and emits <centroid_id, movie_id> tuples.
Input format: {movie_id: userid1_rating1, userid2_rating2, ...}
Output format: <centroid_num><movieid>
Dataset: movie ratings dataset.
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar classification -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
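Since Classification shares its map-side nearest-centroid computation with K-Means (except that it emits only <centroid_id, movie_id>), only the reduce side is sketched here; it simply lists the members of each cluster.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce side of Classification: collect the movie ids assigned to each cluster.
public class ClassificationReduceSketch extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text centroidId, Iterable<Text> movieIds, Context ctx)
      throws IOException, InterruptedException {
    for (Text movieId : movieIds)
      ctx.write(centroidId, movieId);       // one <centroid_id, movie_id> line per member
  }
}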
8. Histogram-Movies generates a histogram of the input data and is a generic tool used in many data analyses. We use the movie-rating data; the input is of the form <movie_id: list{rater_id, rating}>. Based on the average ratings of movies (ratings range from 1 to 5), we bin the movies into 8 bins, each with a range of 0.5. Map computes the average rating for a movie, determines the bin, and emits <bin_value, 1> tuples. Reduce collects all the tuples for a bin and outputs a <bin_value, n> tuple.
Input format: {movie_id: userid1_rating1, userid2_rating2, ...}
Output format: <bin_value><num_of_movies>
Dataset: movie ratings dataset.
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar histogram_movies -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
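A sketch of Histogram-Movies; the record parsing and the bin-labeling convention (the upper edge of each 0.5-wide bin) are assumptions of the sketch.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HistogramMoviesSketch {
  // Map: average a movie's ratings, place it in one of eight 0.5-wide bins, emit <bin, 1>
  public static class BinMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      // Assumed record layout: "movie_id: rater1_rating1, rater2_rating2, ..."
      String[] parts = value.toString().split(":", 2);
      if (parts.length != 2) return;
      double sum = 0; int n = 0;
      for (String tok : parts[1].split(",")) {
        String[] rr = tok.trim().split("_");
        if (rr.length == 2) { sum += Double.parseDouble(rr[1]); n++; }
      }
      if (n == 0) return;
      double avg = sum / n;                                 // ratings range from 1 to 5
      int idx = Math.min(7, (int) ((avg - 1.0) / 0.5));     // bin index 0..7
      double bin = 1.5 + idx * 0.5;                         // label = upper edge of the bin
      ctx.write(new Text(String.valueOf(bin)), ONE);
    }
  }

  // Reduce: count the movies that fell into each bin
  public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text bin, Iterable<IntWritable> ones, Context ctx)
        throws IOException, InterruptedException {
      int count = 0;
      for (IntWritable v : ones) count += v.get();
      ctx.write(bin, new IntWritable(count));
    }
  }
}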
9. Histogram-Ratings generates a histogram of the user-rating trend. The input is of the form <movie_id: list{rater_id, rating}>. Here, we bin the user ratings of 1-5 into 5 bins, and Map emits a <rating, 1> tuple for each review. Reduce collects all the tuples for a rating and emits a <rating, n> tuple.
Input format: {movie_id: userid1_rating1, userid2_rating2, ...}
Output format: <rating><num_of_user_reviews>
Dataset: movie ratings dataset.
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar histogram_ratings -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
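Histogram-Ratings differs from Histogram-Movies only in what is binned, so a map-side sketch suffices; the reduce side is the same counting reducer pattern as in Word-Count. The record layout assumed below is the same hypothetical one as in the earlier sketches.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map side of Histogram-Ratings: emit <rating, 1> for every individual review
public class HistogramRatingsMapSketch extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // Assumed record layout: "movie_id: rater1_rating1, rater2_rating2, ..."
    String[] parts = value.toString().split(":", 2);
    if (parts.length != 2) return;
    for (String tok : parts[1].split(",")) {
      String[] rr = tok.trim().split("_");
      if (rr.length == 2)
        ctx.write(new IntWritable((int) Double.parseDouble(rr[1])), ONE);  // the 1-5 rating is the key
    }
  }
}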
10. Sequence-Count generates a count of all unique sets of three consecutive words per document in the input data. Map emits <word1|word2|word3|filename, 1> tuples. Reduce adds up the counts for the multi-word sequences from all map tasks and outputs the final count.
Input format: any document (usually a web document in text/xml format)
Output format: <word1|word2|word3|filename> <count>
Dataset: Same as for Word-Count.
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar sequencecounts -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
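A map-side sketch of Sequence-Count; the reduce side is again the counting reducer from Word-Count. Sliding the three-word window only within a line is a simplification of this sketch.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map side of Sequence-Count: slide a three-word window over each line and emit
// <word1|word2|word3|filename, 1>.
public class SequenceCountMapSketch extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String fileName = ((FileSplit) ctx.getInputSplit()).getPath().getName();
    String[] w = value.toString().trim().split("\\s+");
    for (int i = 0; i + 2 < w.length; i++)
      ctx.write(new Text(w[i] + "|" + w[i + 1] + "|" + w[i + 2] + "|" + fileName), ONE);
  }
}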
11. Ranked-Inverted-Index takes a list of words and their frequencies per document and generates, for each word, the list of documents containing it in decreasing order of frequency. Map takes the Sequence-Count benchmark's output <word-sequence|filename, n> as its input and separates the count from the rest of the data; its output format is <word-sequence, {filename, n}>. Reduce takes all the map outputs and produces, per word-sequence, a list in decreasing order of occurrence in the respective documents: <word-sequence><{count1, file1}, {count2, file2}, ...>.
Input format: <word-sequence|filename><count>
Output format: <word-sequence> <count | file>
Dataset: Output of Sequence-Count
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar rankedinvertedindex -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
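A sketch of Ranked-Inverted-Index, assuming Sequence-Count's output is laid out as word-sequence|filename, a tab, and the count (the usual TextOutputFormat layout); the actual benchmark may parse its input differently.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RankedInvertedIndexSketch {
  // Map: split "word-sequence|filename<TAB>count" into key <word-sequence> and value "count,filename"
  public static class SplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] kv = value.toString().split("\t");
      if (kv.length != 2) return;
      int cut = kv[0].lastIndexOf('|');
      if (cut < 0) return;
      String seq = kv[0].substring(0, cut);
      String file = kv[0].substring(cut + 1);
      ctx.write(new Text(seq), new Text(kv[1].trim() + "," + file));
    }
  }

  // Reduce: order a word-sequence's (count, file) pairs by decreasing count
  public static class RankReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text seq, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> pairs = new ArrayList<>();
      for (Text v : values) pairs.add(v.toString());
      pairs.sort((a, b) -> Integer.compare(Integer.parseInt(b.split(",")[0]),
                                           Integer.parseInt(a.split(",")[0])));
      ctx.write(seq, new Text(String.join(" ", pairs)));
    }
  }
}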
12. Tera-Sort sorts 100-byte <key, value> tuples on the keys, where the key is a 10-byte field and the remaining bytes are the value (payload). Map is an identity function which simply reads and emits the tuples, and Reduce emits the sorted data to the final output; the sorting itself occurs in MapReduce's built-in sort, while the reduce tasks simply emit the sorted tuples.
Input format: {10-bytes}{90-bytes}
Output format: <10-bytes><90-bytes>
Dataset: Generated through TeraGen in Hadoop. Here is a sample 3GB dataset.
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar terasort <input-dir> <output-dir> <num-reduces>
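Because the sorting happens in the framework's shuffle, the user code for Tera-Sort reduces to identity functions, roughly as below (assuming an input format that presents Text keys and values). Hadoop's actual terasort additionally uses a sampling-based range partitioner so that the output is globally ordered across reduce tasks.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TeraSortIdentitySketch {
  // Map: pass each <key, value> record through untouched; the shuffle sorts by key
  public static class IdentityMap extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(key, value);
    }
  }

  // Reduce: emit the already-sorted records in order
  public static class IdentityReduce extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      for (Text v : values) ctx.write(key, v);
    }
  }
}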
13. Grep searches for a pattern in a file and is a generic search tool used in many data analyses. Each map task outputs lines containing the pattern as <regex, 1> tuples. The reduce task adds up the counts and emits <regex, n> tuples.
Input format: any document (usually a web document in text/xml format)
Output format: <regex> <count>
Dataset: Same as for Word-Count.
Command-line execution:
$ bin/hadoop jar hadoop-*-examples.jar grep <input-dir> <output-dir> <num-reduces> <regex> [<group>]
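A single-job sketch of the Grep counting phase; the grep.pattern configuration key is an assumption of this sketch, and Hadoop's own example additionally runs a second job to sort the counts.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GrepSketch {
  // Map: emit <match, 1> for every occurrence of the pattern in the input line
  public static class MatchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private Pattern pattern;
    @Override
    protected void setup(Context ctx) {
      // "grep.pattern" is a hypothetical configuration key for this sketch
      pattern = Pattern.compile(ctx.getConfiguration().get("grep.pattern", ".*"));
    }
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      Matcher m = pattern.matcher(value.toString());
      while (m.find()) ctx.write(new Text(m.group()), ONE);
    }
  }

  // Reduce: add up the per-map counts for each matched string
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }
}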