| MapReduce
    Benchmarks by Faraz Ahmad, Seyong Lee,
    Mithuna Thottethodi, T. N. Vijaykumar MapReduce
    is a well-known programming model, developed within Google, for processing
    large amounts of raw data, for example, crawled
    documents or web request logs. This data is usually so large that it must
    be distributed across thousands of machines in order to be processed in a
    reasonable time. The ease of programmability, automatic data management and
    transparent fault tolerance has made MapReduce a favorable choice for
    large-scale data centers batch processing. Map, written by a user of the
    MapReduce library, takes an input pair and produces a set of intermediate
    key/value pairs. The library groups together all intermediate values
    associated with the same intermediate key and passes them to the reduce
    function through an all-map-to-all-reduce communication called Shuffle.
    Reduce, also written by the user, receives intermediate key along with a
    set of values from Map and merges together these values to produce the
    final output. Hadoop is an open-source implementation of MapReduce which is
    being improved and developed regularly by software developers / researchers
    and is maintained by Apache Software Foundation.  Despite being vast
    efforts on the development of Hadoop MapReduce, there has not been a very
    rigorous work done on the benchmarks side.  During
    our work on MapReduce, we developed a benchmark suite which represents a
    broad range of MapReduce applications exhibiting application
    characteristics with high/low computation and high/low shuffle volumes.
    There are a total of 13 benchmarks, out of which Tera-Sort, Word-Count,
    and Grep are from Hadoop distribution. The rest of the benchmarks
    were developed in-house and are currently not part of the Hadoop
    distribution. The three benchmarks from Hadoop distribution are also
    slightly modified to take number of reduce tasks as input from the user and
    generate final time completion statistics of jobs. The
    benchmarks and data sets can be downloaded from here.   1.
    Word-Count counts the occurrences of each word in a large collection of documents.
    Map emits <word,1> tuples. Reduce adds up the counts for a given word from all map tasks
    and outputs the final count.   Input format: any document (usually a web document in text/xml format)   Output format: <word>
    <count> Dataset: Downloaded from http://dumps.wikimedia.org/enwiki/
    . Due to HDFS (Hadoop File System) limitations, the datasets
    needed some processing such as (i) copying all files from multiple
    hierarchical directories to one directory, (ii) merging multiple files
    together to create small number of large-sized files rather than large
    number of small-sized files, and (iii) eliminating special character file
    names.  Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar wordcount –r <num-reduces>
    <input-dir> <output-dir>     2.
    Inverted-Index takes a list of documents as input and generates word-to-document
    indexing. Map emits <word,
    docId> tuples with
    each word emitted once per docId. Reduce combines all tuples on key <word> and emits <word,list(docId)> tuples after removing duplicates.   Input format: any document (usually a web document in text/xml format)  Output format: <word>
    <docId> Dataset: Same as for Word-Count. Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar invertedindex –m<num-maps>
    -r <num-reduces> <input-dir> <output-dir> Note: To let Hadoop
    figure out the number of map tasks, supply –m 1
    here.     3.
    Term-Vector  determines the most frequent words in a set of documents and is useful in the
    analyses of a host’s relevance to a search. Map emits <host,termvector> tuples where termvector is itself a tuple
    of the form <word,
    1>. Reduce discards the words whose frequency is
    below some cut-off, sorts the rest of the list per key in a descending
    order with respect to count and emits tuples of the form <host, list(termvector)>.   Input format: any document (usually a web document in text/xml format)  Output format: <host>
    <termvector> Dataset: Same as for Word-Count. Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar termvectorperhost –m<num-maps>
    -r <num-reduces> <input-dir> <output-dir>     4.
    Self-Join is similar to the
    candidate generation part of the a priori
    data mining algorithm to generate association among k+1 fields
    given the set of k-field associations. Map receives candidate lists of the
    form {element1,
    element2, ...., elementk}, each list in alphanumerically sorted order. Map breaks these lists
    into <{element1, element2,
    ....,elementk-1}, {elementk}> tuples. Reduce prepares a sorted list of all the Map values for a
    given key by building <{element1,
    element2, ...., elementk-1}, {val1,val2,
    ...., valj}> tuples. From these tuples, (k+1)-sized candidates can be obtained
    by appending consecutive pairs of map values vali, vali+1 to the (k-1)-sized key. By avoiding repetition of (k-1)-sized keys
    for every pair of values in the list, tuples are a compact representation
    of the (k+1)-sized candidates set.   Input format: {e1,e2,
    ….., ek}  Output format: <e1,e2, ….. ek-1>< ek, ek+1 > Dataset: Synthetic data (details here) Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar selfjoin –m<num-maps>
    -r <num-reduces> <input-dir> <output-dir>     5.
    Adjacency-List is similar to search-engine computation to generate adjacency and
    reverse adjacency lists of nodes of a graph for use by PageRank-like
    algorithms. Map receives as inputs graph edges <p, q> of a directed graph that follows the power law of the World-wide
    Web. For the input, we assume the probability, that a node has an
    out-degree of i, is proportional to 1/iskew
    with an average out-degree of 7.2. Map emits tuples of the form <q, from_list{p}:to_list{}> and <p, from_list{}:to_list{q}>. For a given key, Reduce generates unions of the respective lists
    in the from_list and to_list fields, sorts the items within the union lists, and emits <p(and q), from_list{sorted union of all individual from_list}:to_list{sorted
    union of all individual to_list}> tuples.    Input format: {p,q}  Output format: <p><from{list_of_in_degree}:to{list_of_out_degree}>
    , <q><from{list_of_in_degree}:to{list_of_out_degree}> Dataset: Synthetic data (details here) Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar adjlist –m<num-maps> -r
    <num-reduces> <input-dir>
    <output-dir>     6. K-Means
     is a popular data mining algorithm to cluster input data into k
    clusters. K-means iterates to successively improve the clustering. We
    classify movies based on their ratings using anonymized
    movies rating data which is of the form <movie_id: list{rater_id, rating}>. We use random starting values for the cluster centroids. Map
    computes the cosine-vector similarity of a given movie with the centroids,
    and determines the centroid to which the movie is closest (i.e., the cluster
    to which it belongs). Map emits <centroid_id,
    (similarity_value, movie_data)> where movie_data is (movie_id:list{rater_id, rating}). Reduce
    determines the new centroids by computing the average of similarity of all
    the movies in a cluster. The movie closest to the average is the new
    centroid and Reduce emits movies data along with their current centroid as
    well as new centroids data (model file) for use in the next iteration. The
    algorithm iterates until the change in the centroids is below a threshold.    Input Format: {movie_id: userid1_rating1, userid2_rating2,
    ...} Output Format: kmeans produces two types of outputs: (a) <centroid_num><{movie_id:
    userid1_rating1, userid2_rating2, ...}>
    (list of all movies associated with a particular centroid) (b) <centroid_num><{similarity_value}{centroid_movie_id}{num_members}{userid1_rating1,
    userid2_rating2, …}> (new centroid} Datasets: movie ratings dataset. Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar kmeans_itertxt_hr –m <num-maps>
    -r <num-reduces> <input-dir> <output-dir> Note: To run multiple
    iterations of kmeans, a wrapper script should be
    executed. The details can be found here.     7.
    Classification  classifies the movies into one of k pre-determined clusters. Similar to
    k-means, classification uses anonymized movies
    rating data which is of the form <movie_id: list{rater_id, rating}>. Similar to k-means, Map computes the cosine vector similarity of a
    given movie with the centroids, and determines the centroid to which the
    movie is closest (i.e., the cluster to which it belongs). Map emits <centroid_id, movie_id)>. Unlike k-means, the details of movie ratings are not emitted
    because there are no further iterations which may need the details. Reduce
    collects all the movies in a cluster and emits <centroid_id, movie_id>.   Input Format: {movie_id: userid1_rating1, userid2_rating2,
    ….} Output Format: <centroid_num><movieid> Datasets: movie ratings dataset. Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar
    classification –m <num-maps> -r <num-reduces> <input-dir>
    <output-dir>     8.
    Histogram-Movies  generates a histogram of input data and is a generic tool used in many data
    analyses. We use the movie rating data and the input is of the form <movie_id: list{rater_id, rating}>.
    Based on the average ratings of movies (ratings range from 1 to 5) we bin
    the movies into 8 bins each with a range of 0.5. Map computes the average
    rating for a movie, determines the bin, and emits <bin_value, 1> tuples.
    Reduce collects all the tuples for a bin and outputs a <bin_value, n> tuple.   Input Format: {movie_id: userid1_rating1, userid2_rating2,
    ….} Output Format: <bin_value><num_of_movies> Datasets: movie ratings dataset. Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar histogram_movies –m <num-maps>
    -r <num-reduces> <input-dir> <output-dir>     9.
    Histogram-Ratings  generates a histogram of the user ratings trend. The input is of the form <movie_id: list{rater_id, rating}>.
    Here, we bin the user ratings of 1-5 into 5 bins and Map emits <rating, 1> tuple for each review. Reduce collects all the tuples for a rating
    and emits a <rating,
    n> tuple.   Input Format: {movie_id: userid1_rating1, userid2_rating2,
    ….} Output Format: <rating
    ><num_of_user_reviews> Datasets: movie ratings dataset. Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar histogram_ratings –m <num-maps>
    -r <num-reduces> <input-dir> <output-dir>     10.
    Sequence-Count  generates a count of all unique sets of three consecutive words per document
    in the input data. Map emits <word1|word2|word3|filename, 1> tuples. Reduce adds up the counts for the
    multi-words from all map tasks and outputs the final count.   Input format: any document (usually a web document in text/xml format)  Output format: <word1|word2|word3|filename>
    <count> Dataset: Same as for Word-Count. Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar sequencecounts –m <num-maps>
    –r <num-reduces> <input-dir> <output-dir>     11.
    Ranked-Inverted-Index  takes list of words and their frequencies per document and generates
    lists of documents containing the given words in decreasing order of
    frequency. Map takes sequence-count benchmark’s output <word-sequence|filename,n> as its input and separates counts from the rest of the data in the
    input. Map output format is <word-sequence, {filename,n}>. Reduce
    takes all map outputs and produces a list per word-sequence in decreasing
    order of occurrence in the respective documents <word-sequence><{count1, file1},{count2, file2}, …>.    Input format: <word-sequence|filename><count>  Output format: <word-sequence>
    <count | file> Dataset: Output of Sequence-Count Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar rankedinvertedindex –m <num-maps>
    –r <num-reduces> <input-dir> <output-dir>     12.
    Tera-Sort  sorts 100-byte <key,value> tuples on
    the keys where key is a 10-byte field and the rest of the bytes as value
    (payload). Map is identity function which simply reads and emits the tuples
    and Reduce emits the sorted data to the final output. The sorting occurs in
    MapReduce’s in-built sort while reduce tasks simply emit the sorted tokens.   Input format: {10-bytes}{90-bytes}   Output format: <10-bytes><90-bytes> Dataset: Generated through TeraGen in Hadoop. Here
    is a sample 3GB dataset. Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar terasort <input-dir>
    <output-dir> <num-reduces>     13. Grep  searches for a pattern in a file and is a generic search tool used in many
    data analyses. Each map task outputs lines containing either of the
    patterns as <regex,
    1> tuples. Reduce task adds
    up the counts and emits <regex, n> tuples.   Input format: any document (usually a web document in text/xml format)  Output format: <regex>
    <count> Dataset: Same as for Word-Count. Command-line execution: $
    bin/hadoop jar hadoop-*-examples.jar grep
    <input-dir> <output-dir>
    <num-reduces> <regex> [<group>]     |