The Recursive Benchmark Set for PLDI2015 Artifact Evaluation ------------------------------------------------------------- ************************************************************************************** *Version: 1.0 *Author: Bin Ren, Youngjoon Jo, Sriram Krishnamoorthy, Kunal Agrawal, Milind Kulkarni *Update: Feb 13, 2015 *Paper: Efficient Execution of Recursive Programs on Commodity Vector Hardware *Benchs: fib, binomial, nqueens, knapsack, graphcol, uts, parentheses, minmax ************************************************************************************** The overall objective of this document is to help users quickly start to use this benchmark set, and it contains the following descriptions: 0. Artifact evaluation expectations; 1. The evaluation environment; 2. The benchmark directory structure; 3. How to build the benchmarks; 4. How to run the benchmarks; 5. How to collect the numbers. -------------------------------------------------------------------------------------- 0. Artifact evaluation expectations ----------------------------------- Our paper presents techniques for restructuring recursive, task-parallel programs to run on vector hardware, and additional transformations to maintain good vector utilization. For artifact evaluation, we present our benchmarks in their original form and in several transformed forms, so that the reviewers can verify the results presented in the paper, especially, Table 2, Figures 8 through 10, and Figures 13 through 16. Other data can also be verified (except Table 4, which is the performance report of our multi-core versions). In sum, by running the benchmarks built in the build-host, build-mic, build-profile, build-distrib, build-reexp, and build-nostreamcompaction directories, users can verify the following data in the paper: i) The execution time / speedup for each version ii) Cache miss profiling data iii) SIMD utilization iv) Other profiling results 1. The evaluation environment ----------------------------- This benchmark has been comprehensively tested in the following environment: a. For the CPU versions: Intel E5-2670 (8-core, 2.6GHz) Sandy Bridge, 32KB L1 Cache/Core, 20MB last level cache, 128-bit SSE4.2 b. For the Xeon Phi versions: Intel Xeon Phi 61-core SE10P coprocessor (1.1 GHz), 32KB L1 Cache/Core, 512KB L2 Cache/Core, 512-bit AVX512 All codes are compiled with Intel icc-13.3.163 compiler with -O3 optimization (cpu flags: -O3 -DNDEBUG -unroll -DSSE41 -D__SSE4_1 -msse4.1). The Xeon Phi codes are compiled with "-mmic" (mic flags: -O3 -funroll-loops -mmic), running in native mode. All codes are single core versions. In case the users do not have Intel icc compiler available, the sequential code without any SSE or AVX (target "plain" in the rules.mk) can be built and run on CPU host only using the g++ compiler (make plain CC=g++). 2. The benchmark directory structure ------------------------------------ This benchmark set contains the following sub-directories: a. benchmark application source codes (generically referred to as ): fib, binomial, nqueens, graphcol, knapsack, uts, parentheses, minmax. Each directory contains seven versions: -base : base version; -block-noreexp : blocked breadth-first execution followed by depth-first without reexpansion optimization and without simd; -block-reexp : blocked breadth-first execution followed by depth-first with reexpansion optimization and without simd; -block-noreexp-sse: blocked breadth-first execution followed by depth-first without reexpansion optimization and with sse; -block-reexp-sse : blocked breadth-first execution followed by depth-first with reexpansion optimization and with sse; -block-noreexp-mic: blocked breadth-first execution followed by depth-first without reexpansion optimization and with avx512; -block-reexp-mic : blocked breadth-first execution followed by depth-first with reexpansion optimization and with avx512; The sse versions are only executable on the CPU, while the mic versions are only executable on the Xeon Phi. For all blocked versions, we have corresponding "block-(sse|mic).h" header file to provide the necessary "blocked software stack management code". For some benchmarks, such as knapsack and uts, we have some more files, and the basic rule to organize these files is that the files ended with "-sse"|"-mic" are the files that are only used by "sse"|"mic" code, and the other files are shared by sequential codes. b. Three shared utility directories: common : some common header files for sequential and vectorized code; block : some block profiler codes; harness: the common "main" function for all benchmarks. c. An input directory -- inputs: data38N64E.col: used by graphcol, and users DON'T need to specify it during the running of this code; knapsack-example1/2/3.input: used by knapsack, and users can use "-benchmark short", "-benchmark medium" and "-benchmark long" to specify the input, and this will be explained in Section 3. uts_test20.input: used by uts, and users need to specify it. d. Six build directories for users to test our benchmarks (build the code at first): build-host: After build (see Section 3), all the CPU versions will be included in this directory. Users can test the absolute performance of each version to verify the efficacy of our "simd blocked" method. Please refer to Section 4 and 5 for more details. build-mic: Similar to build-host, this one contains all the XEON PHI versions. build-profile: This directory contains the versions with profile information. By running this codes, users can mainly verify the SIMD utilization data for CPU SEQUENTIAL BLOCK versions. build-distrib: This directory contains the CPU SEQUENTIAL BLOCK NOREEXP version for all benchmarks to collect the task distribution information. build-reexp: This directory contains the CPU SEQUENTIAL BLOCK REEXP version for all benchmarks to collect the reexpansion benefit information. build-nostreamcompaction: This directory contains the CPU and XEON PHI versions for fib and nqueens to show the simd performance without stream compaction optimization. NOTE: 1) If icc compiler and Xeon Phi is available, users can build all of these versions, otherwise, if users do not have icc compiler, please use "make plain CC=g++", and users can run the CPU SEQUENTIAL BLOCK versions. In such case, please only verify the "SIMD" utilization profiling data. Such profiling is a sequential simulation, so we don't need any real simd support. Please check Section 4,5 for details about running such profiling. 2) The absolute performance reported in our paper is only for our specific execution environment, and if you have a different one, the performance may be different due to different compiler versions, CPU frequencies or cache sizes and cache behavior. e. The Makefile and related files for building: Makefile, common.mk, and rules.mk 3. How to build the benchmarks ------------------------------- If icc compiler and a Xeon Phi coprocessor are available, please type the make command in the benchmarks/ (root of the artifact package) directory to build the executables: $ make It builds all versions in the build-host/, build-mic/, build-profile/, build-distrib/, build-reexp/, and build-nostreamcompaction/ directories. Here are some recommendations if you don't have icc compiler or Xeon Phi: 1) Without the icc compiler, users can NOT compile the SIMD versions including sse and mic versions. For example, if you have only GNU g++ installed, please use the following command in benchmarks/ directory to make the base version, and sequential blocked versions on CPU -- please NOTE: in such case, you can NOT verify the absolute performance data in our paper, but can still verify the SIMD utilization (Figure 8): $ make plain CC=g++ 2) If users have icc compiler available, but don't have a Xeon Phi coprocessor, please build as: $ make cpu These are all CPU versions. In this case, you can NOT verify the performance data on MIC. 4. How to run the benchmarks ---------------------------- Below are instructions to execute these benchmarks from the command line. We explain the procedure in two steps for each benchmark: step 1 -- the command line parameter format; step 2 -- the application parameters used in our paper as an example. a. fib: base version: ./fib-base.x other versions: ./fib-XX.x Example: ./fib-base.x 45 or ./fib-block-reexp-sse.x 45 9 Here 45 is the fib-number used in our paper, while 9 denotes the block size is 2^9, i.e. 512, which is the best block size for this version -- Please check our paper for more information about the best block sizes for other versions. b. binomial: base version: ./binomial-base other versions: ./binomial-XX Example: ./binomial-base.x 36 13 or ./binomial-block-reexp-sse.x 36 13 18 Here 36, 13 are the application inputs used in our paper, while 18 denotes the block size is 2^18 (262144) -- the best block size for this version. c. nqueens: base version: ./nqueens-base.x <#queens> other versions: ./nqueens-XX.x <#queens> Example: ./nqueens-base.x 13 or ./nqueens-block-reexp-sse.x 13 15 Here 13 is the number of queens used in our paper, while 15 denotes the block size is 2^15 (32768) -- the best block size for this version. d. graphcol: base version: ./graphcol-base.x <#color> other versions: ./graphcol-XX.x <#color> Example: ./graphcol-base.x 3 or ./graphcol-block-reexp-sse.x 3 8 Here 3 means we are doing a 3-color problem used in our paper, while 8 denotes the block size is 2^8 (256) -- the best block size for this version. e. knapsack: base version: ./knapsack-base.x -benchmark other versions: ./knapsack-XX.x -benchmark -b Example: ./knapsack-base.x -benchmark long or ./knapsack-block-reexp-sse.x -benchmark long -b 11 Here "-benchmark long" means that we use the largest data set (inputs/knapsack-example3.input) from the cilk benchmark set in our paper, while "-b 11" means that the block size is 2^11 (2048) -- the best block size for this version. NOTE: we need to use "-b 11" rather than "11". f. uts: base version: ./uts-base.x other versions: ./uts-XX.x Example: ./uts-base.x ../inputs/uts_test20.input or ./uts-block-reexp-sse.x ../inputs/uts_test20.input 16384 Here ../inputs/uts_test20.input is our input configuration, and 16384 is the best block size for this version. NOTE: here the block size is in the absolute value, rather than 2^n format. This is because we also want to support some block size like 1000, 2000, and so on for this benchmark. g. parentheses: base version: ./parentheses-base.x other versions: ./parentheses-XX.x Example: ./parentheses-base.x 19 or ./parentheses-block-reexp-sse.x 19 11 Here 19 is the number of generated parentheses pairs used in our paper, while 11 denotes the block size is 2^11 (2048) -- the best block size for this version. h. minmax: base version: ./minmax-base.x <#position> other versions: ./minmax-XX.x <#position> Example: ./minmax-base.x 12 or ./minmax-block-reexp-sse.x 12 10 Here 12 is the number of placed positions in this minmax game used in our paper, while 10 denotes the block size is 2^10 (1024) -- the best block size for this version. We provide some simple scripts in build-host, build-mic and build-profile directories: run-all-host.py, run-all-mic.py, and run-all-profile.py. (Similarly, we have some sample python scripts for other build directories.) Users can refer to these python scripts to see how to run each benchmark. Please note, users may need some modification to these scripts to achieve their own running objective. 5. How to collect the numbers ------------------------------ Below are detailed instructions for collecting the numbers presented in the paper. i) The execution time / speedup for each version After building our benchmarks, please enter the build-host/ and/or build-mic/ directories, and follow the above instructions to test the absolute execution time of our codes, and to verify the efficacy of our blocked simd execution. Please notice: a) use the appropriate block size (you can also choose different block sizes to see the performance trend shown in our paper); and b) the sequential blocked code cannot yield the performance benefit, and only the simd code can show performance benefit, i.e., the performance curve in our paper shows the blocked simd performance. In sum, to test the execution time of our code, one should test: -base, -block-noreexp-sse (similarly -mic) and -block-reexp-sse (similarly -mic) versions and set up the correct block size. NOTE: In order to test the pure breadth first execution, set the block size large enough for -block-noreexp-sse (similarly -mic), i.e, fib:30, binomial:31, knapsack:31, nqueens:24, graphcol:25, uts:14, parentheses:30, and minmax:26. ii) The cache miss profiling data To get the cache miss profiling data as shown in our paper, please also enter the build-host/ and/or build-mic/ directories, and test on -base, -block-noreexp-(sse|mic) and -block-reexp-(sse|mic), three versions. Our work uses two kinds of profilers: valgrind/cachegrind for CPU, and Intel Vtune for Xeon Phi, and here are two examples for fib: on CPU (build-host): valgrind --tool=cachegrind ./fib-block-reexp-sse.x 45 10 (You can also specify the result place: --cachegrind-out-file=resultX) on MIC (build-mic): amplxe-cl -collect knc-general-exploration ./fib-block-reexp-mic.x 45 10 (You can also specify the result place: -result-dir resultX ) iii) The simd utilization. The simd utilization is data from a sequential simulation. You can get this data without any real simd execution. In other words, if users have only g++ installed, this data is the most (or even the only) reasonable one to verify. Please enter the directory: build-profile, and test on two versions: -block-noreexp.x and -block-reexp.x. For all benchmarks except knapsack, we profile the simd utilization with simd width 16, and for knapsack, we profile the simd width 8. By varying the block size, we can get some curves as shown in our paper. NOTE: According to our paper, we define the simd utilization as the work percentage executed with full simd width, i.e., the number followed "work 16:" (or "work 8:" for knapsack) in the profiling output. iv) Other profiling results. Our paper also shows some other profiling data, such as the distribution of tasks for each benchmark, the benefit of re-expansion, and the speedup with and without stream compaction. Here are some simple instructions to get such profiling information: a. distribution of tasks: enter the build-distrib/ directory and run -block-noreexp.x code to get such information. When running this code, it is better to set the block size as the largest one, i.e, fib:30, binomial:31, knapsack:31, nqueens:24, graphcol:25, uts:14, parentheses:30, and minmax:26. b. benefit of re-expansion: enter the build-reexp/ directory and run -block-reexp.x code to get such information. When running this code, please use the best block size for CPU, i.e., fib:9, binomial:18, nqueens:15, graphcol:8, knapsack:11, uts:14, parentheses:11, minmax:10. c. stream compaction benefit: test only for fib and nqueens, for sse and mic versions. Enter the build-nostreamcompaction/ directory and run (fib|nqueens)-block-(noreexp|reexp)-(sse|mic).x code to get the simd performance without stream compaction. Please contact us by the e-mail address below in case you have any further questions on running this code. ********************************** Contact: Bin Ren bin.ren@pnnl.gov **********************************