RELEASE ------- Omp2gpu 0.2 (Sept. 10, 2010) Omp2gpu is a OpenMP-to-CUDA translator built on top of Cetus compiler infrastructure. The translation framework 1) supports fully automatic OpenMP-to-CUDA translation with various optimizations and 2) includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants. FEATURES/UPDATES ---------------- ****************************************************** * Compiler flags used for OpenMP-to-CUDA translation * ****************************************************** ################################### # Safe, always-beneficial Options # ################################### -cudaMemTrOptLevel=N where N = 1, 2, or 3 CUDA CPU-GPU memory transfer optimization level (0-4) (default is 2) CF1: If N > 3, aggressive, may-unsafe optimizations are applied. CF2: If N >= 3, "cudaMallocOptLevel" flag is automatically set to 1. CF3: Optimization level of "globalGMallocOpt" flag is also controlled by this flag. ############################################################################# # Safe, always-beneficial Options, but resources may limit its application. # ############################################################################# -useGlobalGMalloc Allocate GPU variables as global variables to reduce memory transfers between CPU and GPU -useMatrixTranspose Apply MatrixTranspose optimization in OMP2CUDA translation. This optimization is applicable when a threadprivate variable is an array of 1 dimension. -useMallocPitch Use cudaMallocPitch() in OMP2CUDA translation. This transformation is applicable when a shared variable is an array of 2 dimensions. ############################################################# # May-beneficial Options, which interact with other options # ############################################################# -useLoopCollapse Apply LoopCollapse optimization in OMP2GPU translation. -useParallelLoopSwap Apply ParallelLoopSwap optimization in OMP2GPU translation. -useUnrollingOnReduction Apply loop unrolling optimization for in-block reduction in OMP2GPU translation; to apply this opt, thread block size, BLOCK_SIZE = 2^m and m > 0. Each kernel region can be tuned using "noreductionunroll" clause. ################################################################### # Always-beneficial Options, but inaccuracy of the analysis may # # break correctness (aggressive, unsafe options) # ################################################################### -globalGMallocOpt Optimize global GPU variable allocation to reduce memory transfers; to use this option, "useGlobalGMalloc" flag should be on. CF1: If "cudaMemTrOptLevel" > 3, aggressive, may-unsafe optimizations are applied. -cudaMemTrOptLevel=N where N = 4 CUDA CPU-GPU memory transfer optimization level (0-4) (default is 2) CF1: If N < 4, always-beneficial, safe optimizations are applied. CF2: If N >= 3, "cudaMallocOptLevel" flag is automatically set to 1. -cudaMallocOptLevel=1 CUDA Malloc optimization level (0-1) (default is 0) CF1: If "cudaMemTrOptLevel" flag has value bigger than 2, this flag is automatically set to 1. #################################################################### # Safe, always-beneficial Options, but user's approval is required # #################################################################### -assumeNonZeroTripLoops Assume that all loops have non-zero iterations. If a user approves this, more accurate analyses can be applied. ############################################################## # May-beneficial options, which interact with other options. # # These options are not needed if kernel-specific options # # are used. # ############################################################## -shrdSclrCachingOnReg Cache shared scalar variables onto GPU registers. CF1: Each kernel region can be tuned using registerRO, registerRW, and noregister clauses. -shrdArryElmtCachingOnReg Cache shared array elements onto GPU registers. CF1: Each kernel region can be tuned using registerRO, registerRW, and noregister clauses. -shrdSclrCachingOnSM Cache shared scalar variables onto GPU shared memory. CF1: Each kernel region can be tuned using sharedRO, sharedRW, and noshared clauses. -prvtArryCachingOnSM Cache private array variables onto GPU shared memory. CF1: Each kernel region can be tuned using sharedRO, sharedRW, and noshared clauses. -shrdArryCachingOnTM Cache 1-dimensional, R/O shared array variables onto GPU Texture memory. CF1: Each kernel region can be tuned using texture and notexture clauses. #################################################### # Non-tunable options, but user may need to apply # # some of these to generate correct output code. # #################################################### -disableCritical2ReductionConv Disable Critical-to-reduction conversion pass. -doNotRemoveUnusedSymbols Do not remove unused local symbols in procedures. -UEPRemovalOptLevel=N, Optimization level (0-3) to remove upwardly exposed private (UEP) variables (default is 0, which does not apply this optimization). CF1: This optimization may be unsafe; this should be enabled only if UEP problems occur, and programmer should verify the correctness manually. -forceSyncKernelCall If enabled, cudaThreadSynchronize() call is inserted right after each kernel call to force explicit synchronization; useful for debugging. ################################## # Options for CUDA configuration # ################################## -cudaThreadBlockSize=number Size of CUDA thread block (default value = 128) -maxNumOfCudaThreadBlocks=N Maximum number of CUDA ThreadBlocks; if this option is on, tiling transformation code is added to fit work partition into the thread batching specified by this flag and "cudaThreadBlockSize" flag. -cudaMaxGridDimSize=size Maximum size of each dimension of a grid of thread blocks (System maximum = 65535) -cudaGridDimSize=size Size of each dimension of a grid of thread blocks, when thread block ID is a 2-dimensional array (max = 65535, default value = 10000). -cudaGlobalMemSize=size Size of CUDA global memory in bytes (default value = 1600000000) Used for debugging ###################### # Options for tuning # ###################### -cudaConfFile=filename Name of the file that contains CUDA configuration parameters. Any valid OpenMP-to-GPU compiler flags can be put in the file; one flag per line, and lines starting with '#' will be ignored. CF1: Using this file, a user does not have to specify compiler flags of interest in a command-line input. CF2: The file should exist in the current directory. -cudaUserDirectiveFile=filename Name of the file that contains user-directives. Using this file, user can annotate each OpenMP parallel region without inserting user-directives directly into the input source code. Each line in the file has the following format: kernelid(id) procname(pname) [clause[[,] clause]...] where clause is one of the OpenMPC clauses used in "cuda gpurun" directives (explained in the next section) kernelid and procname clauses are automatically inserted by translator for each eligible kernel region. To find kernel IDs for kernel regions in an input program, a user have to run the translator at least once. (Running translator with OmpKernelSplitOnly option will show kernelID information with minimal changes on the input program.) The file should exist in the current directory. -extractTuningParameters=filename Extract tuning parameters; output will be stored in the specified file. (Default is TuningOptions.txt) The generated file contains information on tuning parameters (compiler flags and user-directives) applicable to the current input program. -genTuningConfFiles=tuningdir Generate tuning configuration files and/or user-directive files. Each tuning configuration file (and a user-directive file) constitutes a point in a optimization search space, defined for the current input program. A user can create a set of CUDA code variants, applied with different optimizations, using the generated output files. Output files will be stored in the specified directory. (Default is tuning_conf) -tuningLevel=N Set tuning level when genTuningConfFiles option is on; N = 1 (exhaustive search on program-level tuning options, default) N = 2 (exhaustive search on kernel-level tuning options) -defaultTuningConfFile=filename Name of the file that contains default CUDA tuning configurations, such as flags that are always applied and flags that are excluded. This file is used to set up an optimization search space, where a tuning system navigates to find the best performance. If the file does no exist, system-default setting will be used. (Default file name is cudaTuning.config) ######################### # Options for debugging # ######################### -showGResidentGVars After each function call, show globally allocated GPU variables that are still residing in GPU global memory; this works only if "globalGMallocOpt" flag is on. (this info can be used for debugging.) -addSafetyCheckingCode Add GPU-memory-usage checking code just before each kernel call (Used for debugging) -addCudaErrorCheckingCode Add CUDA-error-checking code right after each kernel call. If this option is on, "forceSyncKernelCall" option is suppressed, since the error-checking code contains a built-in synchronization call. Used for debugging -OmpAnalysisOnly Conduct OpenMP analysis only. -OmpKernelSplitOnly Generate kernel-split OpenMP codes without CUDA translation. *************************** * OpenMPC User-directives * *************************** - The following four types of user-directives can be used to annotate each OpenMP parallel region. #pragma cuda gpurun [clause[[,] clause]...] where clause is one of the following c2gmemtr(list) noc2gmemtr(list) g2cmemtr(list) nog2cmemtr(list) registerRO(list) registerRW(list) noregister(list) sharedRO(list) sharedRW(list) noshared(list) texture(list) notexture(list) constant(list) noconstant(list) maxnumofblocks(nblocks) noreductionunroll(list) nocudamalloc(list) nocudafree(list) cudafree(list) noploopswap noloopcollapse threadblocksize(N) #pragma cuda cpurun [clause[[,] clause]...] where clause is one of the following c2gmemtr(list) noc2gmemtr(list) g2cmemtr(list) nog2cmemtr(list) #pragma cuda ainfo procname(proc-name) kernelid(id) #pragma cuda nogpurun - Brief description of OpenMPC clauses c2gmemtr(list) : Set the list of variables to be transferred from CPU to GPU noc2gmemtr(list) : Set the list of variables not to be transferred from CPU to GPU g2cmemtr(list) : Set the list of variables to be transferred from GPU to CPU nog2cmemtr(list) : Set the list of variables not to be transferred from GPU to CPU registerRO(list) : Cache R/O variables in the list onto GPU registers registerRW(list) : Cache R/W variables in the list onto GPU registers noregister(list) : Set the list of variables not to be cached on GPU registers sharedRO(list) : Cache R/O variables in the list onto GPU shared memory sharedRW(list) : Cache R/W variables in the list onto GPU shared memory noshared(list) : Set the list of variables not to be cached on GPU shared memory texture(list) : Cache variables in the list onto GPU texture memory notexture(list) : Set the list of variables not to be cached on GPU texture memory maxnumofblocks(nblocks) : Set maximum number of CUDA thread blocks for a kernel threadblocksize(N) : Set CUDA thread block size for a kernel noreductionunroll(list) : Do not apply loop unrolling for in-block reduction nocudamalloc(list) : Set the list of variables not to be CUDA-mallocated nocudafree(list) : Set the list of variables not to be CUDA-freed cudafree(list) : Set the list of variables to be CUDA-freed noploopswap : Do not apply Parallel Loop-Swap optimization noloopcollapse : Do not apply Loop Collapse optimization - Usage of OpenMPC clauses "registerRO" may contain R/O shared scalar variables R/O shared array element (ex: a[i]) "registerRW" may contain R/W shared scalar variables R/W shared array element (ex: a[i]) "noregister" may contain R/O or R/W shared scalar variables R/O or R/W shared array element (ex: a[i]) "sharedRO" may contain R/O shared scalar variables R/O private array variables "sharedRW" may contain R/W shared scalar variables (not yet implemented) R/W private array variables "noshared" may contain R/O or R/W shared scalar variables R/O or R/W private array variables "texture" may contain R/O 1-dimensional shared array "notexture" may contain R/O 1-dimensional shared array ********************************************** * Optimization Search Space Setup For Tuning * ********************************************** - If "genTuningConfFiles" flag is used, the O2G translator 1) analyzes an input program and finds all applicable tuning parameters, which consist of O2G compiler flags and OpenMPC clauses, and 2) generates a set of tuning configuration files and/or user-directive files for searching the given parameter space. (Each configuration file and user-directive file can be fed to the translator using "cudaConfFile" and "cudaUserDirectiveFile" flags.) A user can further restrict the search space using optimization-space-setup file ( "defaultTuningConfFile" flag is used to specify the setup file.) - The optimization-space-setup file has the following format: defaultGOptionSet(list) where list is a comma-separated list of O2G compiler flags that will be always applied. excludedGOptionSet(list) where list is a comma-separated list of O2G compiler flags that will not be applied. In the above two types, the following flags can be used. assumeNonZeroTripLoops useGlobalGMalloc globalGMallocOpt cudaMallocOptLevel cudaMemTrOptLevel useMatrixTranspose useMallocPitch useLoopCollapse useParallelLoopSwap useUnrollingOnReduction shrdSclrCachingOnReg shrdArryElmtCachingOnReg shrdSclrCachingOnSM prvtArryCachingOnSM shrdArryCachingOnTM cudaThreadBlockSize maxNumOfCudaThreadBlocks For "defaultGOptionSet", additional, following flags can be used. disableCritical2ReductionConv UEPRemovalOptLevel forceSyncKernelCall doNotRemoveUnusedSymbols cudaMemTrOptLevel=N cudaMallocOptLevel=N UEPRemovalOptLevel=N cudaThreadBlockSet(list) where list is a comma-separated list of numbers maxnumofblocksSet(list) where list is a comma-separated list of numbers ********************** * Additional feature * ********************** - O2G translator accepts OpenMP "collapse" clauses, which can be used to collapse nested, parallel loops into a single loop with bigger iteration size. REQUIREMENTS ------------ Refer to readme file for Cetus. INSTALLATION ------------ Refer to readme file for Cetus. RUNNING OpenMP-to-GPU TRANSLATOR -------------------------------- Users can run the O2G translator in the following ways: $ java -classpath omp2gpu.exec.OMP2GPUDriver The "user_class_path" should include the class paths of Antlr and Cetus. LIMITATIONS ----------- - Current O2G translator generates CUDA code compatible with NVCC version 1.0. - NVCC V1.0 does not support inlining of external functions, and thus all device functions called in a kernel function should be in the same file where they are called. - OpenMP shared variables accessed in a kernel region (an OpenMP parallel region to be transformed into a kernel function) should be either scalar or array types. - OpenMP shared array variables accessed in the kernel region should have all dimension information, which is needed for O2G translator to allocate GPU memory for the shared array variables. For example, in the following function, if the function parameter, a, is an OpenMP shared variable, it may not be handled by the current O2G translator. void kernel_func( float a[] ) { ... } - Current reduction-transformation pass can not handle the cases where reduction is used in a function called in a parallel region. - To enforce a global synchronization in the CUDA model, an OpenMP parallel region is split into two sub-regions at each of OpenMP synchronization constructs, but this splitting may break correctness by resulting in upwardly exposed private variables. To check this problem, UEPrivateAnalysis is invoked at the end of O2G translation, and if a problem is detected, a programmer should fix it by modifying an input OpenMP program manually. Here are some suggestions: 1) Try a built-in UEP Removal optimization (UEPRemovalOptLevel=N, where N = 1, 2, or 3). Because this optimization may be unsafe, the programmer should verify the correctness manually. 2) Remove unnecessary synchronizations such as removing unnecessary barriers or adding nowait clause to omp-for-loop annotations if applicable. 3) If all participating threads have the same value for the private variable, - Either add firstprivate OpenMP clause to related omp parallel regions. - Or if the first method is not applicable, change the private variable to shared variable, and enclose statements that write the variable with omp single directive. CF: Due to the inaccuray of a compiler analysis, the warning messages generated by UEPrivateAnalysis may be false ones. Here are some reasons causing wrong warning: 1) Current UEPrivateAnalysis conducts an intraprocedural analysis; if a function is called inside of the target procedure, the analysis conservatively assumes all variables of interest are accessed in the called function, and thus analysis may result in overly estimated, false outputs. 2) Private arrays can be falsely included to the UEUSE set; even if they are initialized in for-loops, compiler may not be sure of their initialization due to the possibility of zero-trip loops. 3) Current UEPrivateAnalysis handles scalar and array expressions, but not pointer expressions. Therefore, if a memory region is accessed both by a pointer expression and by an array expression, the analysis may not be able to return accurate results. REFERENCE --------- - To cite this work, use the following papers: [1] Seyong Lee and Rudolf Eigenmann, "OpenMPC: Extended OpenMP Programming and Tuning for GPUs", in SC'10: Proceedings of the 2010 ACM/IEEE conference on Supercomputing. New Orleans, Louisiana, USA: IEEE Press, 2010. BibTeX: @inproceedings{LeeSC10, author = {Seyong Lee and Rudolf Eigenmann}, title = {OpenMPC: Extended OpenMP Programming and Tuning for GPUs}, booktitle = {SC'10: Proceedings of the 2010 ACM/IEEE conference on Supercomputing}, publisher = {IEEE press}, year = {2010}, pages = {} } [2] Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann, "OpenMP to GPGPU: A compiler framework for automatic translation and optimization", in PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming. New York, NY, USA: ACM, Feb. 2009, pp. 101-110. BibTeX: @inproceedings{LeePPOPP09, author = {Seyong Lee and Seung-Jai Min and Rudolf Eigenmann}, title = {OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization}, booktitle = {PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming}, publisher = {ACM}, year = {2009}, pages = {101--110} }