RELEASE ------- Omp2gpu 0.1 (May 21, 2010) Omp2gpu is a OpenMP-to-CUDA translator built on top of Cetus compiler infrastructure. The translation framework 1) supports fully automatic OpenMP-to-CUDA translation with various optimizations and 2) includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants. FEATURES/UPDATES ---------------- ****************************************************** * Compiler flags used for OpenMP-to-CUDA translation * ****************************************************** -OmpAnalysisOnly Conduct OpenMP analysis only -OmpKernelSplitOnly Generate kernel-split OpenMP codes without CUDA translation -cudaGlobalMemSize=size Size of CUDA global memory in bytes (default value = 1600000000) Used for debugging -cudaMaxGridDimSize=size Maximum size of each dimension of a grid of thread blocks (System maximum = 65535) -cudaGridDimSize=size Size of each dimension of a grid of thread blocks, when thread block ID is a 2-dimensional array (max = 65535, default value = 10000). -cudaThreadBlockSize=size Size of CUDA thread block (default value = 256) -maxNumOfCudaThreadBlocks=N Maximum number of CUDA ThreadBlocks; if this option is on, tiling transformation code is added to fit work partition into the thread batching specified by this option and cudaThreadBlockSize. -useLoopCollapse Apply LoopCollapse optimization in OMP2GPU translation -useMallocPitch Use cudaMallocPitch() in OMP2CUDA translation -useMatrixTranspose Apply MatrixTranspose optimization in OMP2CUDA translation -useParallelLoopSwap Apply ParallelLoopSwap optimization in OMP2GPU translation -useUnrollingOnReduction Apply loop unrolling optimization for in-block reduction OMP2GPU translation; to apply this opt, thread block size (BLOCK_SIZE) = 2^m and m > 0. -assumeNonZeroTripLoops Assume that all loops have non-zero iterations. If a user approves this, more accurate analysis can be applied. -cudaMallocOptLevel=N CUDA Malloc optimization level (0-1) (default is 0) If N > 0, aggressive, may-unsafe optimizations are applied. -cudaMemTrOptLevel=N CUDA CPU-GPU memory transfer optimization level (0-4) (default is 3) If N > 3, aggressive, may-unsafe optimizations are applied. -useGlobalGMalloc Allocate GPU variables as global variables to reduce memory transfers between CPU and GPU -globalGMallocOpt Optimize global GPU variable allocation to reduce memory transfers; to use this option, useGlobalGMalloc should be on. If cudaMemTrOptLevel > 3, aggressive, may-unsafe optimizations are applied. -showGResidentGVars After each function call, show globally allocated GPU variables that are still residing in GPU global memory; this works only if globalGMallocOpt is on. (this info can be used for debugging.) -shrdSclrCachingOnReg Cache shared scalar variables onto GPU registers -shrdArryElmtCachingOnReg Cache shared array elements onto GPU registers -shrdSclrCachingOnSM Cache shared scalar variables onto GPU shared memory -prvtArryCachingOnSM Cache private array variables onto GPU shared memory -shrdArryCachingOnTM Cache 1-dimensional, R/O shared array variables onto GPU Texture memory -cudaConfFile=filename Name of the file that contains CUDA configuration parameters. Any valid OpenMP-to-GPU compiler flags can be put in the file; one flag per line, and lines starting with '#' will be ignored. Using this file, a user does not have to specify compiler flags of interest in a command-line input. The file should exist in the current directory. -cudaUserDirectiveFile=filename Name of the file that contains user-directives. Using this file, user can annotate each OpenMP parallel region without inserting user-directives directly into the input source code. Each line in the file has the following format: kernelid(id) procname(pname) [clause[[,] clause]...] where clause is one of the OpenMPC clauses used in "cuda gpurun" directives (explained in the next section) kernelid and procname clauses are automatically inserted by translator for each eligible kernel region. To find kernel IDs for kernel regions in an input program, a user have to run the translator at least once. (Running translator with OmpKernelSplitOnly option will show kernelID information with minimal changes on the input program.) The file should exist in the current directory. -extractTuningParameters=filename Extract tuning parameters; output will be stored in the specified file. (Default is TuningOptions.txt) The generated file contains information on tuning parameters (compiler flags and user-directives) applicable to the current input program. -genTuningConfFiles=tuningdir Generate tuning configuration files and/or user-directive files. Each tuning configuration file (and a user-directive file) constitutes a point in a optimization search space, defined for the current input program. A user can create a set of CUDA code variants, applied with different optimizations, using the output files. Output files will be stored in the specified directory. (Default is tuning_conf) -tuningLevel=N Set tuning level when genTuningConfFiles option is on; N = 1 (exhaustive search on program-level tuning options, default) N = 2 (exhaustive search on kernel-level tuning options) -defaultTuningConfFile=filename Name of the file that contains default CUDA tuning configurations, such as flags that are always applied and flags that are excluded. This file is used to set up an optimization search space, where a tuning system navigates to find the best performance. If the file does no exist, system-default setting will be used. (Default file name is cudaTuning.config) -addSafetyCheckingCode Add GPU-memory-usage checking code just before each kernel call (Used for debugging) -forceSyncKernelCall If enabled, cudaThreadSynchronize() call is inserted right after each kernel call to force explicit synchronization. (Useful for debugging) -addCudaErrorCheckingCode Add CUDA-error-checking code right after each kernel call. If this option is on, forceSyncKernelCall option is suppressed, since the error-checking code contains a built-in synchronization call. (Used for debugging) -disableCritical2ReductionConv Disable Critical-to-reduction conversion pass *************************** * OpenMPC User-directives * *************************** - The following four types of user-directives can be used to annotate each OpenMP parallel region. #pragma cuda gpurun [clause[[,] clause]...] where clause is one of the following c2gmemtr(list) noc2gmemtr(list) g2cmemtr(list) nog2cmemtr(list) registerRO(list) registerRW(list) noregister(list) sharedRO(list) sharedRW(list) noshared(list) texture(list) notexture(list) constant(list) noconstant(list) maxnumofblocks(nblocks) noreductionunroll(list) nocudamalloc(list) nocudafree(list) cudafree(list) noploopswap noloopcollapse threadblocksize(N) #pragma cuda cpurun [clause[[,] clause]...] where clause is one of the following c2gmemtr(list) noc2gmemtr(list) g2cmemtr(list) nog2cmemtr(list) #pragma cuda ainfo procname(proc-name) kernelid(id) #pragma cuda nogpurun - Brief description of OpenMPC clauses c2gmemtr(list) : Set the list of variables to be transferred from CPU to GPU noc2gmemtr(list) : Set the list of variables not to be transferred from CPU to GPU g2cmemtr(list) : Set the list of variables to be transferred from GPU to CPU nog2cmemtr(list) : Set the list of variables not to be transferred from GPU to CPU registerRO(list) : Cache R/O variables in the list onto GPU registers registerRW(list) : Cache R/W variables in the list onto GPU registers noregister(list) : Set the list of variables not to be cached on GPU registers sharedRO(list) : Cache R/O variables in the list onto GPU shared memory sharedRW(list) : Cache R/W variables in the list onto GPU shared memory noshared(list) : Set the list of variables not to be cached on GPU shared memory texture(list) : Cache variables in the list onto GPU texture memory notexture(list) : Set the list of variables not to be cached on GPU texture memory maxnumofblocks(nblocks) : Set maximum number of CUDA thread blocks for a kernel threadblocksize(N) : Set CUDA thread block size for a kernel noreductionunroll(list) : Do not apply loop unrolling for in-block reduction nocudamalloc(list) : Set the list of variables not to be CUDA-mallocated nocudafree(list) : Set the list of variables not to be CUDA-freed cudafree(list) : Set the list of variables to be CUDA-freed noploopswap : Do not apply Parallel Loop-Swap optimization noloopcollapse : Do not apply Loop Collapse optimization ********************************************** * Optimization Search Space Setup For Tuning * ********************************************** - If "genTuningConfFiles" flag is used, the O2G translator 1) analyzes an input program and finds all applicable tuning parameters, which consist of O2G compiler flags and OpenMPC clauses, and 2) generates a set of tuning configuration files and/or user-directive files for searching the given parameter space. (Each configuration file and user-directive file can be fed to the translator using "cudaConfFile" and "cudaUserDirectiveFile" flags.) A user can further restrict the search space using optimization-space-setup file ( "defaultTuningConfFile" flag is used to specify the setup file.) - The optimization-space-setup file has the following format: defaultGOptionSet(list) where list is a comma-separated list of O2G compiler flags that will be always applied. excludedGOptionSet(list) where list is a comma-separated list of O2G compiler flags that will not be applied. In the above two types, the following flags can be used. assumeNonZeroTripLoops useGlobalGMalloc globalGMallocOpt cudaMallocOptLevel cudaMemTrOptLevel useMatrixTranspose useMallocPitch useLoopCollapse useParallelLoopSwap useUnrollingOnReduction shrdSclrCachingOnReg shrdArryElmtCachingOnReg shrdSclrCachingOnSM prvtArryCachingOnSM shrdArryCachingOnTM cudaThreadBlockSize maxNumOfCudaThreadBlocks cudaMemTrOptLevel=N cudaMallocOptLevel=N cudaThreadBlockSet(list) where list is a comma-separated list of numbers maxnumofblocksSet(list) where list is a comma-separated list of numbers REQUIREMENTS ------------ Refer to readme file for Cetus. INSTALLATION ------------ Refer to readme file for Cetus. RUNNING OpenMP-to-GPU TRANSLATOR -------------------------------- Users can run the O2G translator in the following ways: $ java -classpath omp2gpu.exec.OMP2GPUDriver The "user_class_path" should include the class paths of Antlr and Cetus. LIMITATIONS ----------- - Current O2G translator generates CUDA code compatible with NVCC version 1.0. - NVCC V1.0 does not support inlining of external functions, and thus all device functions called in a kernel function should be in the same file where they are called. - OpenMP shared variables accessed in a kernel region (an OpenMP parallel region to be transformed into a kernel function) should be either scalar or array types. - OpenMP shared array variables accessed in the kernel region should have all dimension information, which is needed for O2G translator to allocate GPU memory for the shared array variables. For example, in the following function, if the function parameter, a, is an OpenMP shared variable, it may not be handled by the current O2G translator. void kernel_func( float a[] ) { ... } - Current reduction-transformation pass can not handle the cases where reduction is used in a function called in a parallel region. - To enforce a global synchronization in the CUDA model, an OpenMP parallel region is split into two sub-regions at each of OpenMP synchronization constructs, but this splitting may break correctness by resulting in upwardly exposed private variables. To check this problem, UEPrivateAnalysis is invoked at the end of O2G translation, and if a problem is detected, a programmer should fix it manually. (The analysis may generate false output due to inaccuracy.) Ex1: If the UEP variable is a read-only private variable, changing it to firstprivate variable can solve the problem. Ex2: Removing unnecessary barriers may solve the problem. REFERENCE --------- - To cite this work, use the following paper: Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann, "OpenMP to GPGPU: A compiler framework for automatic translation and optimization", in PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming. New York, NY, USA: ACM, Feb. 2009, pp. 101-110. BibTeX: @inproceedings{LeePPOPP09, author = {Seyong Lee and Seung-Jai Min and Rudolf Eigenmann}, title = {OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization}, booktitle = {PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming}, publisher = {ACM}, year = {2009}, pages = {101--110} }