RELEASE
-------
Omp2gpu 0.1 (May 21, 2010)

Omp2gpu is a OpenMP-to-CUDA translator built on top of Cetus compiler
infrastructure. The translation framework 1) supports fully automatic
OpenMP-to-CUDA translation with various optimizations
and 2) includes tuning capabilities for generating, pruning, and 
navigating the search space of compilation variants.


FEATURES/UPDATES
----------------
******************************************************
* Compiler flags used for OpenMP-to-CUDA translation *
******************************************************
  -OmpAnalysisOnly        
    Conduct OpenMP analysis only
  -OmpKernelSplitOnly
    Generate kernel-split OpenMP codes without CUDA translation
  -cudaGlobalMemSize=size
    Size of CUDA global memory in bytes (default value = 1600000000)
    Used for debugging
  -cudaMaxGridDimSize=size
    Maximum size of each dimension of a grid of thread blocks
    (System maximum = 65535)
  -cudaGridDimSize=size
    Size of each dimension of a grid of thread blocks, when thread
    block ID is a 2-dimensional array (max = 65535, default value = 10000).
  -cudaThreadBlockSize=size
    Size of CUDA thread block (default value = 256)
  -maxNumOfCudaThreadBlocks=N
    Maximum number of CUDA ThreadBlocks; if this option is on,
    tiling transformation code is added to fit work partition
    into the thread batching specified by this option and cudaThreadBlockSize. 
  -useLoopCollapse
    Apply LoopCollapse optimization in OMP2GPU translation
  -useMallocPitch
    Use cudaMallocPitch() in OMP2CUDA translation
  -useMatrixTranspose
    Apply MatrixTranspose optimization in OMP2CUDA translation
  -useParallelLoopSwap
    Apply ParallelLoopSwap optimization in OMP2GPU translation
  -useUnrollingOnReduction
    Apply loop unrolling optimization for in-block reduction
    OMP2GPU translation; to apply this opt, thread block size
    (BLOCK_SIZE) = 2^m and m > 0.  
  -assumeNonZeroTripLoops
    Assume that all loops have non-zero iterations. If a user approves
    this, more accurate analysis can be applied.
  -cudaMallocOptLevel=N
    CUDA Malloc optimization level (0-1) (default is 0)
    If N > 0, aggressive, may-unsafe optimizations are applied.
  -cudaMemTrOptLevel=N
    CUDA CPU-GPU memory transfer optimization level (0-4) (default is 3)
    If N > 3, aggressive, may-unsafe optimizations are applied.
  -useGlobalGMalloc
    Allocate GPU variables as global variables to reduce memory transfers
    between CPU and GPU 
  -globalGMallocOpt
    Optimize global GPU variable allocation to reduce memory transfers;
    to use this option, useGlobalGMalloc should be on.  
    If cudaMemTrOptLevel > 3, aggressive, may-unsafe optimizations are applied.
  -showGResidentGVars
    After each function call, show globally allocated GPU variables
    that are still residing in GPU global memory; this works only if 
    globalGMallocOpt is on. (this info can be used for debugging.)
  -shrdSclrCachingOnReg
    Cache shared scalar variables onto GPU registers
  -shrdArryElmtCachingOnReg
    Cache shared array elements onto GPU registers
  -shrdSclrCachingOnSM
    Cache shared scalar variables onto GPU shared memory
  -prvtArryCachingOnSM
    Cache private array variables onto GPU shared memory
  -shrdArryCachingOnTM
    Cache 1-dimensional, R/O shared array variables onto GPU Texture memory
  -cudaConfFile=filename
    Name of the file that contains CUDA configuration parameters.
    Any valid OpenMP-to-GPU compiler flags can be put in the file;
    one flag per line, and lines starting with '#' will be ignored.
    Using this file, a user does not have to specify compiler flags 
    of interest in a command-line input.
    The file should exist in the current directory.
  -cudaUserDirectiveFile=filename
    Name of the file that contains user-directives. 
    Using this file, user can annotate each OpenMP parallel region
    without inserting user-directives directly into the input source 
    code. 
    Each line in the file has the following format:
      kernelid(id) procname(pname) [clause[[,] clause]...]
        where clause is one of the OpenMPC clauses used in "cuda gpurun"
        directives (explained in the next section)
    kernelid and procname clauses are automatically inserted by translator
    for each eligible kernel region. To find kernel IDs for kernel 
    regions in an input program, a user have to run the translator 
    at least once. (Running translator with OmpKernelSplitOnly option will 
    show kernelID information with minimal changes on the input program.)
    The file should exist in the current directory.
  -extractTuningParameters=filename
    Extract tuning parameters; output will be stored in the specified
    file. (Default is TuningOptions.txt)
    The generated file contains information on tuning parameters
    (compiler flags and user-directives) applicable to the current input program.
  -genTuningConfFiles=tuningdir
    Generate tuning configuration files and/or user-directive files.
    Each tuning configuration file (and a user-directive file) constitutes
    a point in a optimization search space, defined for the current 
    input program.
    A user can create a set of CUDA code variants, applied with different 
    optimizations, using the output files.
    Output files will be stored in the specified directory. 
    (Default is tuning_conf)
  -tuningLevel=N
    Set tuning level when genTuningConfFiles option is on;
    N = 1 (exhaustive search on program-level tuning options, default)
    N = 2 (exhaustive search on kernel-level tuning options)
  -defaultTuningConfFile=filename
    Name of the file that contains default CUDA tuning configurations,
    such as flags that are always applied and flags that are excluded.
    This file is used to set up an optimization search space, where
    a tuning system navigates to find the best performance.
    If the file does no exist, system-default setting will be used.
    (Default file name is cudaTuning.config) 
  -addSafetyCheckingCode
    Add GPU-memory-usage checking code just before each kernel call
    (Used for debugging)
  -forceSyncKernelCall
    If enabled, cudaThreadSynchronize() call is inserted right after
    each kernel call to force explicit synchronization. 
    (Useful for debugging)
  -addCudaErrorCheckingCode
    Add CUDA-error-checking code right after each kernel call.
    If this option is on, forceSyncKernelCall option is suppressed, 
    since the error-checking code contains a built-in synchronization
    call. 
    (Used for debugging)
	-disableCritical2ReductionConv
    Disable Critical-to-reduction conversion pass

***************************
* OpenMPC User-directives *
***************************
- The following four types of user-directives can be used to annotate 
  each OpenMP parallel region.
  #pragma cuda gpurun [clause[[,] clause]...]

    where clause is one of the following
      c2gmemtr(list) 
      noc2gmemtr(list) 
      g2cmemtr(list) 
      nog2cmemtr(list)
      registerRO(list) 
      registerRW(list) 
      noregister(list)
      sharedRO(list) 
      sharedRW(list) 
      noshared(list) 
      texture(list) 
      notexture(list) 
      constant(list)
      noconstant(list)
      maxnumofblocks(nblocks)
      noreductionunroll(list)
      nocudamalloc(list)
      nocudafree(list)
      cudafree(list)
      noploopswap
      noloopcollapse
      threadblocksize(N)

  #pragma cuda cpurun [clause[[,] clause]...]

    where clause is one of the following
      c2gmemtr(list) 
      noc2gmemtr(list) 
      g2cmemtr(list) 
      nog2cmemtr(list)

  #pragma cuda ainfo procname(proc-name) kernelid(id)

  #pragma cuda nogpurun

- Brief description of OpenMPC clauses
  c2gmemtr(list)   : Set the list of variables to be transferred from CPU to GPU
  noc2gmemtr(list) : Set the list of variables not to be transferred from CPU to GPU 
  g2cmemtr(list)   : Set the list of variables to be transferred from GPU to CPU 
  nog2cmemtr(list) : Set the list of variables not to be transferred from GPU to CPU
  registerRO(list) : Cache R/O variables in the list onto GPU registers 
  registerRW(list) : Cache R/W variables in the list onto GPU registers 
  noregister(list) : Set the list of variables not to be cached on GPU registers
  sharedRO(list)   : Cache R/O variables in the list onto GPU shared memory 
  sharedRW(list)   : Cache R/W variables in the list onto GPU shared memory 
  noshared(list)   : Set the list of variables not to be cached on GPU shared memory
  texture(list)    : Cache variables in the list onto GPU texture memory 
  notexture(list)  : Set the list of variables not to be cached on GPU texture memory 
  maxnumofblocks(nblocks) : Set maximum number of CUDA thread blocks for a kernel
  threadblocksize(N)      : Set CUDA thread block size for a kernel
  noreductionunroll(list) : Do not apply loop unrolling for in-block reduction
  nocudamalloc(list)      : Set the list of variables not to be CUDA-mallocated
  nocudafree(list)        : Set the list of variables not to be CUDA-freed
  cudafree(list)          : Set the list of variables to be CUDA-freed
  noploopswap             : Do not apply Parallel Loop-Swap optimization
  noloopcollapse          : Do not apply Loop Collapse optimization

**********************************************
* Optimization Search Space Setup For Tuning *
**********************************************
- If "genTuningConfFiles" flag is used, the O2G translator 1) analyzes 
  an input program and finds all applicable tuning parameters, which consist of 
  O2G compiler flags and OpenMPC clauses, and 2) generates a set of tuning 
  configuration files and/or user-directive files for searching the given 
  parameter space. (Each configuration file and user-directive file can be
  fed to the translator using "cudaConfFile" and "cudaUserDirectiveFile" flags.)
  A user can further restrict the search space using 
  optimization-space-setup file ( "defaultTuningConfFile" flag is used to 
  specify the setup file.)

- The optimization-space-setup file has the following format:
  defaultGOptionSet(list)
    where list is a comma-separated list of O2G compiler flags that will
    be always applied.

  excludedGOptionSet(list)
    where list is a comma-separated list of O2G compiler flags that will
    not be applied.

    In the above two types, the following flags can be used.
      assumeNonZeroTripLoops
      useGlobalGMalloc
      globalGMallocOpt
      cudaMallocOptLevel     
      cudaMemTrOptLevel
      useMatrixTranspose
      useMallocPitch
      useLoopCollapse
      useParallelLoopSwap
      useUnrollingOnReduction
      shrdSclrCachingOnReg
      shrdArryElmtCachingOnReg
      shrdSclrCachingOnSM
      prvtArryCachingOnSM
      shrdArryCachingOnTM
      cudaThreadBlockSize
      maxNumOfCudaThreadBlocks

  cudaMemTrOptLevel=N

  cudaMallocOptLevel=N

  cudaThreadBlockSet(list)
    where list is a comma-separated list of numbers

  maxnumofblocksSet(list)
    where list is a comma-separated list of numbers
  

REQUIREMENTS
------------
Refer to readme file for Cetus.


INSTALLATION
------------
Refer to readme file for Cetus.


RUNNING OpenMP-to-GPU TRANSLATOR
--------------------------------
Users can run the O2G translator in the following ways:

  $ java -classpath <user_class_path> omp2gpu.exec.OMP2GPUDriver <options> <C files>

The "user_class_path" should include the class paths of Antlr and Cetus.

LIMITATIONS
-----------
- Current O2G translator generates CUDA code compatible with NVCC version 1.0.
  - NVCC V1.0 does not support inlining of external functions, and thus 
    all device functions called in a kernel function should be in the same
    file where they are called.

- OpenMP shared variables accessed in a kernel region (an OpenMP parallel region 
  to be transformed into a kernel function) should be either scalar or array types.
  - OpenMP shared array variables accessed in the kernel region should have all 
    dimension information, which is needed for O2G translator to allocate GPU memory 
    for the shared array variables.
    For example, in the following function, if the function parameter, a, is an OpenMP
    shared variable, it may not be handled by the current O2G translator.
    
      void kernel_func( float a[] ) { ... }  

- Current reduction-transformation pass can not handle the cases where
  reduction is used in a function called in a parallel region.

- To enforce a global synchronization in the CUDA model, an OpenMP parallel region
  is split into two sub-regions at each of OpenMP synchronization constructs, but
  this splitting may break correctness by resulting in upwardly exposed private 
  variables. To check this problem, UEPrivateAnalysis is invoked at the end of 
  O2G translation, and if a problem is detected, a programmer should fix it manually.
  (The analysis may generate false  output due to inaccuracy.)
		Ex1: If the UEP variable is a read-only private variable, changing it to firstprivate
		variable can solve the problem.
		Ex2: Removing unnecessary barriers may solve the problem.

REFERENCE
---------
- To cite this work, use the following paper:

  Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann, "OpenMP to GPGPU: A compiler
  framework for automatic translation and optimization", in PPoPP '09: Proceedings
  of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming.
  New York, NY, USA: ACM, Feb. 2009, pp. 101-110.

  BibTeX:

@inproceedings{LeePPOPP09,
  author = {Seyong Lee and Seung-Jai Min and Rudolf Eigenmann},
  title = {OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization},
  booktitle = {PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming},
  publisher = {ACM},
  year = {2009},
  pages = {101--110}
}