Try the web demo of the OpenMP-to-CUDA converter here. The demo also includes OpenMPC features.

OpenMPC: Extended OpenMP Programming and Tuning for GPUs

Seyong Lee, Seung-Jai Min, and Professor Rudolf Eigenmann
School of Electrical and Computer Engineering, Purdue University


  • General-Purpose Graphics Processing Units (GPGPUs) have recently emerged as powerful vehicles for general-purpose high-performance computing. Although NVIDIA's Compute Unified Device Architecture (CUDA) programming model offers improved programmability for general computing, programming GPGPUs remains complex and error-prone.
  • OpenMP has established itself as an important method and language extension for programming shared-memory parallel computers. There are several advantages of OpenMP as a programming paradigm for GPGPUs.
    • OpenMP is efficient at expressing loop-level parallelism in applications, which makes it an ideal vehicle for exploiting a GPU's highly parallel computing units to accelerate data-parallel computations.
    • The concept of a master thread and a pool of worker threads in OpenMP's fork-join model represents well the relationship between the master thread running in a host CPU and a pool of threads in a GPU device.
    • OpenMP supports incremental parallelization of applications, and this benefit carries over to GPGPU programming.
  • However, using OpenMP as a front-end programming model for GPGPUs may not be sufficient on its own: because OpenMP is a platform-independent programming model, it gives programmers little control over the fine-grained tuning needed to achieve optimal GPGPU performance.
  • OpenMPC - OpenMP extended for CUDA

  • We propose a new programming interface, called OpenMPC, which consists of the standard OpenMP API plus a new set of directives and environment variables to control important CUDA-related parameters and optimizations. OpenMPC addresses two important issues in GPGPU programming: programmability and tunability. As a front-end programming model, OpenMPC provides programmers with abstractions of the complex CUDA programming model and high-level control over various optimizations and CUDA-related parameters.
  • We have developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC. In addition to a range of compiler transformations and optimizations, the system includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants.
  • Overall Compilation Flow

    Fig. 1. Overall Compilation Flow. When the compilation system is used for automatic tuning, additional passes, marked as (A) in the figure, are invoked between the CUDA Optimizer and the O2G Translator (see Figure 2).

    Figure 1 shows the overall flow of the compilation.
    • The Cetus Parser reads the input OpenMPC program and generates an internal representation (Cetus IR).
    • The OpenMP Analyzer recognizes standard OpenMP directives and analyzes the program to find all OpenMP shared, threadprivate, private, and reduction variables that are explicitly and implicitly used in each parallel region. The analyzer also identifies implicit barriers by OpenMP semantics and adds explicit barrier statements at each implicit synchronization point.
    • The Kernel Splitter divides parallel regions at each synchronization point to enforce synchronization semantics under the CUDA programming model.
    • The OpenMPC-directive Handler annotates each kernel region with a directive to assign a unique ID and parses a user directive file, if present. The handler also processes possible OpenMPC directives present in the input program.
    • The OpenMP Stream Optimizer transforms traditional CPU-oriented OpenMP programs into OpenMP programs optimized for GPGPUs, and the CUDA Optimizer performs CUDA-specific optimizations. Both optimization passes express their results in the form of OpenMPC directives in the Cetus IR.
    • In the last pass, the O2G Translator performs the actual code transformations according to the directives provided either by a user or by the optimization passes.

    Prototype Tuning System

    Fig. 2. Overall Tuning Framework. In the figure, the input OpenMPC code is the output IR of the CUDA Optimizer in the compilation system (see Figure 1).

    We have created a prototype tuning system, shown in Figure 2. The overall tuning process is as follows:
    • The search space pruner analyzes an input OpenMPC program plus optional user settings, which exist as annotations in the input program, and suggests applicable tuning parameters.
    • The tuning configuration generator builds a search space, prunes it further using a user-provided optimization space setup file if one exists, and generates tuning configuration files for the resulting search space.
    • For each tuning configuration, the O2G translator generates an output CUDA program.
    • The tuning engine produces executables from the generated CUDA programs and measures their performance by running them.
    • The tuning engine decides a direction for the next search and requests the tuning configuration generator to generate new configurations.
    • The last three steps are repeated as needed.
    In this tuning framework, a programmer can replace the tuning engine with any custom engine; all other steps, from finding tunable parameters to the complex code changes for each tuning configuration, are handled automatically by the proposed compilation system.


    Seyong Lee and Rudolf Eigenmann, "OpenMPC: Extended OpenMP Programming and Tuning for GPUs," SC10: Proceedings of the 2010 ACM/IEEE Conference on Supercomputing (Best Student Paper Award), November 2010.
    Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann, "OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization," Symposium on Principles and Practice of Parallel Programming (PPoPP'09), February 2009.

    Software Download

  • OpenMPC compiler
  • Funding

    This work is supported in part by the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.


  • Seyong Lee (E-mail: lees2 AT ornl DOT gov) (Home page)