Pagoda: A GPU Runtime System for Narrow Tasks

Abstract

Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize their hardware, contemporary workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain fewer than 512 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. This article presents Pagoda, a runtime system that virtualizes GPU resources using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available and are scheduled by the MasterKernel at the warp granularity. This level of control enables the GPU to keep scheduling and executing tasks as long as free warps are found, dramatically reducing underutilization. Experimental results on real hardware demonstrate that Pagoda achieves a geometric mean speedup of 5.52X over PThreads running on a 20-core CPU, 1.76X over CUDA-HyperQ, and 1.44X over GeMTC, the state-of-the-art runtime GPU task scheduling system.
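To illustrate the idea of an OS-like daemon kernel that schedules narrow tasks at warp granularity, the following is a minimal CUDA sketch of a persistent kernel in which each warp repeatedly claims one pending task from a device-side queue and executes it cooperatively. All names here (Task, task_pool, master_kernel, the queue variables) are hypothetical and are not Pagoda's actual API; the real system's task representation and scheduler are described in the paper.

```cuda
// Hypothetical sketch: persistent daemon kernel dispatching narrow tasks
// at warp granularity. Not Pagoda's real implementation.
#include <cuda_runtime.h>

struct Task {
    int    n;     // number of elements this narrow task touches
    float *data;  // task input/output buffer
};

__device__ volatile int shutdown   = 0;  // host sets this to stop the daemon
__device__ volatile int queue_tail = 0;  // total tasks published by the host
__device__ int          queue_head = 0;  // next task index to claim
__device__ Task         task_pool[1024]; // simple fixed-size task pool

__global__ void master_kernel() {
    int lane = threadIdx.x % warpSize;
    while (!shutdown) {
        int task_id = -1;
        if (lane == 0) {
            // Lane 0 of the warp tries to claim the next pending task.
            int head = queue_head;
            if (head < queue_tail &&
                atomicCAS(&queue_head, head, head + 1) == head) {
                task_id = head;
            }
        }
        // Broadcast the claimed task id to every lane of the warp.
        task_id = __shfl_sync(0xffffffff, task_id, 0);
        if (task_id < 0) continue;  // nothing claimed; poll again

        // The warp's 32 lanes cooperatively execute the narrow task.
        Task t = task_pool[task_id];
        for (int i = lane; i < t.n; i += warpSize)
            t.data[i] *= 2.0f;      // placeholder task body
    }
}
```

In this sketch the host would launch master_kernel once, then publish tasks by filling task_pool and advancing queue_tail; free warps immediately pick up new work without further kernel launches, which is the underutilization-reducing behavior the abstract describes.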

Publication
In ACM Transactions on Parallel Computing, Volume 6, Issue 4
Tsung Tai Yeh
PhD Graduate, 2020.
Tim Rogers
Associate Professor of ECE