Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming

Qiumin Xu, Hyeran Jeon, Keunsoo Kim, Won Woo Ro, Murali Annavaram

Research output: Chapter in Book/Report/Conference proceedingConference contribution

86 Citations (Scopus)


As technology scales, GPUs are forecasted to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. But even with the best effort, exposing massive thread-level parallelism from a single GPU kernel, particularly from general purpose applications, is going to be a difficult challenge. In some cases, even if there is sufficient thread-level parallelism in a kernel, there may not be enough available memory bandwidth to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve the resource underutilization issue. There is a growing hardware support for multiprogramming on GPUs. Hyper-Q has been introduced in the Kepler architecture which enables multiple kernels to be invoked via tens of hardware queue streams. Spatial multitasking has been proposed to partition GPU resources across multiple kernels. But the partitioning is done at the coarse granularity of streaming multiprocessors (SMs) where each kernel is assigned to a subset of SMs. In this paper, we advocate for partitioning a single SM across multiple kernels, which we term as intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to concurrently run multiple kernels on the SM. Our results show that there is not one intra-SM slicing strategy that derives the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method for calculating the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernel's performance varies as more thread blocks from each kernel are assigned to an SM. The model takes into account the interference effect of shared resource usage across multiple kernels. The model is also computationally efficient and can determine the resource partitioning quickly to enable dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.

Original languageEnglish
Title of host publicationProceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016
PublisherInstitute of Electrical and Electronics Engineers Inc.
Number of pages13
ISBN (Electronic)9781467389471
Publication statusPublished - 2016 Aug 24
Event43rd International Symposium on Computer Architecture, ISCA 2016 - Seoul, Korea, Republic of
Duration: 2016 Jun 182016 Jun 22

Publication series

NameProceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016


Other43rd International Symposium on Computer Architecture, ISCA 2016
Country/TerritoryKorea, Republic of

Bibliographical note

Publisher Copyright:
© 2016 IEEE.

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture


Dive into the research topics of 'Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming'. Together they form a unique fingerprint.

Cite this