Orchestrating Large-Scale SpGEMMs using Dynamic Block Distribution and Data Transfer Minimization on Heterogeneous Systems

Taehyeong Park, Seokwon Kang, Myung Hwan Jang, Sang Wook Kim, Yongjun Park

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Sparse general matrix-matrix multiplication (SpGEMM) is a major kernel in various emerging applications, such as database management systems, deep learning, graph analysis, and recommendation systems. Since SpGEMM requires extensive computation, many SpGEMM techniques have been implemented on graphics processing units (GPUs) to fully exploit their massive data parallelism. However, traditional SpGEMM techniques usually do not fully utilize the GPU because most non-zero elements of the target sparse matrices are concentrated in a few hub nodes, while non-hub nodes contain barely any non-zero elements. This power-law distribution significantly degrades performance through load imbalance between GPU cores and low utilization within each core. Many recent implementations attempt to solve this problem with smart pre-/post-processing; however, the net performance hardly improves and sometimes even deteriorates owing to the large overheads. Additionally, non-hub nodes are inherently unsuitable for GPU computing, even after optimization. Furthermore, owing to the rapid growth in GPU computing power and input data size, performance is no longer dominated by kernel execution but by data transfers such as device-to-host (D2H) transfers and file I/O.

Therefore, this work proposes the Dynamic Block Distributor (DBD), a novel full-system-level SpGEMM orchestration framework for heterogeneous systems that improves overall performance by enabling efficient CPU-GPU collaboration and minimizing data-transfer overhead between all system elements. The framework first divides the target matrix into smaller blocks and then offloads the computation of each block to the more suitable computing unit, GPU or CPU, based on its workload type and the status of resource utilization at runtime. It further reduces data-transfer overhead with simple but effective techniques: Row Collecting, I/O Overlapping, and I/O Binding. Our experiments showed that the framework improves the performance of SpGEMM execution, covering both kernel execution and D2H transfers, by 3.24x on average, and overall execution performance by 2.07x on average, over the baseline cuSPARSE library.
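The orchestration pattern the abstract describes, per-block routing between CPU and GPU combined with overlapped transfers, can be sketched in a few lines of CUDA C++. The sketch below is illustrative only and is not the paper's implementation: the block layout, the nnz-based dispatch heuristic, and names such as spgemm_block_kernel and kGpuNnzThreshold are hypothetical stand-ins, and the real per-tile SpGEMM is stubbed out.

```cuda
// Minimal sketch (not the paper's code): per-block CPU/GPU dispatch with
// transfer/compute overlap on a CUDA stream. The per-tile SpGEMM is stubbed
// out; only the orchestration pattern from the abstract is illustrated.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

struct Block {           // one tile of the sparse matrix (simplified payload)
    int    nnz;          // non-zeros in this tile
    float* host_data;    // tile payload on the host
    int    size;         // number of floats in the payload
};

__global__ void spgemm_block_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;            // stand-in for the real per-tile SpGEMM
}

void spgemm_block_cpu(Block& b) {
    for (int i = 0; i < b.size; ++i) b.host_data[i] *= 2.0f;  // CPU-side fallback
}

int main() {
    const int kGpuNnzThreshold = 1024;     // hypothetical dispatch heuristic
    const int kMaxBlockFloats  = 4096;

    std::vector<float> dense_tile(4096, 1.0f), sparse_tile(64, 1.0f);
    std::vector<Block> blocks = {
        {4096, dense_tile.data(),  4096},  // hub-like tile   -> GPU
        {16,   sparse_tile.data(), 64},    // near-empty tile -> CPU
    };

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    float* dev = nullptr;
    cudaMalloc(&dev, kMaxBlockFloats * sizeof(float));  // reused per GPU block

    for (Block& b : blocks) {
        if (b.nnz >= kGpuNnzThreshold) {
            // Hub-like tiles go to the GPU: H2D copy, kernel, and D2H copy are
            // enqueued on one stream so they overlap with the CPU work below.
            // (Pinned host memory would be needed for fully async copies.)
            cudaMemcpyAsync(dev, b.host_data, b.size * sizeof(float),
                            cudaMemcpyHostToDevice, stream);
            int threads = 256, grid = (b.size + threads - 1) / threads;
            spgemm_block_kernel<<<grid, threads, 0, stream>>>(dev, b.size);
            cudaMemcpyAsync(b.host_data, dev, b.size * sizeof(float),
                            cudaMemcpyDeviceToHost, stream);
        } else {
            // Sparse, non-hub tiles run on the CPU while the GPU stream drains.
            spgemm_block_cpu(b);
        }
    }
    cudaStreamSynchronize(stream);
    cudaFree(dev);
    cudaStreamDestroy(stream);
    printf("done\n");
    return 0;
}
```

Note that this static nnz threshold is only a placeholder: per the abstract, DBD makes the dispatch decision dynamically from the block's workload type and the runtime resource-utilization status.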

Original language: English
Title of host publication: Proceedings - 2023 IEEE 39th International Conference on Data Engineering, ICDE 2023
Publisher: IEEE Computer Society
Pages: 2456-2459
Number of pages: 4
ISBN (Electronic): 9798350322279
DOIs
Publication status: Published - 2023
Event: 39th IEEE International Conference on Data Engineering, ICDE 2023 - Anaheim, United States
Duration: 2023 Apr 3 – 2023 Apr 7

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2023-April
ISSN (Print): 1084-4627

Conference

Conference: 39th IEEE International Conference on Data Engineering, ICDE 2023
Country/Territory: United States
City: Anaheim
Period: 23/4/3 – 23/4/7

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Information Systems

