TY - JOUR
T1 - SHREG
T2 - Mitigating register redundancy in GPUs
AU - Jin, Seunghyun
AU - Lee, Hyunwuk
AU - Lee, Jonghyun
AU - Kim, Junsung
AU - Ro, Won Woo
N1 - Publisher Copyright:
© 2024
PY - 2024/7
Y1 - 2024/7
N2 - Graphics Processing Units (GPUs) have become the dominant accelerators for Machine Learning (ML) and High-Performance Computing (HPC) applications thanks to their massive parallelism, which is exploited largely through general matrix-to-matrix multiplication (GEMM) kernels. However, GEMM kernels often suffer from duplicated memory requests, mainly caused by the matrix tiling used to handle large matrices. Although GPUs provide programmable shared memory to mitigate this issue by keeping frequently reused data on chip, GEMM still introduces duplication in the register files. Our observations show that matrix tiling issues memory requests to the same shared memory address for neighboring threads, resulting in a substantial amount of duplicated data in the register files. Such duplication degrades GPU performance by limiting warp-level parallelism through register shortage and by generating redundant memory requests to shared memory. We find that the data duplication can be categorized into two types, each occurring with a fixed pattern during matrix tiling. Based on these observations, we introduce SHREG, an architecture design that enables different threads to share registers holding overlapped data from shared memory, effectively reducing duplicated data within the register files. By leveraging the duplication patterns, SHREG realizes register sharing and improves performance with minimal hardware overhead. Our evaluation shows that SHREG improves performance by 31.4% over the baseline GPU on various ML applications.
AB - Graphics Processing Units (GPUs) have become the dominant accelerators for Machine Learning (ML) and High-Performance Computing (HPC) applications thanks to their massive parallelism, which is exploited largely through general matrix-to-matrix multiplication (GEMM) kernels. However, GEMM kernels often suffer from duplicated memory requests, mainly caused by the matrix tiling used to handle large matrices. Although GPUs provide programmable shared memory to mitigate this issue by keeping frequently reused data on chip, GEMM still introduces duplication in the register files. Our observations show that matrix tiling issues memory requests to the same shared memory address for neighboring threads, resulting in a substantial amount of duplicated data in the register files. Such duplication degrades GPU performance by limiting warp-level parallelism through register shortage and by generating redundant memory requests to shared memory. We find that the data duplication can be categorized into two types, each occurring with a fixed pattern during matrix tiling. Based on these observations, we introduce SHREG, an architecture design that enables different threads to share registers holding overlapped data from shared memory, effectively reducing duplicated data within the register files. By leveraging the duplication patterns, SHREG realizes register sharing and improves performance with minimal hardware overhead. Our evaluation shows that SHREG improves performance by 31.4% over the baseline GPU on various ML applications.
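N1 - Editorial note (illustration, not from the source): the duplication described in the abstract arises in a conventional tiled GEMM kernel, where every thread with the same threadIdx.y rereads the same element of the staged A tile and every thread with the same threadIdx.x rereads the same element of the staged B tile, each copy landing in a private register. A minimal CUDA sketch of that access pattern follows; the tile size, kernel name, and the assumption that N is a multiple of the tile size are chosen here for clarity only.
     #define TILE 16
     // Computes C = A * B for square N x N matrices (N assumed to be a multiple of TILE).
     __global__ void tiled_gemm(const float *A, const float *B, float *C, int N) {
         __shared__ float As[TILE][TILE];  // tile of A staged in shared memory
         __shared__ float Bs[TILE][TILE];  // tile of B staged in shared memory
         int row = blockIdx.y * TILE + threadIdx.y;
         int col = blockIdx.x * TILE + threadIdx.x;
         float acc = 0.0f;
         for (int t = 0; t < N / TILE; ++t) {
             // Each thread stages one element of A and one of B into shared memory.
             As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
             Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
             __syncthreads();
             for (int k = 0; k < TILE; ++k) {
                 // Neighboring threads with the same threadIdx.y read the same
                 // As[threadIdx.y][k], and threads with the same threadIdx.x read the
                 // same Bs[k][threadIdx.x]; each read lands in a per-thread register.
                 // These two fixed patterns are the register duplication SHREG targets.
                 acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
             }
             __syncthreads();
         }
         C[row * N + col] = acc;
     }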
KW - Data reuse
KW - Graphics Processing Unit (GPU)
KW - Register file
KW - Shared memory
UR - http://www.scopus.com/inward/record.url?scp=85191968653&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85191968653&partnerID=8YFLogxK
U2 - 10.1016/j.sysarc.2024.103152
DO - 10.1016/j.sysarc.2024.103152
M3 - Article
AN - SCOPUS:85191968653
SN - 1383-7621
VL - 152
JO - Journal of Systems Architecture
JF - Journal of Systems Architecture
M1 - 103152
ER -