TY - JOUR
T1 - CASH-RF
T2 - A Compiler-Assisted Hierarchical Register File in GPUs
AU - Oh, Yunho
AU - Jeong, Ipoom
AU - Ro, Won Woo
AU - Yoon, Myung Kuk
N1 - Publisher Copyright: © 2009-2012 IEEE.
PY - 2022/12/1
Y1 - 2022/12/1
N2 - Spin-transfer torque magnetic random-access memory (STT-MRAM) is an emerging nonvolatile memory technology that has received significant attention due to its higher density and lower leakage current than SRAM. One compelling use case is to employ STT-MRAM as a graphics processing unit (GPU) register file (RF) to reduce its massive energy consumption. One critical challenge is that STT-MRAM has longer access latency and higher dynamic power consumption than SRAM, which motivates a hierarchical RF that places a small SRAM-based register cache (RC) between the functional units and the STT-MRAM RF. The RC acts as a write buffer, so all writes to the RF are first performed on the RC. In the presence of a conflict miss, the RC writes back the corresponding cache line into the RF. In this work, we observe that a large number of such write-back operations are unnecessary because they include register values that are never used again. Leveraging this observation, we propose a compiler-assisted hierarchical RF in GPUs (CASH-RF) that optimizes STT-MRAM accesses by removing dead register values. In CASH-RF, unnecessary write-back operations are detected by the compiler via last-consumer analysis. At runtime, the corresponding RC lines are discarded after their last references without being written back to the RF. Compared to the baseline GPUs, CASH-RF removes 59.5% of write-back operations, which leads to 54.7% lower RF energy consumption with only 2.6% performance degradation.
AB - Spin-transfer torque magnetic random-access memory (STT-MRAM) is an emerging nonvolatile memory technology that has received significant attention due to its higher density and lower leakage current than SRAM. One compelling use case is to employ STT-MRAM as a graphics processing unit (GPU) register file (RF) to reduce its massive energy consumption. One critical challenge is that STT-MRAM has longer access latency and higher dynamic power consumption than SRAM, which motivates a hierarchical RF that places a small SRAM-based register cache (RC) between the functional units and the STT-MRAM RF. The RC acts as a write buffer, so all writes to the RF are first performed on the RC. In the presence of a conflict miss, the RC writes back the corresponding cache line into the RF. In this work, we observe that a large number of such write-back operations are unnecessary because they include register values that are never used again. Leveraging this observation, we propose a compiler-assisted hierarchical RF in GPUs (CASH-RF) that optimizes STT-MRAM accesses by removing dead register values. In CASH-RF, unnecessary write-back operations are detected by the compiler via last-consumer analysis. At runtime, the corresponding RC lines are discarded after their last references without being written back to the RF. Compared to the baseline GPUs, CASH-RF removes 59.5% of write-back operations, which leads to 54.7% lower RF energy consumption with only 2.6% performance degradation.
KW - compiler-assisted
KW - SRAM
KW - graphics processing units (GPUs)
KW - hierarchical register file (RF)
KW - spin-transfer torque magnetic random-access memory (STT-MRAM)
UR - http://www.scopus.com/inward/record.url?scp=85127517131&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127517131&partnerID=8YFLogxK
U2 - 10.1109/LES.2022.3163749
DO - 10.1109/LES.2022.3163749
M3 - Article
AN - SCOPUS:85127517131
SN - 1943-0663
VL - 14
SP - 187
EP - 190
JO - IEEE Embedded Systems Letters
JF - IEEE Embedded Systems Letters
IS - 4
ER -