TY - JOUR
T1 - A convertible neural processor supporting adaptive quantization for real-time neural networks
AU - Kal, Hongju
AU - Choi, Hyoseong
AU - Jeong, Ipoom
AU - Yang, Joon Sung
AU - Ro, Won Woo
N1 - Publisher Copyright:
© 2023
PY - 2023/12
Y1 - 2023/12
N2 - This paper presents Stimpack, a responsive approach toward an adaptive neural processing unit (NPU) that aims to satisfy service-level objectives (SLOs) under heavy neural network (NN) inference loads. Initially, Stimpack operates in base-mode processing and computes NNs as a conventional NPU accelerator would. During base-mode processing, if an SLO violation appears imminent, Stimpack switches to burst-mode processing. In burst mode, Stimpack computes quantized networks instead of the original ones, scaling computational throughput up to twice that of base mode. This switchable processing is enabled by three hardware/software schemes. First, a reconfigurable core supports two different precisions and boosts processing throughput to avoid SLO violations. Because computation resources are shared between the two operation modes, the area overhead of the reconfigurable core is negligible. Second, an on-chip quantization unit mitigates the data transfer overhead incurred by mode switching: it quantizes parameters stored in on-chip memory on the fly instead of fetching quantized parameters from off-chip memory. Third, Stimpack employs a scheduler that decides when to switch modes based on the server's workload. By monitoring ongoing and queued NPU requests, the scheduler activates burst-mode processing conservatively to minimize accuracy loss. Our analysis shows that, compared with a state-of-the-art NPU, Stimpack achieves a 48.4% speedup and sustains a 41.4% larger load on average while satisfying SLOs and maintaining near-ideal accuracy.
KW - Approximate computing
KW - Neural network
KW - Neural processing unit
KW - Quantization
KW - Service-level agreement
UR - http://www.scopus.com/inward/record.url?scp=85176910301&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85176910301&partnerID=8YFLogxK
U2 - 10.1016/j.sysarc.2023.103025
DO - 10.1016/j.sysarc.2023.103025
M3 - Article
AN - SCOPUS:85176910301
SN - 1383-7621
VL - 145
JO - Journal of Systems Architecture
JF - Journal of Systems Architecture
M1 - 103025
ER -