Error resilience is the primary design concern for safety- and mission-critical applications. Redundant MultiThreading (RMT) is one of the most promising soft and hard error resilience strategies because it does not require additional hardware modification. While the state-of-the-art software RMT scheme can achieve a high degree of error protection, our detailed investigation revealed that it suffers from performance overhead and insufficient fault coverage. This paper proposes EXPERTISE, a compiler-level RMT scheme that can detect the manifestation of hardware faults in all processor components. The EXPERTISE transformation generates a checker-thread for the main execution thread. These redundant threads are executed simultaneously on two physically different cores of a multicore processor and perform almost the same computations. After each memory write operation is committed by the main-thread, the checker-thread loads back the written data from the memory and checks it against its own locally computed values. If they match, the execution continues. Otherwise, the error flag is raised. In order to evaluate the effectiveness of the proposed solution, we performed soft and hard error injection experiments on all the different hardware components of an ARM Cortex-A53-like μ-architecturally simulated microprocessor. Based on statistical fault injection campaigns, we have found that EXPERTISE provides 188× better fault coverage with 27% faster performance as compared to the state-of-the-art scheme.
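The load-back check described in the abstract can be illustrated with a small software simulation. This is only a sketch under stated assumptions, not the EXPERTISE compiler transformation: the names `compute`, `run`, and `fault_at` are hypothetical, the "memory" is a Python list, and Python threads stand in for the two physically separate cores. The main thread stores a value and signals that the store has committed; the checker thread recomputes the address and value on its own, loads the stored data back, and records a mismatch if the two differ.

```python
import threading


def compute(i):
    # Hypothetical stand-in for the program's real computation.
    return i * i + 1


def run(n, fault_at=None):
    """Simulate n protected stores; return addresses where the check failed."""
    memory = [0] * n
    committed = threading.Semaphore(0)  # counts stores committed so far
    mismatches = []

    def main_thread():
        for addr in range(n):
            value = compute(addr)
            if addr == fault_at:
                value ^= 1 << 3          # injected single-bit flip (assumption)
            memory[addr] = value         # the store being protected
            committed.release()          # signal: this store has committed

    def checker_thread():
        for addr in range(n):            # address recomputed independently
            expected = compute(addr)     # redundant, locally computed value
            committed.acquire()          # wait until the store has committed
            if memory[addr] != expected:  # load back and compare
                mismatches.append(addr)  # stands in for raising the error flag

    threads = [threading.Thread(target=main_thread),
               threading.Thread(target=checker_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mismatches
```

Calling `run(8)` models a fault-free execution, while `run(8, fault_at=5)` corrupts the value stored at address 5 so that the checker's comparison fails for that store.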
| Field | Value |
|---|---|
| Journal | ACM Transactions on Architecture and Code Optimization |
| Publication status | Published - 2022 Sept 16 |
Bibliographical note
Funding Information:
This work was partially supported by funding from National Science Foundation Grants No. CNS 1525855, CPS 1646235, CCF 1723476 - the NSF/Intel joint research center for Computer Assisted Programming for Heterogeneous Architectures (CAPA), Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2021-0-00155, Context and Activity Analysis-based Solution for Safe Childcare), National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2022-00165225), and Samsung Electronics Co., Ltd. (FOUNDRY-202108DD007F).
© 2022 Copyright held by the owner/author(s).
All Science Journal Classification (ASJC) codes
- Information Systems
- Hardware and Architecture