Prediction-based error correction for GPU reliability with low overhead

Hyunyul Lim, Tae Hyun Kim, Sungho Kang

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Scientific and simulation applications are continuously gaining importance in many research fields and industries. These applications require massive amounts of memory and substantial arithmetic computation. Therefore, general-purpose computing on graphics processing units (GPGPU), which combines the computing power of graphics processing units (GPUs) and general CPUs, has been used for computationally intensive scientific and big data processing applications. Because current GPU architectures lack hardware support for error detection in computation logic, GPGPU has low reliability. Unlike in graphics applications, errors in general-purpose computing applications can lead to serious problems. These applications are often intertwined with human life, meaning that errors can be life-threatening. Therefore, this paper proposes Prediction-based Error Correction (PRECOR), a novel method for GPU reliability that detects and corrects errors in GPGPU platforms with a focus on errors in computational elements. The implementation of the proposed architecture needs only a small number of checkpoint buffers to fix errors in computational logic. The PRECOR architecture has prediction buffers and controller units for predicting erroneous outputs before performing rollback. Following a rollback, the architecture confirms the accuracy of its predictions. The proposed method effectively reduces the hardware and time overheads required to correct errors. Experimental results confirm that PRECOR efficiently fixes errors with low hardware and time overheads.

Original language: English
Article number: 1849
Pages (from-to): 1-18
Number of pages: 18
Journal: Electronics (Switzerland)
Volume: 9
Issue number: 11
Publication status: Published - 2020 Nov

Bibliographical note

Publisher Copyright:
© 2020 by the authors. Licensee MDPI, Basel, Switzerland.

All Science Journal Classification (ASJC) codes

  • Control and Systems Engineering
  • Signal Processing
  • Hardware and Architecture
  • Computer Networks and Communications
  • Electrical and Electronic Engineering
