PaScaL_TDMA 2.0: A multi-GPU-based library for solving massive tridiagonal systems

Mingyu Yang, Ji Hoon Kang, Ki Ha Kim, Oh Kyoung Kwon, Jung Il Choi

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

We introduce PaScaL_TDMA 2.0, an updated version of the library originally designed for the efficient computation of batched tridiagonal systems, now capable of exploiting multi-GPU environments. The library extends its functionality to include GPU support, minimizing CPU-GPU data transfer by using device-resident memory while retaining the original CPU-based capabilities. It employs pipelined copying through shared memory for low-latency memory access and incorporates CUDA-aware MPI for efficient multi-GPU communication. Our GPU implementation demonstrates substantially higher computational performance than the original CPU implementation while consuming much less energy. In summary, this updated version presents a time-efficient and energy-saving approach for solving batched tridiagonal systems on modern computing platforms, including both GPUs and CPUs.

New version program summary

Program Title: PaScaL_TDMA 2.0
CPC Library link to program files: https://doi.org/10.17632/49z6fh94z3.2
Developer's repository link: https://github.com/MPMC-Lab/PaScaL_TDMA
Licensing provisions: MIT
Programming language: CUDA Fortran. The program was tested using NVIDIA HPC SDK 22.7.
Journal reference of previous version: Comput. Phys. Commun. 260 (2021) 107722
Does the new version supersede the previous version?: Yes

Reasons for the new version: This version supports multi-GPU acceleration for solving batched tridiagonal systems of equations using the modified Thomas algorithm originally implemented in the PaScaL_TDMA library. CUDA Fortran is used for the current implementation of PaScaL_TDMA to exploit GPU-specific features such as shared memory and CUDA-aware MPI.

Summary of revisions: PaScaL_TDMA 2.0 is a versatile library designed to solve many tridiagonal systems arising in multi-dimensional partial differential equations on both CPU and GPU platforms. It builds upon the original CPU version of the PaScaL_TDMA library proposed by Kim et al. [1] and extends its functionality to enable GPU acceleration. The updated library includes several modifications that enhance its performance on multi-GPU platforms. First, all variables of the tridiagonal matrix algorithm (TDMA) in the GPU implementation reside in device memory, minimizing transfers between the host and the GPU devices. Second, to accelerate GPU computation, CUDA kernels were incorporated into the loop structure of the existing algorithm, using pipelined copies through shared memory during the forward-elimination and backward-substitution steps of the TDMA (a simplified sketch of this staging idea is given below). By minimizing global memory accesses, PaScaL_TDMA 2.0 significantly improves performance. Furthermore, the library implements CUDA-aware MPI communication, thereby increasing parallel efficiency; this enables fast communication on systems with interconnects such as NVLink that support direct GPU-to-GPU transfers. Finally, when only a single MPI process is involved and no domain partitioning is required, the sequential Thomas algorithm [2] is employed instead of the PaScaL_TDMA algorithm in both the CPU and GPU codes, avoiding unnecessary overhead.
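To make the shared-memory staging idea above concrete, the following is a minimal CUDA Fortran sketch. It is not the library's actual kernel or API: the module, kernel, and parameter names (tdma_batch_sketch, thomas_batched, BLOCK, TILE) are hypothetical, each thread simply owns one tridiagonal system, and the staging is a plain tile-by-tile copy rather than the pipelined scheme used in PaScaL_TDMA 2.0.

module tdma_batch_sketch
  use cudafor
  implicit none
  integer, parameter :: BLOCK = 128   ! threads per block (one tridiagonal system per thread)
  integer, parameter :: TILE  = 8     ! coefficient rows staged in shared memory per pass
contains
  attributes(global) subroutine thomas_batched(a, b, c, d, x, n_sys, n_row)
    ! a, b, c: sub-, main-, and super-diagonals; d: right-hand sides; x: solutions.
    ! System i occupies row i of each array so that loads across threads coalesce.
    integer, value  :: n_sys, n_row
    real(8), device :: a(n_sys, n_row), b(n_sys, n_row), c(n_sys, n_row)
    real(8), device :: d(n_sys, n_row), x(n_sys, n_row)
    real(8), shared :: sa(BLOCK, TILE), sb(BLOCK, TILE)
    real(8), shared :: sc(BLOCK, TILE), sd(BLOCK, TILE)
    integer :: i, j, j0, jj, nt, tid
    real(8) :: cp, dp, r

    tid = threadIdx%x
    i   = (blockIdx%x - 1) * blockDim%x + tid
    cp  = 0.0d0
    dp  = 0.0d0

    ! Forward elimination: stage TILE coefficient rows in shared memory, then eliminate them.
    do j0 = 1, n_row, TILE
      nt = min(TILE, n_row - j0 + 1)
      if (i <= n_sys) then
        do jj = 1, nt
          sa(tid, jj) = a(i, j0 + jj - 1)
          sb(tid, jj) = b(i, j0 + jj - 1)
          sc(tid, jj) = c(i, j0 + jj - 1)
          sd(tid, jj) = d(i, j0 + jj - 1)
        end do
      end if
      call syncthreads()
      if (i <= n_sys) then
        do jj = 1, nt
          j  = j0 + jj - 1
          r  = 1.0d0 / (sb(tid, jj) - sa(tid, jj) * cp)
          cp = sc(tid, jj) * r
          dp = (sd(tid, jj) - sa(tid, jj) * dp) * r
          c(i, j) = cp          ! keep modified coefficients for the backward pass
          d(i, j) = dp
        end do
      end if
      call syncthreads()
    end do

    ! Backward substitution.
    if (i <= n_sys) then
      x(i, n_row) = d(i, n_row)
      do j = n_row - 1, 1, -1
        x(i, j) = d(i, j) - c(i, j) * x(i, j + 1)
      end do
    end if
  end subroutine thomas_batched
end module tdma_batch_sketch

program demo_thomas_batched
  use cudafor
  use tdma_batch_sketch
  implicit none
  integer, parameter :: n_sys = 1024, n_row = 256
  real(8), allocatable         :: ah(:,:), bh(:,:), ch(:,:), dh(:,:), xh(:,:)
  real(8), device, allocatable :: ad(:,:), bd(:,:), cd(:,:), dd(:,:), xd(:,:)
  integer :: istat

  allocate(ah(n_sys, n_row), bh(n_sys, n_row), ch(n_sys, n_row), dh(n_sys, n_row), xh(n_sys, n_row))
  allocate(ad(n_sys, n_row), bd(n_sys, n_row), cd(n_sys, n_row), dd(n_sys, n_row), xd(n_sys, n_row))
  ah = -1.0d0;  bh = 2.0d0;  ch = -1.0d0;  dh = 1.0d0   ! simple diagonally dominant test systems
  ad = ah;  bd = bh;  cd = ch;  dd = dh                 ! host-to-device copies by assignment

  call thomas_batched<<<(n_sys + BLOCK - 1) / BLOCK, BLOCK>>>(ad, bd, cd, dd, xd, n_sys, n_row)
  istat = cudaDeviceSynchronize()

  xh = xd                                               ! device-to-host copy
  print *, 'x(1, n_row/2) =', xh(1, n_row/2)
end program demo_thomas_batched

The tile and thread-block sizes are compile-time constants here for simplicity; the static shared-memory footprint per block is 4 x BLOCK x TILE x 8 bytes, which must stay within the device limit.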
We evaluated the computational performance and energy efficiency of the GPU implementation of PaScaL_TDMA 2.0 on the NEURON cluster at the Korea Institute of Science and Technology Information (KISTI), in which each compute node consists of two AMD EPYC 7543 processors (hosts) and eight NVLink-connected NVIDIA A100 GPUs (devices). The results were compared with those obtained on the NURION cluster at KISTI, which features one Intel Xeon Phi 7250 Knights Landing (KNL) processor per compute node. Intel oneAPI 22.2 [3] and NVIDIA HPC SDK 22.7 [4] were used to compile PaScaL_TDMA 2.0 on the NURION and NEURON clusters, respectively. In our evaluation, 64 cores per CPU were used for the KNL configuration, whereas for the AMD configuration the number of cores matched the number of GPUs. The computational domain is decomposed across the cores using the method proposed by Kim et al. [1]. Figure 1(a) presents wall-clock times versus the number of CPUs/GPUs for two grid sizes, 512³ and 1024³; all results show strong scalability, regardless of grid size or CPU/GPU type. Remarkably, the computational performance of the A100 GPUs surpassed that of the KNL many-core CPUs, achieving average speedups of 4.34x and 6.43x with 512³ and 1024³ grid points, respectively. The GPU implementation exhibits strong scalability even beyond eight GPUs, which implies that it handles internode communication effectively. Figure 1(b) shows the energy consumed by the KNL CPUs and the A100 GPUs when solving tridiagonal systems with grid sizes of 512³ and 1024³; for the A100 results, the energy consumed by the AMD EPYC host CPUs is also plotted. Energy consumption was evaluated as the time integral of the instantaneous power, measured using the turbostat utility [5] for the CPUs and the NVIDIA Management Library (NVML) [6] for the GPUs. For 512³ grid points, execution on the A100 GPUs consumed only 8.5% of the energy required by the KNL CPUs, an 11.8x improvement in energy efficiency. A consistent trend was observed for 1024³ grid points, where the A100 GPUs were 11.6x more energy-efficient, requiring 8.6% of the energy used by the KNL CPUs. These findings highlight the benefits of this upgraded version on GPU clusters, in terms of both computational performance and energy consumption, compared with the original version.

Nature of problem: This library solves batched tridiagonal systems arising from multi-dimensional partial differential equations.

Solution method: A divide-and-conquer approach is employed to solve partitioned tridiagonal systems of equations on distributed-memory systems. The modified Thomas algorithm is applied to the partitioned submatrices to transform them into modified forms, from which reduced tridiagonal systems are constructed through all-to-all communication (a minimal sketch of this device-to-device communication pattern is given below). The reduced tridiagonal systems are solved using the sequential Thomas algorithm, and the solutions are distributed back to update the remaining unknowns in the partitioned systems. The detailed computational procedures are described in Kim et al. [1], and all procedures were implemented using CUDA Fortran in this updated version.

References:
[1] K.-H. Kim, J.-H. Kang, X. Pan, J.-I. Choi, Comput. Phys. Commun. 260 (2021) 107722, https://doi.org/10.1016/j.cpc.2020.107722.
[2] L.H. Thomas, Watson Sci. Comput. Lab. Rept., Columbia University, New York 1 (1949) 71.
[3] https://software.intel.com/content/www/us/en/develop/tools/oneapi.html.
[4] https://docs.nvidia.com/hpc-sdk/index.html.
[5] https://github.com/torvalds/linux/tree/master/tools/power/x86/turbostat.
[6] https://docs.nvidia.com/deploy/nvml-api.
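To illustrate the CUDA-aware all-to-all communication referred to above, here is a minimal, hypothetical CUDA Fortran sketch. It is not part of the PaScaL_TDMA API: the program and variable names (cuda_aware_alltoall_sketch, send_d, recv_d, n_per_rank) are placeholders, and it assumes an MPI library built with CUDA support (e.g., the CUDA-aware Open MPI shipped with the NVIDIA HPC SDK). Device-resident buffers are handed directly to MPI_Alltoall, so the data can move GPU-to-GPU (e.g., over NVLink) without explicit staging through host memory.

program cuda_aware_alltoall_sketch
  use cudafor
  use mpi
  implicit none
  integer, parameter :: n_per_rank = 4          ! reduced-system rows exchanged with each rank (placeholder)
  real(8), allocatable         :: send_h(:), recv_h(:)
  real(8), device, allocatable :: send_d(:), recv_d(:)
  integer :: ierr, rank, nprocs

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate(send_h(n_per_rank * nprocs), recv_h(n_per_rank * nprocs))
  allocate(send_d(n_per_rank * nprocs), recv_d(n_per_rank * nprocs))
  send_h = dble(rank)            ! placeholder payload
  send_d = send_h                ! host-to-device copy by assignment

  ! Device buffers go straight into MPI; a CUDA-aware MPI library moves the data
  ! GPU-to-GPU (e.g., over NVLink) without an explicit copy through host memory.
  call MPI_Alltoall(send_d, n_per_rank, MPI_DOUBLE_PRECISION, &
                    recv_d, n_per_rank, MPI_DOUBLE_PRECISION, &
                    MPI_COMM_WORLD, ierr)

  recv_h = recv_d                ! copy back only to print the result
  if (rank == 0) print *, 'rank 0 received:', recv_h

  call MPI_Finalize(ierr)
end program cuda_aware_alltoall_sketch

Run with one MPI rank per GPU (each rank binding to its local device, e.g., via cudaSetDevice), this is the kind of all-to-all exchange used to gather the reduced tridiagonal systems; compiling with the HPC SDK MPI wrappers (e.g., mpif90 -cuda) enables both CUDA Fortran and the CUDA-aware transport.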

Original language: English
Article number: 108785
Journal: Computer Physics Communications
Volume: 290
DOIs
Publication status: Published - 2023 Sept

Bibliographical note

Publisher Copyright:
© 2023 Elsevier B.V.

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
  • General Physics and Astronomy
