Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper

2404.13195

Published 5/2/2024 by Junjie Li, Yinzhi Wang, Xiao Liang, Hang Liu

🚀

Abstract

Porting codes to GPU often requires major efforts. While several tools exist for automatically offload numerical libraries such as BLAS and LAPACK, they often prove impractical due to the high cost of mandatory data transfer. The new unified memory architecture in NVIDIA Grace-Hopper allows high bandwidth cache-coherent memory access of all memory from both CPU and GPU, potentially eliminating bottleneck faced in conventional architecture. This breakthrough opens up new avenues for application development and porting strategies. In this study, we introduce a new tool for automatic BLAS offload, the tool leverages the high speed cache coherent NVLink C2C interconnect in Grace-Hopper, and enables performant GPU offload for BLAS heavy applications with no code changes or recompilation. The tool was tested on two quantum chemistry or physics codes, great performance benefits were observed.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper explores the automatic offloading of Basic Linear Algebra Subprograms (BLAS) operations to the GPU on NVIDIA's Grace-Hopper unified memory architecture.
The researchers investigate the performance benefits and challenges of this automatic offloading approach.
The study provides insights into the optimization of offloading strategies for heterogeneous systems like the Grace-Hopper platform.

Plain English Explanation

The paper looks at a way to automatically offload certain math calculations from the CPU to the GPU on a new computer chip called the NVIDIA Grace-Hopper. The Grace-Hopper has a unified memory architecture, which means the CPU and GPU can easily share data. The researchers wanted to see if they could get a performance boost by automatically sending common math tasks, like matrix multiplication, to the more powerful GPU instead of running them on the CPU.

They tested this automatic offloading approach to see how well it worked and what challenges might come up. This gives insights into how to best optimize the use of both the CPU and GPU on this type of heterogeneous (mixed CPU and GPU) system to get the best performance.

Technical Explanation

The paper explores automatic BLAS offloading on the NVIDIA Grace-Hopper unified memory architecture. BLAS (Basic Linear Algebra Subprograms) are a collection of low-level routines for performing common vector and matrix operations. Offloading these BLAS operations from the CPU to the GPU can potentially provide significant performance benefits on heterogeneous systems like the Grace-Hopper.

The researchers investigate the challenges and opportunities of this automatic offloading approach. They conduct experiments to analyze the performance impacts and identify optimization strategies. The study provides insights into offloading to general-purpose compute units and the evaluation of programming models for heterogeneous systems.

The findings suggest that the automatic offloading can provide substantial speedups for certain BLAS operations, but also highlight the need for careful data management and synchronization to fully unleash the potential of the Grace-Hopper's heterogeneous architecture.

Critical Analysis

The paper presents a thorough investigation of the automatic BLAS offloading on the NVIDIA Grace-Hopper platform. However, the study is limited to a specific set of BLAS operations and workloads. The authors acknowledge that the performance benefits may vary for different types of applications and data patterns.

Additionally, the paper does not delve deeply into the trade-offs between offloading to the GPU and the overhead of data movement and synchronization. Further research is needed to understand the broader implications of this automatic offloading approach and its impact on a wider range of workloads and architectures.

Conclusion

The study on automatic BLAS offloading on the NVIDIA Grace-Hopper unified memory architecture provides valuable insights into the optimization of offloading strategies for heterogeneous systems. The findings demonstrate the potential performance benefits of automatically leveraging the GPU for common linear algebra operations, but also highlight the challenges around data management and synchronization that must be addressed to fully harness the capabilities of these emerging architectures.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Optimizing Offload Performance in Heterogeneous MPSoCs

Luca Colagrande, Luca Benini

Heterogeneous multi-core architectures combine a few host cores, optimized for single-thread performance, with many small energy-efficient accelerator cores for data-parallel processing, on a single chip. Offloading a computation to the many-core acceleration fabric introduces a communication and synchronization cost which reduces the speedup attainable on the accelerator, particularly for small and fine-grained parallel tasks. We demonstrate that by co-designing the hardware and offload routines, we can increase the speedup of an offloaded DAXPY kernel by as much as 47.9%. Furthermore, we show that it is possible to accurately model the runtime of an offloaded application, accounting for the offload overheads, with as low as 1% MAPE error, enabling optimal offload decisions under offload execution time constraints.

4/3/2024

cs.AR cs.DC

Taking GPU Programming Models to Task for Performance Portability

Joshua H. Davis, Pranav Sivaraman, Joy Kitson, Konstantinos Parasyris, Harshitha Menon, Isaac Minn, Giorgis Georgakoudis, Abhinav Bhatele

Portability is critical to ensuring high productivity in developing and maintaining scientific software as the diversity in on-node hardware architectures increases. While several programming models provide portability for diverse GPU platforms, they don't make any guarantees about performance portability. In this work, we explore several programming models -- CUDA, HIP, Kokkos, RAJA, OpenMP, OpenACC, and SYCL, to study if the performance of these models is consistently good across NVIDIA and AMD GPUs. We use five proxy applications from different scientific domains, create implementations where missing, and use them to present a comprehensive comparative evaluation of the programming models. We provide a Spack scripting-based methodology to ensure reproducibility of experiments conducted in this work. Finally, we attempt to answer the question -- to what extent does each programming model provide performance portability for heterogeneous systems in real-world usage?

5/22/2024

cs.DC cs.PF

UDON: A case for offloading to general purpose compute on CXL memory

Jon Hermes, Josh Minor, Minjun Wu, Adarsh Patil, Eric Van Hensbergen

Upcoming CXL-based disaggregated memory devices feature special purpose units to offload compute to near-memory. In this paper, we explore opportunities for offloading compute to general purpose cores on CXL memory devices, thereby enabling a greater utility and diversity of offload. We study two classes of popular memory intensive applications: ML inference and vector database as candidates for computational offload. The study uses Arm AArch64-based dual-socket NUMA systems to emulate CXL type-2 devices. Our study shows promising results. With our ML inference model partitioning strategy for compute offload, we can place up to 90% data in remote memory with just 20% performance trade-off. Offloading Hierarchical Navigable Small World (HNSW) kernels in vector databases can provide upto 6.87$times$ performance improvement with under 10% offload overhead.

4/4/2024

cs.ET

Evaluation of Programming Models and Performance for Stencil Computation on Current GPU Architectures

Baodi Shan, Mauricio Araya-Polo

Accelerated computing is widely used in high-performance computing. Therefore, it is crucial to experiment and discover how to better utilize GPUGPUs latest generations on relevant applications. In this paper, we present results and share insights about highly tuned stencil-based kernels for NVIDIA Ampere (A100) and Hopper (GH200) architectures. Performance results yield useful insights into the behavior of this type of algorithms for these new accelerators. This knowledge can be leveraged by many scientific applications which involve stencils computations. Further, evaluation of three different programming models: CUDA, OpenACC, and OpenMP target offloading is conducted on aforementioned accelerators. We extensively study the performance and portability of various kernels under each programming model and provide corresponding optimization recommendations. Furthermore, we compare the performance of different programming models on the mentioned architectures. Up to 58% performance improvement was achieved against the previous GPGPU's architecture generation for an highly optimized kernel of the same class, and up to 42% for all classes. In terms of programming models, and keeping portability in mind, optimized OpenACC implementation outperforms OpenMP implementation by 33%. If portability is not a factor, our best tuned CUDA implementation outperforms the optimized OpenACC one by 2.1x.

4/9/2024

cs.DC