gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

Read original: arXiv:2308.05199 - Published 5/8/2024 by Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu and 4 others

Overview

GPU-aware collective communication has become a major bottleneck as GPU computing power rapidly rises
Traditional approaches that directly integrate lossy compression into GPU-aware collectives can lead to performance issues and uncontrolled data distortion
This paper proposes a new framework called gZCCL that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation

Plain English Explanation

As modern computing platforms have become more powerful, with GPUs playing a central role, the way different components of these systems communicate with each other has become a major challenge. Traditionally, researchers have tried to address this by directly integrating lossy compression techniques into the communication protocols used between GPUs. However, this can result in serious performance problems, such as underutilized GPUs and unpredictable errors in the data being transmitted.

To tackle these issues, the authors of this paper have developed a new framework called gZCCL. This framework is designed to optimize the communication between GPUs in a way that maintains high accuracy while also achieving significant performance improvements. The key idea is to carefully control the amount of compression and error introduced, so that the overall system remains stable and reliable.
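To make the idea of controlled compression error concrete, here is a minimal Python sketch of an error-bounded quantizer. It is a generic uniform scalar quantizer chosen for illustration, not the compressor used by gZCCL, and the names quantize, dequantize, and error_bound are assumptions made for this example. The property it shows is that every reconstructed value stays within a user-chosen absolute error bound of the original.

```python
def quantize(values, error_bound):
    """Map each float to an integer code (illustrative quantizer, not gZCCL's).

    The bin width is 2 * error_bound, so rounding to the nearest bin
    moves a value by at most error_bound.
    """
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes, error_bound):
    """Reconstruct approximate floats from the integer codes."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


# Hypothetical input data and bound, chosen only for the demonstration.
original = [0.1234, -5.678, 3.14159, 0.0]
bound = 0.01
restored = dequantize(quantize(original, bound), bound)
worst_case = max(abs(a - b) for a, b in zip(original, restored))
```

In this sketch worst_case never exceeds bound (up to floating-point rounding), which is the kind of guarantee an error-bounded lossy compressor offers before any collective is involved.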

The researchers evaluated their gZCCL framework using real-world applications and datasets running on up to 512 NVIDIA A100 GPUs. Their results show that the gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform the widely-used NCCL and Cray MPI libraries by up to 4.5X and 28.7X, respectively. Furthermore, their accuracy evaluation with an image-stacking application confirmed that the gZCCL framework is able to maintain high-quality reconstructed data, despite the use of compression.
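A collective data movement operation such as Scatter only moves data, so each value passes through lossy compression once. The single-process Python simulation below sketches that pattern under stated assumptions: the quantizer is a stand-in for a real GPU compressor, no network or GPU is involved, and the name compressed_scatter is hypothetical rather than part of gZCCL.

```python
def compress(values, error_bound):
    """Stand-in lossy compressor: uniform quantization to integer codes."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def decompress(codes, error_bound):
    """Inverse of the stand-in compressor."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def compressed_scatter(root_buffer, num_ranks, error_bound):
    """Simulate Scatter: the root compresses one chunk per rank,
    and each rank decompresses what it receives."""
    if len(root_buffer) % num_ranks != 0:
        raise ValueError("buffer length must be divisible by num_ranks")
    chunk_len = len(root_buffer) // num_ranks
    received = []
    for rank in range(num_ranks):
        chunk = root_buffer[rank * chunk_len:(rank + 1) * chunk_len]
        wire_data = compress(chunk, error_bound)   # what would cross the network
        received.append(decompress(wire_data, error_bound))
    return received


# Hypothetical example: 16 values scattered to 4 ranks.
per_rank = compressed_scatter([i * 0.01 for i in range(16)], 4, 1e-3)
```

Because no arithmetic is applied to the received data, the error at every rank stays within the single compression bound. Collective computation such as Allreduce is different, since compressed partial results are combined and their errors can add up.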

Technical Explanation

The paper introduces a new framework called gZCCL that aims to address the performance and accuracy challenges of using lossy compression in GPU-aware collective communication. The key components of the gZCCL framework are:

Accuracy-aware design: gZCCL is built to control how the error introduced by lossy compression propagates through a collective, so that the reconstructed data maintains high quality.
GPU-aware optimizations: The framework is designed to keep GPU devices well utilized when compression is added to a collective, avoiding the underutilization that the authors attribute to directly integrating lossy compression into GPU-aware collectives.
Support for various collectives: gZCCL supports both collective computation (e.g., Allreduce) and collective data movement (e.g., Scatter) operations, allowing it to be broadly applicable to a wide range of parallel computing workloads.
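The error-propagation concern in the first item can be illustrated with a small simulation. The sketch below is not the gZCCL algorithm: it assumes a simple chain-style sum in which the partial result is lossily compressed before every hop, and it uses a hypothetical budgeting rule that divides a target end-to-end error evenly across the hops. The quantizer and all function names are assumptions made for the example.

```python
import random


def lossy_roundtrip(values, error_bound):
    """Compress then decompress with a stand-in quantizer.
    Each value moves by at most error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) * step for v in values]


def chain_allreduce_sum(rank_buffers, target_error):
    """Sum one buffer per rank along a chain, compressing before each hop.

    Hypothetical budgeting rule: every hop gets target_error / hops,
    so the accumulated error stays within target_error.
    """
    hops = len(rank_buffers) - 1
    partial = list(rank_buffers[0])
    if hops == 0:
        return partial                      # single rank: nothing is sent
    per_hop_bound = target_error / hops
    for local in rank_buffers[1:]:
        sent = lossy_roundtrip(partial, per_hop_bound)  # error added on this hop
        partial = [s + x for s, x in zip(sent, local)]
    return partial


# Hypothetical workload: 8 ranks, 1000 values each.
random.seed(0)
buffers = [[random.uniform(-1, 1) for _ in range(1000)] for _ in range(8)]
exact = [sum(column) for column in zip(*buffers)]
approx = chain_allreduce_sum(buffers, target_error=1e-3)
max_error = max(abs(a - e) for a, e in zip(approx, exact))
```

Without such a budget, reusing the same bound on every hop would let the worst-case error grow with the number of hops, which is the uncontrolled distortion the summary refers to.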

The researchers evaluated the performance and accuracy of gZCCL using real-world applications and datasets, running on up to 512 NVIDIA A100 GPUs. Their results show that the gZCCL-accelerated collectives can outperform the widely-used NCCL and Cray MPI libraries by significant margins, while also maintaining high-quality reconstructed data.

Critical Analysis

The paper presents a compelling and well-designed framework for addressing the performance and accuracy challenges of using lossy compression in GPU-aware collective communication. The authors have clearly put a lot of thought into the design and implementation of gZCCL, and their experimental results are very promising.

One potential area for further research could be exploring the integration of more advanced compression techniques, such as those proposed in the GWLZ paper, to further improve the compression ratio and communication performance. Additionally, it would be interesting to see how the gZCCL framework might be extended to support decentralized optimization approaches, which are becoming increasingly important in modern machine learning applications.

Another area for potential improvement could be the development of more comprehensive performance models to better predict the behavior of the gZCCL framework under different workloads and hardware configurations. This could help users optimize the use of the framework for their specific needs.
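As a starting point for the kind of performance model suggested here, the sketch below extends the classic latency-bandwidth estimate for a ring Allreduce with a compression term. It is an assumption-laden illustration, not a model from the paper: the reduction cost is ignored, compression is not overlapped with communication, and the compression ratio and compressor throughput are hypothetical inputs.

```python
def ring_allreduce_time(num_ranks, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Latency-bandwidth estimate for an uncompressed ring Allreduce.

    A ring Allreduce takes 2 * (num_ranks - 1) steps, each moving one
    chunk of message_bytes / num_ranks. Reduction cost is ignored.
    """
    steps = 2 * (num_ranks - 1)
    chunk = message_bytes / num_ranks
    return steps * (latency_s + chunk / bandwidth_bytes_per_s)


def compressed_ring_allreduce_time(num_ranks, message_bytes, latency_s,
                                   bandwidth_bytes_per_s, compression_ratio,
                                   compressor_bytes_per_s):
    """Same estimate with each chunk compressed before it is sent.

    Assumption: every step pays one compression and one decompression
    of a full chunk, with no overlap between compute and transfer.
    """
    steps = 2 * (num_ranks - 1)
    chunk = message_bytes / num_ranks
    wire_time = chunk / compression_ratio / bandwidth_bytes_per_s
    codec_time = 2 * chunk / compressor_bytes_per_s
    return steps * (latency_s + wire_time + codec_time)


# Hypothetical parameters: 64 ranks, 256 MiB message, 5 us latency, 25 GB/s links.
baseline = ring_allreduce_time(64, 256 * 2**20, 5e-6, 25e9)
with_fast_codec = compressed_ring_allreduce_time(64, 256 * 2**20, 5e-6, 25e9, 20.0, 200e9)
with_slow_codec = compressed_ring_allreduce_time(64, 256 * 2**20, 5e-6, 25e9, 20.0, 1e9)
```

Even this toy model shows why the design matters: with a fast compressor the estimated time drops below the baseline, while a slow compressor makes the compressed collective slower than sending the raw data.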

Overall, the gZCCL framework represents a significant advancement in the field of GPU-aware collective communication, and the authors have done an excellent job of demonstrating its capabilities. With further research and development, this work has the potential to significantly improve the performance and efficiency of communication-bound HPC applications.

Conclusion

The gZCCL framework proposed in this paper represents a novel and effective approach to addressing the performance and accuracy challenges of using lossy compression in GPU-aware collective communication. By employing an accuracy-aware design and leveraging GPU-specific optimizations, the framework is able to deliver significant performance improvements over traditional solutions while maintaining high-quality reconstructed data.

The researchers' comprehensive evaluation, which included real-world applications and datasets running on large-scale GPU clusters, showcases the practical relevance and effectiveness of the gZCCL framework. This work has the potential to drive further advancements in the field of parallel and distributed computing, ultimately enabling more efficient and powerful computing platforms for a wide range of applications.

Related Papers

gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Jinyang Liu, Yafan Huang, Ken Raffenetti, Hui Zhou, Kai Zhao, Xiaoyi Lu, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. A traditional approach is to directly integrate lossy compression into GPU-aware collectives, which can lead to serious performance issues such as underutilized GPU devices and uncontrolled data distortion. In order to address these issues, in this paper, we propose gZCCL, a first-ever general framework that designs and optimizes GPU-aware, compression-enabled collectives with an accuracy-aware design to control error propagation. To validate our framework, we evaluate the performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, including both collective computation (Allreduce) and collective data movement (Scatter), can outperform NCCL as well as Cray MPI by up to 4.5X and 28.7X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.

5/8/2024

Accelerating Large Language Model Training with Hybrid GPU-based Compression

Lang Xu, Quentin Anthony, Qinghua Zhou, Nawras Alnaasan, Radha R. Gulhane, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda

Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and re-distribute gradients, activations, and other important model information, which pose significant overhead. Co-designed with GPU-based compression libraries, MPI libraries have been proven to reduce message size significantly, and leverage interconnect bandwidth, thus increasing training efficiency while maintaining acceptable accuracy. In this work, we investigate the efficacy of compression-assisted MPI collectives under the context of distributed LLM training using 3D parallelism and ZeRO optimizations. We scaled up to 192 V100 GPUs on the Lassen supercomputer. First, we enabled a naive compression scheme across all collectives and observed a 22.5% increase in TFLOPS per GPU and a 23.6% increase in samples per second for GPT-NeoX-20B training. Nonetheless, such a strategy ignores the sparsity discrepancy among messages communicated in each parallelism degree, thus introducing more errors and causing degradation in training loss. Therefore, we incorporated hybrid compression settings toward each parallel dimension and adjusted the compression intensity accordingly. Given their low-rank structure (arXiv:2301.02654), we apply aggressive compression on gradients when performing DP All-reduce. We adopt milder compression to preserve precision while communicating activations, optimizer states, and model parameters in TP and PP. Using the adjusted hybrid compression scheme, we demonstrate a 17.3% increase in TFLOPS per GPU and a 12.7% increase in samples per second while reaching baseline loss convergence.

9/5/2024

HiCCL: A Hierarchical Collective Communication Library

Mert Hidayetoglu, Simon Garcia de Gonzalo, Elliott Slaughter, Pinku Surana, Wen-mei Hwu, William Gropp, Alex Aiken

HiCCL (Hierarchical Collective Communication Library) addresses the growing complexity and diversity in high-performance network architectures. As GPU systems have evolved into networks of GPUs with different multilevel communication hierarchies, optimizing each collective function for a specific system has become a challenging task. Consequently, many collective libraries struggle to adapt to different hardware and software, especially across systems from different vendors. HiCCL's library design decouples the collective communication logic from network-specific optimizations through a compositional API. The communication logic is composed using multicast, reduction, and fence primitives, which are then factorized for a specified network hierarchy using only point-to-point operations within a level. Finally, striping and pipelining optimizations are applied as specified to streamline the execution. Performance evaluation of HiCCL across four different machines (two with Nvidia GPUs, one with AMD GPUs, and one with Intel GPUs) demonstrates an average 17× higher throughput than the collectives of highly specialized GPU-aware MPI implementations, and competitive throughput with those of vendor-specific libraries (NCCL, RCCL, and OneCCL), while providing portability across all four machines.

8/13/2024

Optimizing Distributed ML Communication with Fused Computation-Collective Operations

Kishore Punniyamurthy, Khaled Hamidouche, Bradford M. Beckmann

In order to satisfy their ever-increasing capacity and compute requirements, machine learning models are distributed across multiple nodes using numerous parallelism strategies. As a result, collective communications are often on the critical path, and hiding their latency by overlapping kernel-granular communication and computation is difficult due to the absence of independent computation. In this work, we propose fusing computation with dependent collective communication by leveraging GPUs' massive parallelism and GPU-initiated communication. We have developed self-contained GPU kernels where workgroups (WGs) immediately communicate their results to remote GPUs when they complete their computation. Meanwhile, other WGs within the same kernel perform overlapping computation, maintaining high ALU utilization. We demonstrate our approach by creating three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the pervasive communication overheads observed in DLRM, Transformers and MoE model architectures. In order to demonstrate that our approach can be integrated into ML frameworks for wide adoption in production environments, we expose our fused operators as new PyTorch operators as well as extend the Triton framework to enable them. Our evaluations show that our approach can effectively overlap communication with computation, reducing their combined execution time compared with current collective library-based approaches. Our scale-up GEMV + AllReduce and GEMM + All-to-All implementations achieve up to 22% and 20% lower execution time, while our fused embedding + All-to-All reduces execution time by 20% and 31% for intra-node and inter-node configurations. Large scale-out simulations indicate that our approach reduces DLRM execution time by 21% for a 128-node system.

4/24/2024