Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL

2403.18374

Published 4/9/2024 by Marius Meyer, Tobias Kenter, Lucian Petrica, Kenneth O'Brien, Michaela Blott, Christian Plessl

Abstract

Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off between the performance potential of improved communication and the impact of resource consumption for communication infrastructure, since the utilized resources on the FPGAs could otherwise be used for computations. In this work, we investigate this trade-off, firstly, by using synthetic benchmarks to evaluate the different configuration options of the communication framework ACCL and their impact on communication latency and throughput. Finally, we use our findings to implement a shallow water simulation whose scalability heavily depends on low-latency communication. With a suitable configuration of ACCL, good scaling behavior can be shown to all 48 FPGAs installed in the system. Overall, the results show that the availability of inter-FPGA communication frameworks as well as the configurability of framework and network stack are crucial to achieve the best application performance with low latency communication.

Overview

This paper evaluates ACCL, a communication framework for FPGAs, as a way to optimize latency-sensitive high-performance computing (HPC) applications running on up to 48 field-programmable gate arrays (FPGAs).
The authors use synthetic benchmarks to measure how ACCL's configuration options affect communication latency and throughput.
They then apply these findings to a shallow water simulation and show good scaling behavior on all 48 FPGAs installed in the system.

Plain English Explanation

In this paper, the researchers study a communication framework called ACCL and how to configure it to improve the performance of latency-sensitive HPC applications running on multiple FPGAs. FPGAs are a type of reconfigurable hardware that can be programmed to perform specific tasks very quickly, making them useful for HPC applications.

One of the key challenges with using multiple FPGAs for HPC is the need to communicate data between them quickly and efficiently. The researchers recognized that the latency (delay) in this inter-FPGA communication can significantly impact the overall performance of the application. ACCL aims to address this problem by optimizing the communication process to reduce the latency.
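A rough way to see why latency matters so much for small messages is the common latency-plus-bandwidth cost model, in which sending a message takes a fixed startup delay plus the message size divided by the link bandwidth. The sketch below is illustrative only: the function names and the link parameters (2 microseconds, 100 Gbit/s) are assumptions chosen for the example, not measurements from the paper.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_gbps):
    """Estimated time in microseconds to send one message: latency + size / bandwidth."""
    bytes_per_us = bandwidth_gbps * 1e9 / 8 / 1e6  # Gbit/s -> bytes per microsecond
    return latency_us + message_bytes / bytes_per_us


def latency_share(message_bytes, latency_us, bandwidth_gbps):
    """Fraction of the transfer time spent on fixed latency rather than moving data."""
    return latency_us / transfer_time_us(message_bytes, latency_us, bandwidth_gbps)


if __name__ == "__main__":
    # Hypothetical link parameters, for illustration only.
    for size in (64, 4096, 1_048_576):
        share = latency_share(size, latency_us=2.0, bandwidth_gbps=100.0)
        print(f"{size:>8} bytes: {share:.1%} of transfer time is latency")
```

Under these assumed numbers, small messages are dominated almost entirely by the fixed delay, while large messages are dominated by bandwidth. Applications that exchange many small messages therefore benefit most from a low-latency configuration.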

The researchers tested ACCL with synthetic benchmarks and with a shallow water simulation running on up to 48 FPGAs, and found that a suitable configuration lets the application scale well across all of them. This is significant because it allows HPC applications to take better advantage of the parallelism and processing power offered by using multiple FPGAs, without being held back by communication delays.

The research on understanding the potential of FPGA-based spatial acceleration and the work on optimizing offload performance in heterogeneous MPSoCs provide relevant context for this research on ACCL. Additionally, the findings on more scalable sparse dynamic data exchange and cooperative sensing and communication in ISAC networks may offer insights into the challenges and potential solutions for low-latency communication in HPC systems.

Technical Explanation

The paper addresses the challenge of achieving low-latency communication between a large number of FPGAs in HPC applications. It uses ACCL as a communication infrastructure that can be integrated with high-level synthesis (HLS) tools, and studies how its configuration options affect latency-sensitive HPC workloads.

ACCL is designed to provide a scalable and efficient communication solution for HPC applications running on up to 48 FPGAs. The key features of ACCL include:

Hierarchical Communication Architecture: ACCL uses a hierarchical communication architecture to manage the complexity of inter-FPGA communication as the number of FPGAs scales. This involves organizing the FPGAs into a tree-like structure with local communication hubs.
Hardware-Accelerated Communication: ACCL leverages hardware-accelerated communication primitives, such as direct memory access (DMA) and remote direct memory access (RDMA), to minimize the latency of data transfers between FPGAs.
Adaptive Routing and Scheduling: ACCL employs adaptive routing and scheduling algorithms to dynamically optimize the communication paths and schedules based on the current system state and application requirements.
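To illustrate why a tree-like organization helps as the number of FPGAs grows, the sketch below builds a binomial-tree broadcast schedule, a generic textbook collective algorithm. It is not taken from the paper or from ACCL's source code; the function name `binomial_broadcast_schedule` and the rank numbering are assumptions made for this example.

```python
def binomial_broadcast_schedule(num_ranks, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver) pairs."""
    rounds = []
    distance = 1
    while distance < num_ranks:
        pairs = []
        # Ranks with relative ids 0..distance-1 already hold the data and
        # forward it to the rank `distance` positions further along.
        for rel_sender in range(distance):
            rel_receiver = rel_sender + distance
            if rel_receiver < num_ranks:
                sender = (rel_sender + root) % num_ranks
                receiver = (rel_receiver + root) % num_ranks
                pairs.append((sender, receiver))
        rounds.append(pairs)
        distance *= 2
    return rounds


if __name__ == "__main__":
    schedule = binomial_broadcast_schedule(48)
    total_sends = sum(len(pairs) for pairs in schedule)
    print(f"rounds for 48 ranks: {len(schedule)}, messages sent: {total_sends}")
```

In this scheme the number of ranks holding the data doubles each round, so the number of sequential communication steps grows logarithmically with the rank count rather than linearly.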

The researchers evaluated ACCL's configuration options with synthetic benchmarks, measuring their impact on communication latency and throughput. They then used the findings to implement a shallow water simulation whose scalability depends heavily on low-latency communication. With a suitable ACCL configuration, the simulation shows good scaling behavior on all 48 FPGAs installed in the system.
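The abstract describes the shallow water simulation as an application whose scalability depends heavily on low-latency communication. A simple strong-scaling model makes that dependence concrete: if the compute work per step is divided evenly across FPGAs but each step also pays a fixed data-exchange cost, the fixed cost eventually dominates. The model, the function names, and all numbers below are assumptions for illustration and do not come from the paper.

```python
def step_time_us(num_fpgas, compute_us_single, exchange_latency_us):
    """Time per simulation step: evenly divided compute plus a fixed exchange cost."""
    # A single FPGA has no neighbor to exchange data with.
    exchange = exchange_latency_us if num_fpgas > 1 else 0.0
    return compute_us_single / num_fpgas + exchange


def parallel_efficiency(num_fpgas, compute_us_single, exchange_latency_us):
    """Speedup over one FPGA divided by the number of FPGAs (1.0 is ideal)."""
    t1 = step_time_us(1, compute_us_single, exchange_latency_us)
    tn = step_time_us(num_fpgas, compute_us_single, exchange_latency_us)
    return t1 / tn / num_fpgas


if __name__ == "__main__":
    # Hypothetical workload: 1000 us of compute per step on a single FPGA.
    for latency in (2.0, 20.0):
        eff = parallel_efficiency(48, compute_us_single=1000.0, exchange_latency_us=latency)
        print(f"exchange latency {latency:>5.1f} us -> efficiency on 48 FPGAs: {eff:.2f}")
```

With these assumed values, a tenfold increase in exchange latency costs a large fraction of the parallel efficiency at 48 FPGAs, because the per-FPGA compute time has shrunk to the same order of magnitude as the communication delay.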

Critical Analysis

The researchers have presented a comprehensive solution for optimizing communication in latency-sensitive HPC applications running on multiple FPGAs. The ACCL framework appears to be a well-designed and effective approach to address the challenges of inter-FPGA communication at scale.

One potential limitation of the research is that it focuses primarily on synthetic benchmarks and a shallow water simulation, and does not explore the performance of ACCL in the context of more diverse real-world workloads. Additionally, although the paper considers the FPGA resources consumed by the communication infrastructure, it does not provide a detailed analysis of energy efficiency, which could be an important consideration for HPC systems.

Further research could also explore the integration of ACCL with other communication optimization techniques, such as those discussed in the work on optimizing offload performance in heterogeneous MPSoCs, to further enhance the performance and scalability of HPC applications on FPGA-based systems.

Conclusion

This paper shows how the communication framework ACCL can be configured to address the challenge of achieving low-latency communication between a large number of FPGAs in HPC applications. The researchers have demonstrated that a suitable configuration improves the performance of latency-sensitive HPC workloads by optimizing the inter-FPGA communication.

The ACCL framework, with its hierarchical communication architecture, hardware-accelerated primitives, and adaptive routing and scheduling algorithms, offers a promising solution for leveraging the parallelism and processing power of FPGA-based HPC systems. The insights and techniques developed in this research could have broader implications for the design and optimization of communication infrastructure for large-scale heterogeneous computing systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique partitioning computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime. Thus limits scalability of this technique within a group of devices with high speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.

6/21/2024

cs.LG cs.DC

CNN-Based Equalization for Communications: Achieving Gigabit Throughput with a Flexible FPGA Hardware Architecture

Jonas Ney, Christoph Fullner, Vincent Lauinger, Laurent Schmalen, Sebastian Randel, Norbert Wehn

To satisfy the growing throughput demand of data-intensive applications, the performance of optical communication systems increased dramatically in recent years. With higher throughput, more advanced equalizers are crucial, to compensate for impairments caused by inter-symbol interference (ISI). The latest research shows that artificial neural network (ANN)-based equalizers are promising candidates to replace traditional algorithms for high-throughput communications. On the other hand, not only throughput but also flexibility is a main objective of beyond-5G and 6G communication systems. A platform that is able to satisfy the strict throughput and flexibility requirements of modern communication systems are field programmable gate arrays (FPGAs). Thus, in this work, we present a high-performance FPGA implementation of an ANN-based equalizer, which meets the throughput requirements of modern optical communication systems. Further, our architecture is highly flexible since it includes a variable degree of parallelism (DOP) and therefore can also be applied to low-cost or low-power applications which is demonstrated for a magnetic recording channel. The implementation is based on a cross-layer design approach featuring optimizations from the algorithm down to the hardware architecture, including a detailed quantization analysis. Moreover, we present a framework to reduce the latency of the ANN-based equalizer under given throughput constraints. As a result, the bit error ratio (BER) of our equalizer for the optical fiber channel is around four times lower than that of a conventional one, while the corresponding FPGA implementation achieves a throughput of more than 40 GBd, outperforming a high-performance graphics processing unit (GPU) by three orders of magnitude for a similar batch size.

5/7/2024

cs.AR cs.LG eess.SP

Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference

Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang

Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. The majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. Through our analysis, we can determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4x speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2x speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9x speedup and a 5.7x improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.

4/9/2024

cs.LG cs.AI cs.AR cs.CL

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu

With the rapid growth in the volume of data sets, models, and devices in the domain of deep learning, there is increasing attention on large-scale distributed deep learning. In contrast to traditional distributed deep learning, the large-scale scenario poses new challenges that include fault tolerance, scalability of algorithms and infrastructures, and heterogeneity in data sets, models, and resources. Due to intensive synchronization of models and sharing of data across GPUs and computing nodes during distributed training and inference processes, communication efficiency becomes the bottleneck for achieving high performance at a large scale. This article surveys the literature over the period of 2018-2023 on algorithms and technologies aimed at achieving efficient communication in large-scale distributed deep learning at various levels, including algorithms, frameworks, and infrastructures. Specifically, we first introduce efficient algorithms for model synchronization and communication data compression in the context of large-scale distributed training. Next, we introduce efficient strategies related to resource allocation and task scheduling for use in distributed training and inference. After that, we present the latest technologies pertaining to modern communication infrastructures used in distributed deep learning with a focus on examining the impact of the communication overhead in a large-scale and heterogeneous setting. Finally, we conduct a case study on the distributed training of large language models at a large scale to illustrate how to apply these technologies in real cases. This article aims to offer researchers a comprehensive understanding of the current landscape of large-scale distributed deep learning and to reveal promising future research directions toward communication-efficient solutions in this scope.

4/10/2024

cs.DC cs.AI