Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Read original: arXiv:2408.14090 - Published 8/27/2024 by Daniele De Sensi, Lorenzo Pichetti, Flavio Vella, Tiziano De Matteis, Zebin Ren, Luigi Fusco, Matteo Turisini, Daniele Cesarini, Kurt Lust, Animesh Trivedi and 4 others

Overview

Explores GPU-to-GPU communication and insights into supercomputer interconnects
Provides a technical explanation and critical analysis of the research
Covers experiment design, architecture, and key insights
Discusses limitations and areas for further research

Plain English Explanation

This paper investigates how GPUs (graphics processing units) communicate with each other in high-performance computing systems, such as supercomputers. GPUs are powerful processors that are commonly used for tasks like machine learning and scientific simulations. However, for these complex applications, GPUs need to be able to quickly share data with each other.

The researchers in this study looked at different ways that GPUs can be connected and how that affects their ability to communicate efficiently. They tested various interconnect technologies, which are the physical connections that allow the GPUs to transfer data. The goal was to understand the strengths and weaknesses of these interconnect options and provide insights that could help improve the design of future supercomputer systems.

Technical Explanation

The researchers conducted experiments using different supercomputer architectures, including systems with NVLink, InfiniBand, and PCIe interconnects. They measured various performance metrics, such as latency, bandwidth, and the time required to complete certain data-intensive tasks.
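As a rough illustration of how latency and bandwidth are commonly derived from a ping-pong style measurement, the sketch below turns round-trip timings into the two metrics. The function names and all timing values are hypothetical, invented for this example; they are not the paper's benchmark code or its measurements.

```python
from statistics import median


def one_way_latency_us(round_trip_times_s):
    """Estimate one-way latency in microseconds as half the median round trip."""
    return median(round_trip_times_s) / 2 * 1e6


def bandwidth_gb_per_s(message_bytes, round_trip_times_s):
    """Estimate unidirectional bandwidth in GB/s from round trips of a large message."""
    one_way_s = median(round_trip_times_s) / 2
    return message_bytes / one_way_s / 1e9


# Hypothetical timings in seconds (assumed values, not results from the paper).
small_msg_rtts = [4.2e-6, 4.0e-6, 4.4e-6]  # round trips of a 1-byte message
large_msg_rtts = [0.0210, 0.0200, 0.0205]  # round trips of a 1 GiB message

latency_us = one_way_latency_us(small_msg_rtts)
bandwidth = bandwidth_gb_per_s(2**30, large_msg_rtts)
print(f"latency ~ {latency_us:.2f} us, bandwidth ~ {bandwidth:.1f} GB/s")
```

Small messages isolate the fixed per-message cost, while large messages amortize it and expose the sustained transfer rate. Using the median rather than the mean is a design choice here, to reduce sensitivity to occasional slow iterations.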

The results showed that the choice of interconnect technology strongly affected GPU-to-GPU communication performance. For example, NVLink provided significantly higher bandwidth than InfiniBand or PCIe, allowing faster transfers of large data volumes between GPUs. InfiniBand, however, showed lower latency, which matters most for applications that exchange many small messages.
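One common way to reason about this bandwidth-versus-latency trade-off is the simple latency-bandwidth (alpha-beta) cost model, in which transfer time is a fixed latency plus message size divided by bandwidth. The sketch below is a minimal illustration of that model, not the paper's methodology; the two links and their numbers are hypothetical and do not describe any real NVLink or InfiniBand hardware.

```python
def transfer_time_s(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Alpha-beta model: fixed latency plus size divided by bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def crossover_bytes(link_a, link_b):
    """Message size at which both links take equal time, or None if there is none."""
    (lat_a, bw_a), (lat_b, bw_b) = link_a, link_b
    inv_bw_diff = 1 / bw_b - 1 / bw_a
    if inv_bw_diff == 0:
        return None
    size = (lat_a - lat_b) / inv_bw_diff
    return size if size > 0 else None


# Hypothetical (latency in seconds, bandwidth in bytes/s) pairs, assumed for illustration.
high_bandwidth_link = (3e-6, 100e9)
low_latency_link = (1e-6, 25e9)

crossover = crossover_bytes(high_bandwidth_link, low_latency_link)
for size in (1024, 2**20):
    t_high_bw = transfer_time_s(size, *high_bandwidth_link)
    t_low_lat = transfer_time_s(size, *low_latency_link)
    winner = "low-latency link" if t_low_lat < t_high_bw else "high-bandwidth link"
    print(f"{size} bytes: faster on the {winner}")
print(f"crossover at about {crossover:.0f} bytes")
```

Under these assumed numbers, messages below the crossover size finish sooner on the low-latency link and larger messages finish sooner on the high-bandwidth link, which is why neither metric alone determines which interconnect suits a workload.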

Critical Analysis

The paper provides a comprehensive analysis of GPU-to-GPU communication and offers valuable insights for the design of future supercomputer systems. However, it also acknowledges several limitations and areas for further research.

One limitation is that the experiments were conducted on a limited set of hardware configurations and interconnect technologies. The researchers suggest that expanding the scope of the study to include a wider range of systems and interconnects could provide additional insights.

Conclusion

This paper provides valuable insights into the challenges and opportunities of GPU-to-GPU communication in high-performance computing systems. The researchers have identified key factors that influence the performance of these interconnects, including the choice of technology, system architecture, and workload characteristics.

Related Papers

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Daniele De Sensi, Lorenzo Pichetti, Flavio Vella, Tiziano De Matteis, Zebin Ren, Luigi Fusco, Matteo Turisini, Daniele Cesarini, Kurt Lust, Animesh Trivedi, Duncan Roweth, Filippo Spiga, Salvatore Di Girolamo, Torsten Hoefler

Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing its limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.

8/27/2024

DFabric: Scaling Out Data Parallel Applications with CXL-Ethernet Hybrid Interconnects

Xu Zhang, Ke Liu, Yisong Chang, Hui Yuan, Xiaolong Zheng, Ke Zhang, Mingyu Chen

Emerging interconnects, such as CXL and NVLink, have been integrated into the intra-host topology to scale more accelerators and facilitate efficient communication between them, such as GPUs. To keep pace with the accelerator's growing computing throughput, the interconnect has seen substantial enhancement in link bandwidth, e.g., 256GBps for CXL 3.0 links, which surpasses Ethernet and InfiniBand network links by an order of magnitude or more. Consequently, when data-intensive jobs, such as LLM training, scale across multiple hosts beyond the reach limit of the interconnect, the performance is significantly hindered by the limiting bandwidth of the network infrastructure. We address the problem by proposing DFabric, a two-tier interconnect architecture. First, DFabric disaggregates a rack's computing units with an interconnect fabric, i.e., CXL fabric, which scales at rack-level, so that they can enjoy intra-rack efficient interconnecting. Second, DFabric disaggregates NICs from hosts, and consolidates them to form a NIC pool with CXL fabric. By providing sufficient aggregated capacity comparable to interconnect bandwidth, the NIC pool bridges efficient communication across racks or beyond the reach limit of interconnect fabric. However, local memory access becomes the bottleneck when enabling each host to utilize the NIC pool efficiently. To this end, DFabric builds a memory pool with sufficient bandwidth by disaggregating host local memory and adding more memory devices. We have implemented a prototype of DFabric that can run applications transparently. We validated its performance gain by running various microbenchmarks and compute-intensive applications such as DNN and graph.

9/10/2024

Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip

Luigi Fusco, Mikhail Khalilov, Marcin Chrapek, Giridhar Chukkapalli, Thomas Schulthess, Torsten Hoefler

Heterogeneous supercomputers have become the standard in HPC. GPUs in particular have dominated the accelerator landscape, offering unprecedented performance in parallel workloads and unlocking new possibilities in fields like AI and climate modeling. With many workloads becoming memory-bound, improving the communication latency and bandwidth within the system has become a main driver in the development of new architectures. The Grace Hopper Superchip (GH200) is a significant step in the direction of tightly coupled heterogeneous systems, in which all CPUs and GPUs share a unified address space and support transparent fine grained access to all main memory on the system. We characterize both intra- and inter-node memory operations on the Quad GH200 nodes of the new Swiss National Supercomputing Centre Alps supercomputer, and show the importance of careful memory placement on example workloads, highlighting tradeoffs and opportunities.

8/27/2024

Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor

Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang

As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication is enabled recently by employing high-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore IPU). It allows each core to directly access the fast scratchpad memory in other cores, which enables new parallel computing paradigms. However, without proper support for the scalable inter-core connections in current DL compilers, it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in this new architecture, T10 introduces a distributed tensor abstraction rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern, by partitioning DNN computation into sub-operators and mapping them to cores, so that the cores can exchange data following predictable patterns. T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and alleviates unnecessary inter-core communications. Our evaluation with a real inter-core connected AI chip, the Graphcore IPU, shows up to 3.3× performance improvement, and scalability support for larger models, compared to state-of-the-art DL compilers and vendor libraries.

8/12/2024
