Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor

Read original: arXiv:2408.04808 - Published 8/12/2024 by Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang


Overview

AI chips are incorporating more parallel cores to scale up deep learning (DL) computing.
Each core can directly access the fast scratchpad memory of other cores, enabling new parallel computing paradigms.
However, current DL compilers lack proper support for these inter-core communication capabilities.

Plain English Explanation

The paper presents T10, a new DL compiler that can effectively leverage the high-bandwidth and low-latency interconnections between cores on AI chips. This allows T10 to exploit the distributed on-chip memory and inter-core communication capabilities of these advanced AI architectures.

Modern AI chips incorporate numerous parallel cores to scale up DL computing power. On chips such as the Graphcore IPU, high-bandwidth, low-latency on-chip links let each core directly access the fast scratchpad memory of other cores, which enables new parallel computing paradigms.

Unfortunately, current DL compilers have not kept up with these hardware advancements. They lack the necessary support for the scalable inter-core connections, making it difficult for developers to fully utilize the benefits of this new architecture.

Technical Explanation

T10 introduces a distributed tensor abstraction called rTensor to formulate the computation and communication patterns of tensor operators on this new AI chip architecture. T10 then maps a DNN model to execution plans with a generalized compute-shift pattern, where the DNN computation is partitioned into sub-operators and mapped to different cores. This allows the cores to exchange data following predictable patterns.
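
The compute-shift idea can be illustrated with a small simulation. The sketch below is not T10's implementation and does not use its rTensor abstraction; it is a minimal, assumed example in plain Python where "cores" are list entries, the operator is a matrix multiplication, and the function names (`matmul`, `compute_shift_matmul`) are hypothetical. Each simulated core keeps one row block of A, holds one slice of B at a time, computes a partial result, and then passes its slice of B to a neighbor in a ring.

```python
def matmul(a, b):
    """Plain dense matmul on nested lists (reference and per-core kernel)."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def compute_shift_matmul(a, b, num_cores):
    """Simulate C = A @ B on `num_cores` cores with a ring compute-shift loop.

    Illustrative assumption: A and C are split by rows and stay put, while B is
    split along the shared K dimension and rotates between cores.
    """
    m, k, n = len(a), len(b), len(b[0])
    if m % num_cores or k % num_cores:
        raise ValueError("dimensions must divide evenly across cores")
    rows_per_core = m // num_cores
    k_per_core = k // num_cores

    # Data that never moves: each core's row block of A and of the output C.
    a_local = [a[c * rows_per_core:(c + 1) * rows_per_core]
               for c in range(num_cores)]
    c_local = [[[0] * n for _ in range(rows_per_core)]
               for _ in range(num_cores)]

    # Data that shifts: b_local[c] is the K-slice of B currently on core c,
    # and owner[c] records which slice index that is.
    b_local = [b[c * k_per_core:(c + 1) * k_per_core]
               for c in range(num_cores)]
    owner = list(range(num_cores))

    for _ in range(num_cores):
        # Compute phase: every core multiplies the matching columns of its
        # A block with the B slice it currently holds.
        for c in range(num_cores):
            start = owner[c] * k_per_core
            a_slice = [row[start:start + k_per_core] for row in a_local[c]]
            partial = matmul(a_slice, b_local[c])
            for i in range(rows_per_core):
                for j in range(n):
                    c_local[c][i][j] += partial[i][j]
        # Shift phase: core c receives the slice held by core c + 1 (ring).
        b_local = b_local[1:] + b_local[:1]
        owner = owner[1:] + owner[:1]

    # Gather the row blocks only to check the result; a real chip would not.
    return [row for block in c_local for row in block]
```

After `num_cores` rounds every core has seen every slice of B exactly once, so no core ever needs to hold all of B at the same time. The communication pattern is fixed in advance (always one slice to the ring neighbor), which is the kind of predictability the paragraph above describes.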

T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, and selects the best execution plan from a vast optimization space. This helps alleviate unnecessary inter-core communications, leading to significant performance improvements.
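
To make the memory versus communication trade-off concrete, here is a toy plan search. It is a sketch under loud assumptions, not the paper's cost model: the `Plan` fields, the `enumerate_plans` and `select_plan` helpers, and the formulas are all hypothetical. It assumes one operand (B) is cut into `split` slices that rotate between cores, so a larger `split` lowers per-core memory but increases the bytes each core must receive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    split: int         # number of slices the shifted operand is cut into
    mem_bytes: float   # estimated per-core memory footprint
    comm_bytes: float  # estimated bytes each core receives through shifts


def enumerate_plans(a_bytes, b_bytes, c_bytes, num_cores):
    """List candidate plans, one per split factor that divides num_cores."""
    plans = []
    for split in range(1, num_cores + 1):
        if num_cores % split:
            continue
        slice_bytes = b_bytes / split
        # Assumed model: A and C are evenly partitioned and never move;
        # each core holds one slice of B at a time.
        mem = a_bytes / num_cores + c_bytes / num_cores + slice_bytes
        # To see all of B, a core must receive the other (split - 1) slices.
        comm = slice_bytes * (split - 1)
        plans.append(Plan(split, mem, comm))
    return plans


def select_plan(plans, mem_capacity):
    """Pick the feasible plan with the least communication."""
    feasible = [p for p in plans if p.mem_bytes <= mem_capacity]
    if not feasible:
        raise ValueError("no plan fits in per-core memory")
    return min(feasible, key=lambda p: (p.comm_bytes, p.mem_bytes))
```

With made-up sizes, `select_plan(enumerate_plans(8_000_000, 8_000_000, 8_000_000, 8), 5_000_000)` picks `split == 4`: the two plans with less communication (`split` of 1 or 2) do not fit in the assumed 5 MB per-core budget. A real compiler would search a far larger space across all operators of a model, as the paragraph above notes.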

Critical Analysis

The paper demonstrates the potential benefits of compiler-level optimizations for exploiting the advanced capabilities of modern AI chips. However, the evaluations are limited to a single hardware platform, the Graphcore IPU. Further research is needed to understand the generalizability of the T10 approach to other types of AI chips and architectures.

Additionally, the paper does not discuss the potential complexity or engineering challenges involved in implementing the T10 compiler. Integrating such a specialized compiler into existing DL development workflows may require significant effort and coordination with hardware vendors.

Conclusion

The T10 compiler represents an important step towards leveraging the full potential of advanced AI chip architectures with scalable inter-core communication. By introducing a distributed tensor abstraction and a generalized compute-shift pattern, T10 achieves up to a 3.3x performance improvement over state-of-the-art DL compilers and vendor libraries in the authors' evaluation on the Graphcore IPU, and it scales to larger models. This research highlights the role of compiler-level optimizations in making deep learning computation more scalable and efficient.



Related Papers


Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor

Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, Jian Huang

As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication is enabled recently by employing high-bandwidth and low-latency interconnect links on the chip (e.g., Graphcore IPU). It allows each core to directly access the fast scratchpad memory in other cores, which enables new parallel computing paradigms. However, without proper support for the scalable inter-core connections in current DL compilers, it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in this new architecture, T10 introduces a distributed tensor abstraction rTensor. T10 maps a DNN model to execution plans with a generalized compute-shift pattern, by partitioning DNN computation into sub-operators and mapping them to cores, so that the cores can exchange data following predictable patterns. T10 makes globally optimized trade-offs between on-chip memory consumption and inter-core communication overhead, selects the best execution plan from a vast optimization space, and alleviates unnecessary inter-core communications. Our evaluation with a real inter-core connected AI chip, the Graphcore IPU, shows up to 3.3$\times$ performance improvement, and scalability support for larger models, compared to state-of-the-art DL compilers and vendor libraries.

8/12/2024

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu

With the rapid growth in the volume of data sets, models, and devices in the domain of deep learning, there is increasing attention on large-scale distributed deep learning. In contrast to traditional distributed deep learning, the large-scale scenario poses new challenges that include fault tolerance, scalability of algorithms and infrastructures, and heterogeneity in data sets, models, and resources. Due to intensive synchronization of models and sharing of data across GPUs and computing nodes during distributed training and inference processes, communication efficiency becomes the bottleneck for achieving high performance at a large scale. This article surveys the literature over the period of 2018-2023 on algorithms and technologies aimed at achieving efficient communication in large-scale distributed deep learning at various levels, including algorithms, frameworks, and infrastructures. Specifically, we first introduce efficient algorithms for model synchronization and communication data compression in the context of large-scale distributed training. Next, we introduce efficient strategies related to resource allocation and task scheduling for use in distributed training and inference. After that, we present the latest technologies pertaining to modern communication infrastructures used in distributed deep learning with a focus on examining the impact of the communication overhead in a large-scale and heterogeneous setting. Finally, we conduct a case study on the distributed training of large language models at a large scale to illustrate how to apply these technologies in real cases. This article aims to offer researchers a comprehensive understanding of the current landscape of large-scale distributed deep learning and to reveal promising future research directions toward communication-efficient solutions in this scope.

4/10/2024

Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

Daniele De Sensi, Lorenzo Pichetti, Flavio Vella, Tiziano De Matteis, Zebin Ren, Luigi Fusco, Matteo Turisini, Daniele Cesarini, Kurt Lust, Animesh Trivedi, Duncan Roweth, Filippo Spiga, Salvatore Di Girolamo, Torsten Hoefler

Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing its limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.

8/27/2024

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities

Yunze Wei, Tianshuo Hu, Cong Liang, Yong Cui

The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter numbers. Training such large-scale models typically requires massive memory and computing resources, necessitating distributed training. As GPU performance has rapidly evolved in recent years, computation time has shrunk, making communication a larger portion of the overall training time. Consequently, optimizing communication for distributed training has become crucial. In this article, we briefly introduce the general architecture of distributed deep neural network training and analyze relationships among Parallelization Strategy, Collective Communication Library, and Network from the perspective of communication optimization, which forms a three-layer paradigm. We then review current representative research advances within this three-layer paradigm. We find that layers in the current three-layer paradigm are relatively independent and there is a rich design space for cross-layer collaborative optimization in distributed training scenarios. Therefore, we advocate Vertical and Horizontal co-designs which extend the three-layer paradigm to a five-layer paradigm. We also advocate Intra-Inter and Host-Net co-designs to further utilize the potential of heterogeneous resources. We hope this article can shed some light on future research on communication optimization for distributed training.

8/30/2024