LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs

Read original: arXiv:2409.00918 - Published 9/4/2024 by Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang

Overview

This paper presents LuWu, an end-to-end in-network out-of-core optimizer for training 100B-scale models on distributed GPUs.
LuWu aims to address the challenges of training large language models on limited GPU memory by offloading compute and memory-intensive tasks to the network.
The system is designed for model-in-network data-parallel training, in which optimizer states and parameters are held by an in-network optimizer node instead of the GPU workers; the reported evaluation trains a 175B model on an 8-worker cluster.

Plain English Explanation

The researchers developed a system called LuWu to help train extremely large AI models that are too big to fit on a single GPU. These models can have over 100 billion parameters, which are the internal settings that determine how the model works.

Training such large models is a major challenge because the model and its training state do not fit in the limited memory of a single GPU. LuWu addresses this by offloading some of the computationally intensive tasks and memory requirements to the network infrastructure itself, rather than relying solely on the GPUs.

This model-in-network approach means that the model parameters and optimizer states are held in the network rather than on the GPUs, while the GPUs still train in parallel on different slices of the data. By taking advantage of the network's capabilities, LuWu enables efficient training of 100 billion-scale language models on commodity GPU clusters, where GPU memory would otherwise be the limiting factor.

Technical Explanation

The paper presents the design and evaluation of LuWu, an end-to-end in-network out-of-core optimizer for model-in-network data-parallel training of 100B-scale models on distributed GPUs. The core idea is to offload optimizer states, parameters, and collective communication from the GPU workers to the network infrastructure, so that models exceeding the memory capacity of individual GPUs can be trained efficiently.
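To make the memory pressure concrete, the following back-of-the-envelope sketch estimates the training state of a 100B-scale model. It assumes the common mixed-precision Adam layout of 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance) and a hypothetical 80 GB GPU. Neither figure comes from the LuWu paper, and activations are ignored.

```python
# Rough memory footprint of mixed-precision Adam training (illustrative only).
# ASSUMPTION: 16 bytes per parameter, a common accounting that is not taken from the paper.
BYTES_FP16_PARAM = 2   # half-precision weights used in forward/backward
BYTES_FP16_GRAD = 2    # half-precision gradients
BYTES_FP32_STATE = 12  # fp32 master weights + Adam momentum + Adam variance (4 bytes each)


def training_state_bytes(num_params: int) -> int:
    """Bytes for parameters, gradients, and optimizer states (activations excluded)."""
    return num_params * (BYTES_FP16_PARAM + BYTES_FP16_GRAD + BYTES_FP32_STATE)


def gpus_needed_just_for_state(num_params: int, gpu_mem_gb: int = 80) -> int:
    """Minimum number of GPUs whose combined memory could hold the training state.

    ASSUMPTION: gpu_mem_gb=80 is a hypothetical device size chosen for illustration.
    """
    total = training_state_bytes(num_params)
    per_gpu = gpu_mem_gb * 10**9
    return -(-total // per_gpu)  # ceiling division on integers


if __name__ == "__main__":
    for n in (100 * 10**9, 175 * 10**9):
        tb = training_state_bytes(n) / 10**12
        print(f"{n // 10**9}B params: {tb:.1f} TB of state, "
              f"at least {gpus_needed_just_for_state(n)} x 80 GB GPUs")
```

Under these assumptions, most of the footprint is optimizer state rather than the weights used for the forward pass, which is why moving that state off the GPU workers frees so much memory.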

The LuWu architecture consists of several key components:

In-Network Optimizer: A centralized optimizer node in the network holds the optimizer states and parameters and executes the parameter updates, relieving the GPU workers of that state and of the heavy CPU optimizer overhead seen in existing systems.
Out-of-Core Paging: Because the entire optimizer states and parameters live outside GPU memory, workers bring in parameters over the network as needed, circumventing the memory limitations of individual GPUs.
Model-in-Network Data Parallelism: Workers train data-parallel with a communication pattern similar to model-sharded data parallelism, while collective communication moves from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization.
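The division of labor between workers and a centralized optimizer can be sketched in a few lines. This is a single-process toy, not LuWu's implementation: the class `CentralOptimizerNode`, its `pull` and `push_and_step` methods, the toy linear model, and the data are all hypothetical, and the real system runs on SmartNIC and SmartSwitch hardware rather than in Python. The sketch only shows that workers can compute gradients while holding no optimizer state, with gradient averaging and the Adam update happening in one central place.

```python
import math
from typing import List, Sequence, Tuple

Batch = Sequence[Tuple[float, float]]

# Toy data from y = 3x + 1, split across two hypothetical workers.
SHARDS: List[Batch] = [[(0.0, 1.0), (1.0, 4.0)], [(2.0, 7.0), (3.0, 10.0)]]


class CentralOptimizerNode:
    """Hypothetical stand-in for an in-network optimizer node.

    It owns the parameters and the Adam states, so workers keep neither.
    """

    def __init__(self, params: Sequence[float], lr: float = 1e-3,
                 beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8):
        self.params = list(params)
        self.m = [0.0] * len(self.params)  # first-moment estimates
        self.v = [0.0] * len(self.params)  # second-moment estimates
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.step_count = 0

    def pull(self) -> List[float]:
        """Workers fetch a copy of the current parameters."""
        return list(self.params)

    def push_and_step(self, worker_grads: Sequence[Sequence[float]]) -> None:
        """Average the workers' gradients, then apply one Adam update."""
        n = len(worker_grads)
        avg = [sum(g[i] for g in worker_grads) / n for i in range(len(self.params))]
        self.step_count += 1
        t = self.step_count
        for i, g in enumerate(avg):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            m_hat = self.m[i] / (1 - self.beta1 ** t)  # bias correction
            v_hat = self.v[i] / (1 - self.beta2 ** t)
            self.params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def worker_gradient(params: Sequence[float], batch: Batch) -> List[float]:
    """Gradient of mean squared error for the toy model y = w * x + b."""
    w, b = params
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        gw += 2 * err * x
        gb += 2 * err
    return [gw / len(batch), gb / len(batch)]


def mse(params: Sequence[float]) -> float:
    """Mean squared error over all shards, used only to check progress."""
    w, b = params
    points = [p for shard in SHARDS for p in shard]
    return sum(((w * x + b) - y) ** 2 for x, y in points) / len(points)


def train(steps: int = 500) -> List[float]:
    node = CentralOptimizerNode([0.0, 0.0], lr=0.05)
    for _ in range(steps):
        params = node.pull()                                   # parameters come from the node
        grads = [worker_gradient(params, s) for s in SHARDS]   # each worker sees only its shard
        node.push_and_step(grads)                              # aggregation + update are central
    return node.params


if __name__ == "__main__":
    print("loss before:", mse([0.0, 0.0]))
    print("loss after: ", mse(train()))
```

Note that `worker_gradient` is stateless: everything that persists between steps lives in the node, which is the property the model-in-network paradigm relies on.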

The paper evaluates LuWu on training 100B-scale language models and compares it to a state-of-the-art training system. According to the abstract, LuWu outperforms that baseline by 3.98x when training a 175B model on an 8-worker cluster, making it practical to train models that GPU memory constraints would otherwise rule out.

Critical Analysis

The paper presents a compelling solution to the challenge of training extremely large language models on distributed GPU hardware. By leveraging the capabilities of the network infrastructure, LuWu is able to overcome the memory limitations of individual GPUs and enable the efficient training of 100B-scale models.

However, the paper also acknowledges several limitations and areas for further research:

Specialized Hardware Requirement: LuWu relies on specialized network hardware, such as SmartNICs and programmable switches, to perform the in-network optimizations. The availability and cost of such hardware may limit the widespread adoption of this approach.
Potential Network Bottlenecks: While LuWu aims to offload tasks to the network, there may be scenarios where the network itself becomes a bottleneck, limiting the overall performance gains.
Applicability to Other Model Architectures: The paper focuses on training large language models, and it's unclear how well the LuWu approach would generalize to other types of AI models, such as computer vision or reinforcement learning models.
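The network-bottleneck concern in the list above can be sized with simple arithmetic. The sketch below estimates how long it takes to move one full copy of a model's values over a single link. The 100 Gbps and 200 Gbps link speeds and the 175B model size appear in the paper's abstract; the 2-byte (fp16) value size is an assumption, and the estimate ignores latency, protocol overhead, and any overlap with computation.

```python
def transfer_seconds(num_params: int, bytes_per_value: int, link_gbps: float) -> float:
    """Time to move one full copy of the values over one link.

    Ignores latency, protocol overhead, and overlap with computation.
    """
    payload_bits = num_params * bytes_per_value * 8
    return payload_bits / (link_gbps * 10**9)


if __name__ == "__main__":
    params_175b = 175 * 10**9
    for gbps in (100, 200):
        # ASSUMPTION: 2 bytes per value (fp16); the paper may use a different format.
        secs = transfer_seconds(params_175b, bytes_per_value=2, link_gbps=gbps)
        print(f"{gbps} Gbps link: {secs:.0f} s per full copy of a 175B fp16 model")
```

Times of this magnitude per pass over the parameters suggest why the efficiency of the communication path, and not only where the optimizer runs, determines whether offloading pays off.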

It would be interesting to see further research exploring these limitations and investigating ways to address them, potentially expanding the applicability of in-network optimization techniques to a broader range of AI workloads and hardware configurations.

Conclusion

The LuWu system presented in this paper represents a significant advancement in the field of large-scale AI model training. By leveraging the capabilities of the network infrastructure, LuWu enables the efficient training of 100B-scale language models that would otherwise be infeasible due to GPU memory constraints.

The key insights from this research include the ability to offload compute and memory-intensive tasks to the network, the model-in-network data parallelism approach, and the out-of-core paging mechanism to overcome GPU memory limitations. These innovations have the potential to dramatically accelerate the development and deployment of extremely large AI models, which are becoming increasingly important for natural language processing, knowledge representation, and other AI-powered applications.

As the field of AI continues to push the boundaries of model size and complexity, the LuWu system and similar in-network optimization techniques may prove to be essential for enabling the training and deployment of the next generation of AI systems at scale.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs

Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang

The recent progress made in large language models (LLMs) has brought tremendous application prospects to the world. The growing model size demands LLM training on multiple GPUs, while data parallelism is the most popular distributed training strategy due to its simplicity, efficiency, and scalability. Current systems adopt the model-sharded data parallelism to enable memory-efficient training, however, existing model-sharded data-parallel systems fail to efficiently utilize GPU on a commodity GPU cluster with 100 Gbps (or 200 Gbps) inter-GPU bandwidth due to 1) severe interference between collective operation and GPU computation and 2) heavy CPU optimizer overhead. Recent works propose in-network aggregation (INA) to relieve the network bandwidth pressure in data-parallel training, but they are incompatible with model sharding due to the network design. To this end, we propose LuWu, a novel in-network optimizer that enables efficient model-in-network data-parallel training of a 100B-scale model on distributed GPUs. Such new data-parallel paradigm keeps a similar communication pattern as model-sharded data parallelism but with a centralized in-network optimizer execution. The key idea is to offload the entire optimizer states and parameters from GPU workers onto an in-network optimizer node and to offload the entire collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization. The experimental results show that LuWu outperforms the state-of-the-art training system by 3.98x when training on a 175B model on an 8-worker cluster.

9/4/2024

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of distributed LLM training and inference through an analytical framework that accurately considers compute, memory sub-system, network, and various parallelization strategies (model parallel, data parallel, pipeline parallel, and sequence parallel). We validate our performance predictions with published data from literature and relevant industry vendors (e.g., NVIDIA). For distributed training, we investigate the memory footprint of LLMs for different activation re-computation methods, dissect the key factors behind the massive performance gain from A100 to B200 ($\sim$35x speed-up closely following NVIDIA's scaling trend), and further run a design space exploration at different technology nodes (12 nm to 1 nm) to study the impact of logic, memory, and network scaling on the performance. For inference, we analyze the compute versus memory boundedness of different operations at a matrix-multiply level for different GPU systems and further explore the impact of DRAM memory technology scaling on inference latency. Utilizing our modeling framework, we reveal the evolution of performance bottlenecks for both LLM training and inference with technology scaling, thus, providing insights to design future systems for LLM training and inference.

7/23/2024

Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters

Weiyang Wang, Manya Ghobadi, Kayvon Shakeri, Ying Zhang, Naader Hasani

This paper presents a low-cost network architecture for training large language models (LLMs) at hyperscale. We study the optimal parallelization strategy of LLMs and propose a novel datacenter network design tailored to LLM's unique communication pattern. We show that LLM training generates sparse communication patterns in the network and, therefore, does not require any-to-any full-bisection network to complete efficiently. As a result, our design eliminates the spine layer in traditional GPU clusters. We name this design a Rail-only network and demonstrate that it achieves the same training performance while reducing the network cost by 38% to 77% and network power consumption by 37% to 75% compared to a conventional GPU datacenter. Our architecture also supports Mixture-of-Expert (MoE) models with all-to-all communication through forwarding, with only 8.2% to 11.2% completion time overhead for all-to-all traffic. We study the failure robustness of Rail-only networks and provide insights into the performance impact of different network and training parameters.

9/17/2024

Inference Performance Optimization for Large Language Models on CPUs

Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry. When GPU hardware resources are limited, we can explore alternative options on CPUs. To mitigate the financial burden and alleviate constraints imposed by hardware resources, optimizing inference performance is necessary. In this paper, we introduce an easily deployable inference performance optimization solution aimed at accelerating LLMs on CPUs. In this solution, we implement an effective way to reduce the KV cache size while ensuring precision. We propose a distributed inference optimization approach and implement it based on oneAPI Collective Communications Library. Furthermore, we propose optimization approaches for LLMs on CPU, and conduct tailored optimizations for the most commonly used models. The code is open-sourced at https://github.com/intel/xFasterTransformer.

7/11/2024