TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Read original: arXiv:2410.00531 - Published 10/2/2024 by Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu

Overview

The paper proposes TPI-LLM, a tensor parallel inference system for serving large language models (LLMs) at the 70-billion-parameter scale on low-resource edge devices.
TPI-LLM splits the model across multiple devices using tensor parallelism, which the authors argue can be more effective than pipeline parallelism on low-resource devices, and uses a sliding window memory scheduler to keep the memory footprint on each device small.
In experiments on emulated and real testbeds, TPI-LLM shows over 80% lower time-to-first-token and token latency than Accelerate and over 90% lower than Transformers and Galaxy, while cutting the peak memory footprint of Llama 2-70B by 90%.

Plain English Explanation

The paper explores a new way to run very large language models, which are a type of artificial intelligence that can understand and generate human-like text, on devices with limited computing power, like smartphones or small embedded systems. These large models can have up to 70 billion parameters, which are the numerical values that define the model's behavior.

Typically, running such huge models requires a lot of memory and processing power, making it difficult to use them on low-resource edge devices. The researchers propose a system called TPI-LLM that splits the model across multiple devices, so that each device works on its own slice of every layer in parallel. This tensor parallel approach, combined with loading layer weights from disk only around the time they are needed, reduces the memory needed on each device and speeds up the overall processing.
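
As a rough back-of-the-envelope check on why memory is the obstacle, the sketch below estimates the space needed just to hold the weights of a 70-billion-parameter model. The bytes-per-parameter values are the standard sizes of the named number formats, not figures taken from the paper, and the function name is made up for illustration.

```python
# Standard storage sizes per parameter for common number formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes).

    Ignores activations, the KV cache, and runtime overhead.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        gb = weight_memory_gb(70e9, dtype)
        print(f"70B parameters in {dtype}: {gb:.0f} GB")
```

Even in 16-bit precision the weights alone come to 140 GB, far more than a phone or small embedded board can hold at once, which is why the weights have to be spread across devices and streamed from disk.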

The results show that TPI-LLM responds much faster than the baselines it was compared against (Accelerate, Transformers, and Galaxy) while cutting the peak memory footprint of Llama 2-70B by 90%. This means large language models could potentially be deployed on a wider range of devices, bringing their capabilities to a broader range of applications and users.

Technical Explanation

The key contribution of the paper is TPI-LLM, a compute- and memory-efficient tensor parallel inference system that aims to serve 70B-scale LLMs on low-resource edge devices.

TPI-LLM works by partitioning the model's tensors (multidimensional arrays of parameters) across multiple devices, so that each device computes part of every layer and the partial results are combined with an allreduce step. This allows the devices to work in parallel and reduces the memory footprint on each individual device.
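
To make the idea of partitioning a tensor across devices concrete, here is a minimal sketch of one tensor parallel linear layer in plain Python. It is an illustration of the general technique, not the paper's implementation: the "devices" are simulated by a loop, the weight matrix is split by rows, and the function names are hypothetical.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def tensor_parallel_linear(x, w, num_devices):
    """Compute x @ w with the rows of w sharded across simulated devices.

    Each device holds only its slice of w and the matching slice of x,
    so per-device weight memory shrinks by a factor of num_devices.
    """
    k = len(x)
    assert k % num_devices == 0, "input size must divide evenly across devices"
    shard = k // num_devices

    partials = []
    for d in range(num_devices):
        x_d = x[d * shard:(d + 1) * shard]   # slice of the input for device d
        w_d = w[d * shard:(d + 1) * shard]   # rows of the weight held by device d
        partials.append(matmul(x_d, w_d))    # partial output, full length n

    # Allreduce (sum): element-wise sum of the partial outputs from all devices.
    return [sum(values) for values in zip(*partials)]


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    w = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(tensor_parallel_linear(x, w, num_devices=2))
    print(matmul(x, w))  # same result computed on a single device
```

The final sum is the communication step: every layer split this way needs one such exchange, which is why the cost of allreduce matters so much for tensor parallelism on slow networks.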

The paper first presents the observations that motivate the design of TPI-LLM: pipeline parallelism, the mainstream solution, is inefficient for single-user scenarios; tensor parallelism struggles with frequent communications; and link latency, not bandwidth, emerges as the main communication bottleneck.
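
The paper's abstract identifies link latency, rather than bandwidth, as the main communication bottleneck and uses this to motivate a star-based allreduce. The sketch below is a simplified cost model for that trade-off, written under assumptions that are not taken from the paper: a ring allreduce is modeled as 2(n-1) sequential steps that each move 1/n of the data, and a star allreduce as one hop to a central hub and one hop back, with the hub's link shared by all workers. The function names and the example numbers are hypothetical.

```python
def ring_allreduce_time(n, size_bytes, latency_s, bytes_per_s):
    """Ring allreduce: 2*(n-1) sequential steps, each moving size/n bytes."""
    steps = 2 * (n - 1)
    return steps * (latency_s + (size_bytes / n) / bytes_per_s)


def star_allreduce_time(n, size_bytes, latency_s, bytes_per_s):
    """Star allreduce: workers send to a hub, the hub sums and sends back.

    Assumes the hub's link carries the data of all n-1 workers in each
    direction, so only two latency hops are paid in total.
    """
    per_direction = latency_s + (n - 1) * size_bytes / bytes_per_s
    return 2 * per_direction


if __name__ == "__main__":
    # Hypothetical edge setting: 4 devices, 5 ms link latency,
    # 100 Mbit/s links, and a 32 KiB activation vector per allreduce.
    n, latency, bandwidth, size = 4, 0.005, 100e6 / 8, 32 * 1024
    print("ring:", ring_allreduce_time(n, size, latency, bandwidth))
    print("star:", star_allreduce_time(n, size, latency, bandwidth))
```

Under this model the ring pays the link latency 2(n-1) times while the star pays it only twice, so the star comes out ahead when messages are small and latency is high; with large messages on low-latency links the ring's better bandwidth use wins instead.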

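The TPI-LLM design is then described. Sensitive raw data stays local on the users' devices, a sliding window memory scheduler dynamically manages layer weights during inference so that disk I/O latency overlaps with computation and communication, and a star-based allreduce algorithm is used to address the link-latency bottleneck. The researchers also discuss techniques to optimize the partitioning and load balancing.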
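
The paper's abstract describes a sliding window memory scheduler that manages layer weights during inference and overlaps disk I/O with computation. The following is a minimal sketch of that general idea, not the authors' scheduler: `load_layer` is a hypothetical stand-in for reading weights from disk, `run_layer` is a toy computation, and a single background thread prefetches upcoming layers while the current one runs.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(index):
    """Stand-in for reading one layer's weights from disk (hypothetical)."""
    return {"scale": index + 1}


def run_layer(weights, x):
    """Toy layer computation using the loaded weights."""
    return x * weights["scale"] + 1


def sliding_window_inference(x, num_layers, window):
    """Run all layers while keeping at most `window` layers loaded or loading."""
    resident = {}  # layer index -> Future holding that layer's weights
    peak = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start loading the first `window` layers before any compute.
        for j in range(min(window, num_layers)):
            resident[j] = pool.submit(load_layer, j)

        for i in range(num_layers):
            peak = max(peak, len(resident))
            weights = resident[i].result()  # waits only if the prefetch is late
            x = run_layer(weights, x)       # later layers keep loading meanwhile
            del resident[i]                 # release this layer's weights
            nxt = i + window
            if nxt < num_layers:
                resident[nxt] = pool.submit(load_layer, nxt)  # slide the window
    return x, peak


if __name__ == "__main__":
    out, peak = sliding_window_inference(1, num_layers=8, window=3)
    print("output:", out, "peak resident layers:", peak)
```

The window size is the knob: a larger window hides more disk latency but raises peak memory, while a window of one degenerates into loading each layer only after the previous one has finished.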

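The experimental evaluation, run on both emulated and real testbeds, compares TPI-LLM with Accelerate, Transformers, and Galaxy. TPI-LLM shows over 80% lower time-to-first-token and token latency than Accelerate and over 90% lower than Transformers and Galaxy, while cutting the peak memory footprint of Llama 2-70B by 90%, requiring only 3.1 GB of memory for 70B-scale models.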

Critical Analysis

The paper presents a well-designed and thorough evaluation of the TPI-LLM approach, including a range of experiments and comparisons to existing techniques. However, there are a few potential limitations and areas for further research:

The evaluation is primarily focused on inference performance and does not explore the training and fine-tuning of large LLMs using the TPI-LLM approach. This could be an important area for future research, as the ability to efficiently train and update models on edge devices could have significant practical implications.
The paper does not provide detailed cost and energy efficiency analysis, which would be an important consideration for real-world deployment of TPI-LLM on resource-constrained edge devices.
The scalability of TPI-LLM to even larger models (e.g., 100B or 500B parameters) is not thoroughly investigated, and the potential challenges and limitations at these scales are not discussed.
The generalization of TPI-LLM to other types of large-scale models beyond just language models, such as vision transformers or multimodal models, could be an interesting direction for future research.

Conclusion

The TPI-LLM system presented in this paper offers a promising approach for efficiently serving large language models on low-resource edge devices. By combining tensor parallelism with a sliding window memory scheduler and a star-based allreduce, it reports substantially lower latency and peak memory than the baselines it was evaluated against.

This work addresses an important challenge in the field of edge AI, as the deployment of powerful large language models on a wide range of devices could unlock new applications and user experiences. The TPI-LLM technique represents a step forward in making such models more accessible and practical for real-world use cases.

While the paper provides a robust evaluation, there are some areas for further research and exploration, such as the integration of training and fine-tuning capabilities, a deeper analysis of cost and energy efficiency, and the potential for scaling to even larger models and other model types. As the field of edge AI continues to evolve, techniques like TPI-LLM can help bridge the gap between state-of-the-art AI and resource-constrained deployment environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu

Large model inference is shifting from cloud to edge due to concerns about the privacy of user interaction data. However, edge devices often struggle with limited computing power, memory, and bandwidth, requiring collaboration across multiple devices to run and speed up LLM inference. Pipeline parallelism, the mainstream solution, is inefficient for single-user scenarios, while tensor parallelism struggles with frequent communications. In this paper, we argue that tensor parallelism can be more effective than pipeline on low-resource devices, and present a compute- and memory-efficient tensor parallel inference system, named TPI-LLM, to serve 70B-scale models. TPI-LLM keeps sensitive raw data local in the users' devices and introduces a sliding window memory scheduler to dynamically manage layer weights during inference, with disk I/O latency overlapped with the computation and communication. This allows larger models to run smoothly on memory-limited devices. We analyze the communication bottleneck and find that link latency, not bandwidth, emerges as the main issue, so a star-based allreduce algorithm is implemented. Through extensive experiments on both emulated and real testbeds, TPI-LLM demonstrated over 80% less time-to-first-token and token latency compared to Accelerate, and over 90% compared to Transformers and Galaxy, while cutting the peak memory footprint of Llama 2-70B by 90%, requiring only 3.1 GB of memory for 70B-scale models.

10/2/2024

Distributed Inference Performance Optimization for LLMs on CPUs

Pujiang He, Shan Zhou, Changqing Li, Wenhuan Huang, Weifei Yu, Duyi Wang, Chen Meng, Sheng Gui

Large language models (LLMs) hold tremendous potential for addressing numerous real-world challenges, yet they typically demand significant computational resources and memory. Deploying LLMs onto a resource-limited hardware device with restricted memory capacity presents considerable challenges. Distributed computing emerges as a prevalent strategy to mitigate single-node memory constraints and expedite LLM inference performance. To reduce the hardware limitation burden, we proposed an efficient distributed inference optimization solution for LLMs on CPUs. We conduct experiments with the proposed solution on 5th Gen Intel Xeon Scalable Processors, and the result shows the time per output token for the LLM with 72B parameter is 140 ms/token, much faster than the average human reading speed about 200ms per token.

7/2/2024

Efficient LLM inference solution on Intel GPU

Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao

Transformer based Large Language Models (LLMs) have been widely used in many fields, and the efficiency of LLM inference becomes hot topic in real applications. However, LLMs are usually complicatedly designed in model structure with massive operations and perform inference in the auto-regressive mode, making it a challenging task to design a system with high efficiency. In this paper, we propose an efficient LLM inference solution with low latency and high throughput. Firstly, we simplify the LLM decoder layer by fusing data movement and element-wise operations to reduce the memory access frequency and lower system latency. We also propose a segment KV cache policy to keep key/value of the request and response tokens in separate physical memory for effective device memory management, helping enlarge the runtime batch size and improve system throughput. A customized Scaled-Dot-Product-Attention kernel is designed to match our fusion policy based on the segment KV cache solution. We implement our LLM inference solution on Intel GPU and publish it publicly. Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput for some popular LLMs on Intel GPU.

6/26/2024

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of distributed LLM training and inference through an analytical framework that accurately considers compute, memory sub-system, network, and various parallelization strategies (model parallel, data parallel, pipeline parallel, and sequence parallel). We validate our performance predictions with published data from literature and relevant industry vendors (e.g., NVIDIA). For distributed training, we investigate the memory footprint of LLMs for different activation re-computation methods, dissect the key factors behind the massive performance gain from A100 to B200 (~35x speed-up closely following NVIDIA's scaling trend), and further run a design space exploration at different technology nodes (12 nm to 1 nm) to study the impact of logic, memory, and network scaling on the performance. For inference, we analyze the compute versus memory boundedness of different operations at a matrix-multiply level for different GPU systems and further explore the impact of DRAM memory technology scaling on inference latency. Utilizing our modeling framework, we reveal the evolution of performance bottlenecks for both LLM training and inference with technology scaling, thus, providing insights to design future systems for LLM training and inference.

7/23/2024