Inference Acceleration for Large Language Models on CPUs

2406.07553

Published 6/13/2024 by Ditto PS, Jithin VG, Adarsh MS

Abstract

In recent years, large language models have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, deploying these models for real-world applications often requires efficient inference solutions to handle the computational demands. In this paper, we explore the utilization of CPUs for accelerating the inference of large language models. Specifically, we introduce a parallelized approach to enhance throughput by 1) exploiting the parallel processing capabilities of modern CPU architectures and 2) batching the inference requests. Our evaluation shows that the accelerated inference engine gives an 18-22x improvement in generated tokens per second. The improvement is greater with longer sequences and larger models. In addition, we can run multiple workers on the same machine with NUMA node isolation to further improve tokens/s; as shown in Table 2, we obtained a 4x additional improvement with 4 workers. This would also make Gen-AI based products and companies more environmentally friendly: our estimates show that using CPUs for inference could reduce the power consumption of LLMs by 48.9% while providing production-ready throughput and latency.

Overview

Large language models (LLMs) have shown impressive performance across various natural language processing (NLP) tasks.
Deploying these models for real-world applications often requires efficient inference solutions to handle the computational demands.
This paper explores the utilization of CPUs for accelerating the inference of large language models.

Plain English Explanation

Large language models (LLMs) are a type of artificial intelligence that can understand and generate human-like text. These models have become increasingly powerful in recent years, and can perform a wide range of natural language processing tasks with impressive results.

However, using these LLMs in real-world applications can be challenging because they require a lot of computational power to run. This paper looks at a way to make the inference (or the process of using the model to generate or analyze text) more efficient by using regular computer processors (CPUs) instead of specialized hardware like GPUs.

The researchers introduce a parallelized approach that takes advantage of the parallel processing capabilities of modern CPUs. It also groups multiple inference requests and processes them together (batching) to improve the overall throughput, that is, the rate at which the model can generate new text.

Their evaluation shows that this accelerated inference engine can provide an 18-22x improvement in the number of tokens (words or pieces of words) generated per second, compared to a baseline approach. The improvement is even greater for longer text sequences and larger models.

Additionally, the researchers found that by running four worker processes in parallel on the same machine, each isolated to its own NUMA node, they could get an additional 4x improvement in tokens per second. CPU-based inference could also make AI-powered products and companies more environmentally friendly: the researchers estimate that it could reduce the power consumption of LLMs by 48.9% while still providing the throughput and latency needed for production use.

Technical Explanation

The paper introduces a parallelized approach to accelerate the inference of large language models on CPU architectures. The key elements of their work include:

Exploiting Parallel Processing Capabilities: The researchers leverage the parallel processing capabilities of modern CPU architectures to improve the throughput of LLM inference. By processing multiple inference requests concurrently, they can better utilize the available CPU resources.
Batching Inference Requests: The approach includes batching multiple inference requests together, which allows for more efficient utilization of the CPU resources and higher overall throughput. This batching technique helps to amortize the overhead associated with each individual inference request.
Multi-Worker Parallelism: The researchers also explore running multiple worker processes in parallel on the same machine, with each worker isolated to its own Non-Uniform Memory Access (NUMA) node. Running the workers side by side further increases tokens per second; with 4 workers, the paper reports an additional 4x improvement.
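
The batching and multi-worker ideas above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`make_batches`, `pad_batch`, `partition_cpus`), the padding token id, the choice of right-padding, and the assumption that each NUMA node owns a contiguous, equally sized range of CPU ids are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

PAD_ID = 0  # hypothetical padding token id; real tokenizers define their own


def make_batches(requests: Sequence[List[int]], batch_size: int) -> List[List[List[int]]]:
    """Group tokenized requests into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [list(requests[i:i + batch_size]) for i in range(0, len(requests), batch_size)]


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Right-pad every sequence to the longest length in the batch,
    so the batch forms a rectangular array that can be processed in one pass."""
    width = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (width - len(seq)) for seq in batch]


def partition_cpus(cpu_ids: Sequence[int], num_workers: int) -> List[List[int]]:
    """Split CPU ids into contiguous, equally sized groups, one per worker.

    Assumption: each group corresponds to one NUMA node. Real machines may
    number their cores differently, so the actual topology should be checked.
    """
    if num_workers < 1 or len(cpu_ids) % num_workers != 0:
        raise ValueError("cpu_ids must divide evenly among the workers")
    per_worker = len(cpu_ids) // num_workers
    return [list(cpu_ids[i * per_worker:(i + 1) * per_worker]) for i in range(num_workers)]


# Three tokenized requests grouped into batches of two, then padded.
batches = make_batches([[1, 2, 3], [4], [5, 6]], batch_size=2)
padded_first = pad_batch(batches[0])

# Eight cores divided among four workers (two cores per worker).
cpu_groups = partition_cpus(range(8), num_workers=4)
```

On Linux, each worker process could then be restricted to its group with `os.sched_setaffinity(0, group)`. Affinity alone does not control where memory is allocated, so a tool such as `numactl` would typically be needed as well to keep a worker's memory on its own node.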

The paper's evaluation shows that the accelerated inference engine provides an 18-22x improvement in generated tokens per second compared to a baseline approach. This advantage is even more pronounced for longer sequence lengths and larger models. Additionally, the researchers estimate that the CPU-based inference could reduce the power consumption of LLMs by 48.9% while still meeting the throughput and latency requirements for production use.
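
To make the reported metrics concrete, the following sketch shows how a tokens-per-second figure and a speedup ratio might be computed, and what a 48.9% power reduction means as a remaining fraction. The token counts and timings are made-up illustrative values, not measurements from the paper, and the function names are hypothetical.

```python
def tokens_per_second(generated_tokens: int, elapsed_seconds: float) -> float:
    """Throughput: generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return generated_tokens / elapsed_seconds


def speedup(accelerated_tps: float, baseline_tps: float) -> float:
    """Ratio of accelerated throughput to baseline throughput."""
    return accelerated_tps / baseline_tps


def remaining_power_fraction(reduction_percent: float) -> float:
    """Fraction of the original power still consumed after a given reduction."""
    return 1.0 - reduction_percent / 100.0


# Illustrative numbers only (not from the paper).
baseline_tps = tokens_per_second(500, 100.0)
accelerated_tps = tokens_per_second(10_000, 100.0)
ratio = speedup(accelerated_tps, baseline_tps)

# The 48.9% figure is the paper's own estimate.
remaining = remaining_power_fraction(48.9)
```

With these hypothetical inputs the ratio is 20, which falls inside the 18-22x range the paper reports, and a 48.9% reduction leaves roughly 51% of the original power consumption.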

Critical Analysis

The paper presents a compelling approach to accelerating the inference of large language models using CPUs, which can be an important step towards making these powerful models more accessible and practical for real-world applications.

One potential limitation of the work is that it focuses solely on CPU-based inference and does not compare the performance to GPU-based or other specialized hardware accelerators. While the researchers provide estimates of power savings, a more comprehensive comparison across different hardware platforms could further strengthen the case for CPU-based inference.

Additionally, the paper does not delve into the potential trade-offs or limitations of the parallelized approach, such as the impact on model accuracy or the scalability of the technique to larger models or more complex NLP tasks. Further research may be needed to understand the broader applicability and the potential drawbacks of this approach.

It would also be valuable to see the researchers explore the integration of their techniques with other optimization methods, such as those discussed in the papers "Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference", "Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference", and "InferCept: Efficient Intercept Support for Augmented Large Language Model Inference". Combining multiple optimization techniques could lead to even greater performance and efficiency improvements.

Conclusion

This paper presents a promising approach to accelerating the inference of large language models using CPUs. By leveraging the parallel processing capabilities of modern CPUs and employing batching techniques, the researchers were able to achieve significant throughput improvements of 18-22x compared to a baseline approach.

Running multiple worker processes in parallel on the same machine, with NUMA node isolation, further improves performance: the paper reports an additional 4x improvement in tokens per second with 4 workers. Importantly, the researchers estimate that CPU-based inference could reduce the power consumption of LLMs by 48.9%, making these AI-powered technologies more environmentally friendly.

While the paper focuses on CPU-based optimization, integrating these techniques with other optimization methods, such as those explored in related research on FPGA-based acceleration and compute-in-memory architectures, could lead to even greater advancements in the efficiency and accessibility of large language models for real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Georgy Tyukin

Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Therefore model compression is important, to retain the performance of larger models, but with a reduced cost of running them. In this thesis we explore the methods of model compression, and we empirically demonstrate that the simple method of skipping latter attention sublayers in Transformer LLMs is an effective method of model compression, as these layers prove to be redundant, whilst also being incredibly computationally expensive. We observed a 21% speed increase in one-token generation for Llama 2 7B, whilst surprisingly and unexpectedly improving performance over several common benchmarks.

4/10/2024

cs.LG cs.AI cs.CL cs.PF

Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference

Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang

Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. The majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However, these methods often encounter challenges in achieving low latency due to considerable memory access overhead. This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs. Our approach involves the specialization of distinct hardware units for specific operators or layers, facilitating direct communication between them through a dataflow architecture while minimizing off-chip memory accesses. We introduce a comprehensive analytical model for estimating the performance of a spatial LLM accelerator, taking into account the on-chip compute and memory resources available on an FPGA. Through our analysis, we can determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. To enable more productive implementations of an LLM model on FPGAs, we further provide a library of high-level synthesis (HLS) kernels that are composable and reusable. This library will be made available as open-source. To validate the effectiveness of both our analytical model and HLS library, we have implemented BERT and GPT2 on an AMD Alveo U280 FPGA device. Experimental results demonstrate our approach can achieve up to 13.4x speedup when compared to previous FPGA-based accelerators for the BERT model. For GPT generative inference, we attain a 2.2x speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9x speedup and a 5.7x improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.

4/9/2024

cs.LG cs.AI cs.AR cs.CL

InferCept: Efficient Intercept Support for Augmented Large Language Model Inference

Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang

Large language models are increasingly integrated with external environments, tools, and agents like ChatGPT plugins to extend their capability beyond language-centric tasks. However, today's LLM inference systems are designed for standalone LLMs. They treat each external interaction as the end of LLM generation and form a new request when the interaction finishes, causing unnecessary recomputation of already computed contexts, which accounts for 37-40% of total model forwarding time. This paper presents InferCept, the first LLM inference framework targeting augmented LLMs and supporting the efficient interception of LLM generation. InferCept minimizes the GPU resource waste caused by LLM interceptions and dedicates saved memory for serving more requests. InferCept improves the overall serving throughput by 1.6x-2x and completes 2x more requests per second compared to the state-of-the-art LLM inference systems.

5/31/2024

cs.LG cs.CL cs.DC

Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference

Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura

Large language models (LLMs) have recently transformed natural language processing, enabling machines to generate human-like text and engage in meaningful conversations. This development necessitates speed, efficiency, and accessibility in LLM inference as the computational and memory requirements of these systems grow exponentially. Meanwhile, advancements in computing and memory capabilities are lagging behind, exacerbated by the discontinuation of Moore's law. With LLMs exceeding the capacity of single GPUs, they require complex, expert-level configurations for parallel processing. Memory accesses become significantly more expensive than computation, posing a challenge for efficient scaling, known as the memory wall. Here, compute-in-memory (CIM) technologies offer a promising solution for accelerating AI inference by directly performing analog computations in memory, potentially reducing latency and power consumption. By closely integrating memory and compute elements, CIM eliminates the von Neumann bottleneck, reducing data movement and improving energy efficiency. This survey paper provides an overview and analysis of transformer-based models, reviewing various CIM architectures and exploring how they can address the imminent challenges of modern AI computing systems. We discuss transformer-related operators and their hardware acceleration schemes and highlight challenges, trends, and insights in corresponding CIM designs.

6/13/2024

cs.AR cs.LG