Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

2310.19102

Published 4/17/2024 by Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

cs.LG


Abstract

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization in the serving context. Atom improves end-to-end throughput (token/s) by up to $7.7\times$ compared to the FP16 and by $2.5\times$ compared to INT8 quantization, while maintaining the same latency target.


Overview

Large Language Models (LLMs) are in high demand for various applications like content generation, chatbots, and sentiment analysis.
To use GPU resources efficiently and boost throughput, service providers often batch multiple requests.
Quantization techniques, like 8-bit weight-activation quantization, are used to reduce memory consumption and increase computing capacity.
However, existing quantization schemes cannot fully leverage modern GPU capabilities, leading to sub-optimal performance.

Plain English Explanation

Large Language Models (LLMs) are powerful AI systems that can perform a wide range of tasks, from generating human-like text to analyzing the sentiment of written content. As the demand for these models has grown, service providers have faced the challenge of using their computing resources, like GPUs, as efficiently as possible.

One popular approach is to batch multiple requests together, processing them in a single, larger operation. This can significantly boost the overall throughput, that is, the number of requests the system can handle in a given time. To speed up batched serving further, providers apply quantization, which reduces the amount of memory the models need and increases the amount of computation the GPU can perform per second.

Quantization involves representing the model's numerical values, such as its weights (the learned parameters) and its activations (the intermediate results computed for each input), using fewer bits of information. For example, instead of using 16-bit or 32-bit floating-point numbers, the values can be stored and processed using only 8 bits. This saves memory and allows for faster computations.
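To make the idea concrete, here is a minimal sketch of symmetric quantization in plain Python. It is an illustration only, not the scheme used in the paper: the function names, the sample values, and the choice of a single shared scale are all assumptions made for this example.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers of the given bit width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [x * scale for x in q]


# Hypothetical weight values, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]

q8, s8 = quantize_symmetric(weights, bits=8)
q4, s4 = quantize_symmetric(weights, bits=4)

err8 = max(abs(a - b) for a, b in zip(weights, dequantize(q8, s8)))
err4 = max(abs(a - b) for a, b in zip(weights, dequantize(q4, s4)))

print("8-bit integers:", q8, "max error:", err8)
print("4-bit integers:", q4, "max error:", err4)
```

Running it shows the basic trade-off: the 4-bit version needs half the storage of the 8-bit version, but its rounding error is larger because only 15 integer levels are available.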

However, the quantization methods commonly used today, such as 8-bit weight-activation quantization, don't fully take advantage of the latest GPU hardware capabilities, which can handle even lower bit-width operations (like 4-bit integers). As a result, the potential performance improvements from quantization are not being realized to the fullest extent.
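As a rough illustration of why 4-bit values save memory, the sketch below packs two signed 4-bit integers into each byte and unpacks them again. The nibble order and the helper names are assumptions made for this example; real GPU kernels use hardware-specific layouts that this summary does not describe.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (range -8..7) two per byte, low nibble first."""
    if any(v < -8 or v > 7 for v in values):
        raise ValueError("values must fit in a signed 4-bit range (-8..7)")
    padded = list(values) + ([0] if len(values) % 2 else [])
    packed = bytearray()
    for lo, hi in zip(padded[0::2], padded[1::2]):
        # Masking with 0xF keeps the two's-complement bit pattern of each value.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Inverse of pack_int4: return the first `count` signed 4-bit integers."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out[:count]


quantized = [-8, 7, 0, -1, 3]  # hypothetical 4-bit values
packed = pack_int4(quantized)
restored = unpack_int4(packed, len(quantized))
print(len(quantized), "values stored in", len(packed), "bytes")
```

Five values fit in three bytes here, whereas 8-bit storage would need five bytes and 16-bit storage ten.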

Technical Explanation

To address this issue, the researchers introduced a new quantization method called Atom. Atom aims to maximize the serving throughput of LLMs by leveraging low-bit operators and fine-grained quantization techniques.

Atom achieves high throughput improvements with negligible accuracy loss by:

Using low-bit operators (e.g., 4-bit integer) to speed up the computation.
Applying a novel mixed-precision and fine-grained quantization process to maintain high model accuracy.
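The summary does not spell out how the mixed-precision and fine-grained steps work, so the following is only a toy sketch of one common reading of those terms: "fine-grained" is taken to mean that each small group of values gets its own scale, and "mixed-precision" is taken to mean that a few large outlier values are kept at a higher bit width. The group size, the number of outliers, the 8-bit choice for outliers, and all function names are assumptions, not details from the paper.

```python
def quantize_group(values, bits):
    """Quantize one group with a single shared scale; return the dequantized floats."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantize_fine_grained(values, bits=4, group_size=4):
    """Fine-grained (assumed meaning): every group of `group_size` values has its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_group(values[start:start + group_size], bits))
    return out


def quantize_mixed(values, n_outliers=1, low_bits=4, high_bits=8, group_size=4):
    """Mixed precision (assumed meaning): largest-magnitude values use high_bits, the rest low_bits."""
    # Sort positions by magnitude so the outliers end up at the tail.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    split = len(values) - n_outliers
    normal_idx, outlier_idx = order[:split], order[split:]
    normal = quantize_fine_grained([values[i] for i in normal_idx], low_bits, group_size)
    outliers = quantize_group([values[i] for i in outlier_idx], high_bits)
    # Scatter the results back to their original positions.
    result = [0.0] * len(values)
    for i, v in zip(normal_idx + outlier_idx, normal + outliers):
        result[i] = v
    return result


def max_error(original, approx):
    return max(abs(a - b) for a, b in zip(original, approx))


# Hypothetical activations with one large outlier (12.0).
activations = [0.1, -0.2, 0.15, 0.05, 12.0, -0.3, 0.25, 0.08]

per_tensor = quantize_fine_grained(activations, bits=4, group_size=len(activations))
mixed = quantize_mixed(activations)

print("single-scale 4-bit max error:", max_error(activations, per_tensor))
print("mixed + grouped max error:   ", max_error(activations, mixed))
```

With a single 4-bit scale, the outlier stretches the range so far that every small value rounds to zero. Handling the outlier separately and giving each group its own scale keeps the small values distinguishable, which is the intuition behind combining the two techniques.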

The researchers evaluated Atom on 4-bit weight-activation quantization in the serving context. Atom was able to improve end-to-end throughput (measured in tokens per second) by up to 7.7 times compared to using 16-bit floating-point operations, and by 2.5 times compared to 8-bit integer quantization, while maintaining the same latency target.

Critical Analysis

The Atom paper presents a promising approach to improving the efficiency of LLM serving, but it's important to consider some potential limitations and areas for further research:

The evaluations were conducted on a specific set of LLMs and tasks. It would be valuable to see how Atom performs on a wider range of models and applications to assess its generalizability.
The paper focuses on throughput improvements, but the impact on other important metrics, such as energy consumption and hardware utilization, could be explored further.
The fine-grained quantization techniques used in Atom may introduce additional complexity in the deployment and optimization of LLM serving systems. The trade-offs between the performance gains and the engineering overhead should be carefully considered.

Overall, the Atom paper makes a valuable contribution to the ongoing efforts to efficiently deploy large language models and mitigate the impact of quantization on model quality. Continued research in this area could lead to significant improvements in the scalability and cost-effectiveness of LLM-powered applications.

Conclusion

The Atom paper presents a novel quantization method that can significantly improve the serving throughput of large language models with negligible accuracy loss. By leveraging low-bit operators and a mixed-precision, fine-grained quantization process, Atom boosts end-to-end throughput by up to 7.7 times compared to 16-bit floating-point operations and by 2.5 times compared to 8-bit integer quantization, while maintaining the same latency target.

This work is an important step towards deploying large language models more efficiently while limiting the impact of quantization on model quality. As the demand for LLMs continues to grow, innovations like Atom will be crucial in ensuring these powerful AI systems can be used cost-effectively and at scale, benefiting a wide range of applications and industries.
