AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

2306.00978

Published 4/23/2024 by Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han

cs.CL

Abstract

Large language models (LLMs) have fundamentally transformed the capabilities of numerous applications, from natural language processing to more intricate domain-specific tasks in robotics and autonomous driving. Moreover, the importance of on-device LLMs has grown significantly in recent years. Running LLMs on edge devices not only promises reduced latency and improved user experience but also aligns with the increasing need for user privacy, as data processing can occur locally. However, the astronomical model sizes of modern LLMs and the constraints of edge devices, primarily in terms of memory size and bandwidth, pose significant deployment challenges. In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activation, not weights. AWQ does not rely on any backpropagation or reconstruction, so it can well preserve LLMs' generalization ability on different domains and modalities, without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for on-device LLM/VLMs, offering more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.

Overview

Large language models (LLMs) have significantly improved numerous applications, from natural language processing to robotics and autonomous driving.
The importance of running LLMs on edge devices has grown, as it promises reduced latency, improved user experience, and better user privacy.
However, the large model sizes and constraints of edge devices pose significant deployment challenges.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. These models have revolutionized many industries, from helping computers communicate in natural language to powering advanced robotics and self-driving cars.

One exciting development is the ability to run these LLMs on edge devices, like smartphones and tablets. This local processing offers several benefits, such as faster response times, better privacy (since data doesn't need to be sent to a remote server), and a smoother user experience. Imagine asking your phone a question and getting an instant, personalized response, without your information leaving the device.

However, deploying these massive LLMs on edge devices is challenging. The models are extremely large, often with billions of parameters, while edge devices have limited memory capacity and memory bandwidth. It's like trying to fit a skyscraper into a tiny shed: the pieces just don't fit.

Technical Explanation

This paper presents a new approach called Activation-aware Weight Quantization (AWQ) to address the challenge of running LLMs on edge devices. AWQ is a low-bit, weight-only quantization method: it stores the model's weights (the internal parameters that define its behavior) at reduced precision. The key insight is that not all weights are equally important. The researchers observe that protecting only the 1% most salient weights can greatly reduce quantization error.
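
To make "low-bit weight-only quantization" concrete, here is a minimal sketch of plain round-to-nearest quantization for one group of weights. It is a generic baseline, not the paper's method: the function names, the 4-bit setting, and the example values are illustrative assumptions.

```python
def quantize_group(weights, n_bits=4):
    """Map one group of float weights to integer codes plus (scale, offset)."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** n_bits - 1            # 15 steps for 4-bit codes 0..15
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate float weights from the integer codes."""
    return [c * scale + lo for c in codes]


# Hypothetical example values, chosen only for illustration.
group = [-0.62, -0.10, 0.03, 0.07, 0.25, 0.88]
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)

# Each restored weight is at most half a quantization step from the original.
max_error = max(abs(w - r) for w, r in zip(group, restored))
print(codes, round(scale, 4), round(max_error, 4))
```

Every weight in the group shares the same step size, so one large weight widens the grid for all the others. That rounding error is what protecting the salient weights is meant to reduce.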

The unique aspect of AWQ is that it identifies the salient weights by observing the model's activations (the intermediate outputs during the computation) rather than the weights themselves, and then searches for a per-channel scaling that protects those weights during quantization. Because this search involves no backpropagation or reconstruction, the authors argue that AWQ generalizes better to different domains and modalities without overfitting to the calibration set.
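
The idea of searching for a per-channel scaling from activation statistics can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the scale for each input channel is assumed to be its mean absolute activation raised to a power `alpha`, `alpha` is chosen by a small grid search, weights are quantized row by row, and all names and data are hypothetical.

```python
import random


def fake_quantize(row, n_bits=4):
    """Round-to-nearest quantize, then dequantize, one row of weights."""
    lo, hi = min(row), max(row)
    step = (hi - lo) / (2 ** n_bits - 1) or 1.0
    return [round((w - lo) / step) * step + lo for w in row]


def matmul(W, X):
    """W is [out][in], X is [in][samples]; returns [out][samples]."""
    cols = list(zip(*X))
    return [[sum(w * x for w, x in zip(row, col)) for col in cols] for row in W]


def output_error(W, X, scales, n_bits=4):
    """Mean squared error between full-precision and quantized layer outputs."""
    # Scale input channel j of W up by s_j before quantizing,
    # and fold the inverse 1/s_j into the matching activation channel.
    W_q = [fake_quantize([w * s for w, s in zip(row, scales)], n_bits) for row in W]
    X_s = [[x / s for x in channel] for channel, s in zip(X, scales)]
    ref, out = matmul(W, X), matmul(W_q, X_s)
    diffs = [(a - b) ** 2 for r, o in zip(ref, out) for a, b in zip(r, o)]
    return sum(diffs) / len(diffs)


def search_scales(W, X, n_bits=4, grid=20):
    """Grid-search alpha in [0, 1]; scales are (mean |activation|) ** alpha."""
    act_mag = [max(sum(abs(x) for x in ch) / len(ch), 1e-8) for ch in X]
    best = None
    for i in range(grid + 1):
        alpha = i / grid
        scales = [m ** alpha for m in act_mag]
        err = output_error(W, X, scales, n_bits)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best


# Toy calibration data (assumption): channel 3 carries much larger activations.
random.seed(0)
n_out, n_in, n_samples = 8, 16, 32
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
X = [[random.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_in)]
X[3] = [50 * x for x in X[3]]

baseline = output_error(W, X, [1.0] * n_in)
best_err, best_alpha, best_scales = search_scales(W, X)
print(f"round-to-nearest error: {baseline:.4f}")
print(f"best alpha: {best_alpha:.2f}, scaled error: {best_err:.4f}")
```

In this sketch `alpha = 0` gives scales of exactly 1, which reproduces plain round-to-nearest, so the searched result can never be worse than the baseline on the calibration data. The search only evaluates forward outputs, which matches the summary's point that no backpropagation is involved.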

The paper also introduces TinyChat, an efficient and flexible inference framework tailored for running LLMs and VLMs on edge devices. TinyChat achieves more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs, and makes it possible to deploy the 70B-parameter Llama-2 model on mobile GPUs.
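
A quick back-of-the-envelope calculation shows why low-bit weights matter for a 70B-parameter model. The sketch below counts weight storage only and assumes 4-bit weights as an illustrative setting; it ignores quantization metadata (scales and offsets), activations, and the KV cache, so real footprints are somewhat larger.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 70_000_000_000  # 70B parameters
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

Going from 16 bits to 4 bits per weight cuts weight storage by a factor of four, which also reduces the memory traffic needed to read the weights during inference.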

Critical Analysis

The paper presents a compelling approach to the problem of deploying LLMs on edge devices, but there are a few potential areas for further exploration:

The authors note that AWQ does not rely on any backpropagation or reconstruction, and present this as the reason it avoids overfitting to the calibration set. A possible trade-off is that the method cannot adjust weights to recover accuracy lost to quantization, so it would be interesting to see how well it holds up across a wider range of LLM architectures and applications.
The paper focuses on weight quantization, but other techniques, such as model pruning or distillation, could further reduce the model size and improve performance on edge devices.
The evaluation covers language modeling, domain-specific tasks such as coding and math, and instruction-tuned and multi-modal LMs. It would be valuable to assess the approach on a more diverse set of applications, including tasks that combine text and images with other modalities.

Overall, the research presents a promising step towards making powerful LLMs more accessible and practical for real-world, on-device applications.

Conclusion

The paper introduces a novel quantization technique called Activation-aware Weight Quantization (AWQ) that enables efficient and accurate deployment of large language models (LLMs) on edge devices. By protecting the most salient weights through a per-channel scaling derived from activation data, AWQ reduces quantization error while maintaining the models' generalization abilities.

Alongside AWQ, the researchers developed TinyChat, an efficient inference framework that further boosts the performance of LLMs on mobile and desktop GPUs. These advancements could pave the way for a new generation of intelligent, privacy-preserving applications that bring the power of LLMs directly to users' fingertips.

As the field of on-device AI continues to evolve, this work highlights the importance of innovative approaches that address the unique challenges of running large-scale models on resource-constrained edge devices. By bridging the gap between cutting-edge AI and practical real-world deployment, the researchers have made a valuable contribution to the ongoing quest to democratize the benefits of advanced language models.

Related Papers

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

cs.CL cs.AI cs.LG

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, Hao Yu

Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, the high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show APTQ surpasses previous quantization methods, achieving a perplexity of 5.22 on the C4 dataset at an average bit width of 4, nearly equivalent to full precision. In addition, APTQ attains state-of-the-art zero-shot accuracy of 68.24% and 70.48% at an average bitwidth of 3.8 in LLaMa-7B and LLaMa-13B, respectively, demonstrating its effectiveness in producing high-quality quantized LLMs.

4/17/2024

cs.LG cs.AI cs.CL

On the Compressibility of Quantized Large Language Models

Yu Mao, Weilan Wang, Hongchao Du, Nan Guan, Chun Jason Xue

Deploying Large Language Models (LLMs) on edge or mobile devices offers significant benefits, such as enhanced data privacy and real-time processing capabilities. However, it also faces critical challenges due to the substantial memory requirement of LLMs. Quantization is an effective way of reducing the model size while maintaining good performance. However, even after quantization, LLMs may still be too big to fit entirely into the limited memory of edge or mobile devices and have to be partially loaded from the storage to complete the inference. In this case, the I/O latency of model loading becomes the bottleneck of the LLM inference latency. In this work, we take a preliminary step of studying applying data compression techniques to reduce data movement and thus speed up the inference of quantized LLM on memory-constrained devices. In particular, we discussed the compressibility of quantized LLMs, the trade-off between the compressibility and performance of quantized LLMs, and opportunities to optimize both of them jointly.

5/7/2024

cs.LG cs.AI cs.CL

Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization

Aniruddha Nrusimha, Mayank Mishra, Naigang Wang, Dan Alistarh, Rameswar Panda, Yoon Kim

We consider the problem of accurate quantization for language models, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bitwidth format natively supported by GPU hardware. In this context, the key challenge is activation quantization: it is known that language models contain outlier channels whose values on average are orders of magnitude higher than other channels, which prevents accurate low-bitwidth quantization with known techniques. We systematically study this phenomenon and find that these outlier channels emerge early in training, and that they occur more frequently in layers with residual streams. We then propose a simple strategy which regularizes a layer's inputs via quantization-aware training (QAT) and its outputs via activation kurtosis regularization. We show that regularizing both the inputs and outputs is crucial for preventing the model from migrating the difficulty of input quantization to the weights, which makes post-training quantization (PTQ) of weights more difficult. When combined with weight PTQ, we show that our approach can obtain a W4A4 model that performs competitively with the standard-precision W16A16 baseline.

4/5/2024

cs.LG cs.CL