MobileQuant: Mobile-friendly Quantization for On-device Language Models

Read original: arXiv:2408.13933 - Published 8/27/2024 by Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez

Overview

Introduces MobileQuant, a post-training quantization method for deploying large language models (LLMs) on-device
Targets integer-only quantization, including 8-bit activations, so that models can run on mobile-friendly hardware such as Neural Processing Units (NPUs)
Jointly optimizes the weight transformation and activation range parameters end-to-end to reduce latency and energy consumption with near-lossless accuracy

Plain English Explanation

MobileQuant: Mobile-friendly Quantization for On-device Language Models is a research paper that presents a new approach for making large language models (LLMs) more efficient and accessible on mobile devices. LLMs are powerful AI models that can understand and generate human-like text, but they are typically very large and computationally intensive, making them challenging to deploy on resource-constrained mobile devices.

The key idea behind MobileQuant is to quantize the LLM, reducing its memory footprint and inference latency without significantly impacting its accuracy. Quantization reduces the numerical precision of the model's weights and activations, for example from 16-bit floating point to 8-bit or 4-bit integers, which can yield substantial memory and computational savings.
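
To make the idea of reduced precision concrete, here is a minimal NumPy sketch of symmetric per-tensor 8-bit quantization. It is a generic illustration, not the scheme used in the paper; the function names `quantize_int8` and `dequantize` and the random weight matrix are assumptions made for this example.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization of a float array to int8."""
    max_abs = float(np.max(np.abs(x)))
    # One scale for the whole tensor; the largest magnitude maps to 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

# Hypothetical weight matrix standing in for one layer of a model.
rng = np.random.default_rng(0)
weights = rng.standard_normal((256, 256)).astype(np.float32)

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

print("float32 bytes:", weights.nbytes)
print("int8 bytes:   ", q_weights.nbytes)
print("max abs error:", float(np.max(np.abs(weights - restored))))
```

The int8 copy takes a quarter of the float32 storage, and the rounding error of each value is bounded by half the scale. A single outlier enlarges the scale for the whole tensor, which is one reason activation quantization is harder than this simple case suggests.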

According to the paper's abstract, MobileQuant is a simple post-training quantization method that extends earlier work on weight equivalent transformations. It jointly optimizes the weight transformation and the activation range parameters in an end-to-end manner. The emphasis is on activation quantization: 8-bit activations would let LLMs fully exploit mobile-friendly hardware such as Neural Processing Units (NPUs) using integer-only arithmetic.

By applying this method, the researchers report near-lossless quantization on a wide range of LLM benchmarks and a 20%-50% reduction in latency and energy consumption compared to current on-device quantization strategies, making LLMs more practical for deployment on mobile devices. This could enable a wide range of applications, such as personalized language assistants, real-time translation, and on-device text generation, to run directly on users' smartphones and tablets.

Technical Explanation

MobileQuant: Mobile-friendly Quantization for On-device Language Models presents a novel approach for efficiently deploying large language models (LLMs) on mobile devices. The researchers focus on developing quantization techniques that can significantly reduce the model size and inference latency without sacrificing accuracy.

Quantization Techniques

Based on the abstract, the method combines the following elements:

Weight Equivalent Transformation: MobileQuant extends previous work that rescales weights and activations in a mathematically equivalent way so that both are easier to quantize.
Learned Activation Ranges: The activation range parameters are optimized as part of the method, which targets the accuracy drop usually seen when activations are quantized to low bitwidths.
End-to-End Joint Optimization: The weight transformation and the activation ranges are optimized together as a post-training step, which the authors describe as requiring a limited compute budget.
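
The paper's abstract describes MobileQuant as building on weight equivalent transformations. The sketch below illustrates that general idea only: activations are divided by a per-channel scale and the matching weight rows are multiplied by it, so the full-precision output is unchanged while the outlier channel becomes easier to quantize. The square-root scale heuristic, the helper `fake_quant`, and the synthetic data are assumptions for this example; MobileQuant itself is described as learning the transformation and the activation ranges end-to-end, which is not shown here.

```python
import numpy as np

def fake_quant(x: np.ndarray, n_bits: int = 8) -> np.ndarray:
    """Symmetric per-tensor quantize-then-dequantize (simulated quantization)."""
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 16))   # activations: (tokens, in_features)
x[:, 3] *= 50.0                     # synthetic outlier channel
w = rng.standard_normal((16, 32))   # weights: (in_features, out_features)

# Per-input-channel scales from activation and weight statistics.
# Assumption: a hand-picked heuristic, not the paper's learned parameters.
act_max = np.max(np.abs(x), axis=0)
w_max = np.max(np.abs(w), axis=1)
s = np.sqrt(act_max / w_max)

x_t = x / s             # X diag(s)^-1
w_t = w * s[:, None]    # diag(s) W

y_ref = x @ w
# The transformation is exact in full precision.
assert np.allclose(x_t @ w_t, y_ref)

err_plain = float(np.mean((fake_quant(x) @ fake_quant(w) - y_ref) ** 2))
err_transformed = float(np.mean((fake_quant(x_t) @ fake_quant(w_t) - y_ref) ** 2))
print("MSE without transformation:", err_plain)
print("MSE with transformation:   ", err_transformed)
```

Because the scales cancel in the matrix product, the transformation costs nothing at inference time once it is folded into the weights. The choice of scales is what differs between methods.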

Experimental Evaluation

According to the abstract, the researchers evaluated MobileQuant on a wide range of LLM benchmarks and compared it against existing quantization methods, including current on-device quantization strategies.

The abstract reports near-lossless quantization on those benchmarks and a 20%-50% reduction in latency and energy consumption compared to current on-device quantization strategies. It also states that the method requires a limited compute budget and is compatible with mobile-friendly compute units such as NPUs.

Critical Analysis

The MobileQuant paper presents a comprehensive and well-designed approach to enabling efficient deployment of LLMs on mobile devices. The researchers have thoroughly evaluated their techniques and demonstrated substantial improvements in model size and inference latency without sacrificing accuracy.

One potential limitation of the work is that it focuses primarily on quantization and does not explore other model compression techniques, such as pruning or distillation, which could potentially provide complementary benefits. Additionally, the evaluation is limited to a few popular LLM architectures and tasks, and it would be interesting to see how MobileQuant performs on a broader range of models and applications.

Furthermore, while the abstract reports reductions in latency and energy consumption, other practical aspects of deploying MobileQuant-quantized LLMs on real-world mobile devices, such as memory usage and user experience, could still have a significant impact on the real-world usability of the proposed approach.

Conclusion

MobileQuant: Mobile-friendly Quantization for On-device Language Models presents a promising solution for bringing the power of large language models to mobile devices. By jointly optimizing the weight transformation and activation ranges after training, the researchers report that it is possible to reduce the latency and energy consumption of LLMs on mobile hardware while keeping accuracy close to that of the original model.

This work has the potential to enable a wide range of mobile applications that leverage the capabilities of LLMs, such as personalized language assistants, real-time translation, and on-device text generation. As mobile devices continue to play an increasingly central role in our daily lives, research like MobileQuant will be crucial in making advanced AI technologies accessible and usable on these platforms.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MobileQuant: Mobile-friendly Quantization for On-device Language Models

Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez

Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations beyond 16 bits often leads to large computational overheads due to poor on-device quantization support, or a considerable accuracy drop. Yet, 8-bit activations are very attractive for on-device deployment as they would enable LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends previous weight equivalent transformation works by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20%-50% compared to current on-device quantization strategies, 3) requiring limited compute budget, 4) being compatible with mobile-friendly compute units, e.g. NPU.

8/27/2024

Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi

Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency -- a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2$\times$ hardware efficiency improvement compared to 8-bit integer MAC unit.

7/19/2024

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han

Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. To avoid the hardware-inefficient mix-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.

7/19/2024

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024