Extreme Compression of Large Language Models via Additive Quantization

2401.06118

Published 6/11/2024 by Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh

Abstract

The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression -- defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter -- from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state-of-the-art in LLM compression, via two innovations: 1) learned additive quantization of weight matrices in input-adaptive fashion, and 2) joint optimization of codebook parameters across each transformer block. Broadly, AQLM is the first scheme that is Pareto optimal in terms of accuracy-vs-model-size when compressing to less than 3 bits per parameter, and significantly improves upon all known schemes in the extreme compression (2bit) regime. In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed, while executing in a much smaller memory footprint.

Overview

The paper presents AQLM, a compression algorithm that generalizes classic Additive Quantization (AQ) to large language models (LLMs), targeting 2 to 3 bits per parameter while keeping accuracy loss small.
AQLM represents each small group of weights as a sum of codewords drawn from several learned codebooks, so only the codeword indices and the codebooks need to be stored.
The authors evaluate AQLM on open LLMs from the Llama 2 family and on Mixtral, and report that it is Pareto optimal in accuracy versus model size below 3 bits per parameter.

Plain English Explanation

Large language models (LLMs) like BERT and GPT-2 have become incredibly powerful, but they also take up a lot of space on our computers and phones. This makes it difficult to use them on devices with limited storage, like smartphones or embedded systems.

The researchers in this paper came up with a new way to compress these large models, called AQLM, which builds on an older technique known as Additive Quantization (AQ). The key idea is to describe the model's weights using entries from a few small lookup tables. Each short run of weights is stored as a handful of table indices, and the matching table entries are added together to reconstruct an approximation of the original weights.

The appeal of AQLM is that it can store a model at 2 to 3 bits per parameter, roughly 5 to 8 times smaller than the usual 16-bit format, with less accuracy loss than earlier methods at those sizes. This makes it more practical to run these powerful language models on devices with limited memory, opening up possibilities for applications like real-time translation or voice assistants on smartphones.

Technical Explanation

The paper proposes AQLM, a compression algorithm that adapts classic Additive Quantization (AQ) from information retrieval to large language models (LLMs), reducing their size to 2 to 3 bits per parameter with limited accuracy loss.

The key idea behind additive quantization is to split each weight matrix into small groups of weights and to approximate every group by a sum of codewords, one chosen from each of several learned codebooks. Only the codeword indices (the codes) and the codebooks themselves are stored, which takes far fewer bits than the original floating-point weights.
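The decomposition described above can be illustrated with a minimal NumPy sketch. The function names, array layout, and the sizes `M`, `K`, and `g` are assumptions chosen for illustration, not the authors' implementation; the bit count covers only the stored codes and ignores the codebooks and any scaling factors.

```python
import numpy as np

def decode_additive(codes, codebooks):
    """Reconstruct weight groups as sums of codewords (illustrative helper).

    codes:     (num_groups, M) integer indices, one per codebook
    codebooks: (M, K, g) array, K codewords of length g in each codebook
    returns:   (num_groups, g) reconstructed weight groups
    """
    M = codebooks.shape[0]
    # Look up one codeword per codebook for every group, then add them up.
    return sum(codebooks[m][codes[:, m]] for m in range(M))

def bits_per_weight(M, K, g):
    """Bits spent on codes per weight, ignoring codebook storage."""
    return M * np.log2(K) / g

rng = np.random.default_rng(0)
M, K, g = 2, 256, 8  # sizes chosen only for this example
codebooks = rng.normal(size=(M, K, g)).astype(np.float32)
codes = rng.integers(0, K, size=(1024, M))

# 1024 groups of 8 weights fill a 128 x 64 weight matrix.
W_hat = decode_additive(codes, codebooks).reshape(128, 64)
print(W_hat.shape, bits_per_weight(M, K, g))
```

With two codebooks of 256 entries and groups of 8 weights, each group costs 2 x 8 = 16 bits of codes, which works out to 2 bits per weight.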

Specifically, the authors formulate compression as an optimization problem: find the codes and codebooks whose sums best approximate the original weights. The fit is input-adaptive, meaning the error is measured on the layer's outputs for calibration inputs rather than on the weights alone. They solve this problem with an alternating minimization algorithm that iteratively updates the code assignments and the codebooks, and then jointly optimize the codebook parameters across each transformer block.
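A toy version of such an alternating scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: it minimizes plain weight reconstruction error instead of an input-adaptive objective, updates codes greedily one codebook at a time, and refits the codebooks with ordinary least squares. The name `fit_additive` and all sizes are hypothetical.

```python
import numpy as np

def reconstruct(codes, codebooks):
    """Sum the selected codeword from each codebook for every group."""
    return sum(codebooks[m][codes[:, m]] for m in range(codebooks.shape[0]))

def fit_additive(W, M=2, K=16, iters=10, seed=0):
    """Toy alternating minimization of ||W - sum_m C_m[b_m]||^2."""
    rng = np.random.default_rng(seed)
    N, g = W.shape
    codebooks = rng.normal(scale=W.std(), size=(M, K, g))
    codes = rng.integers(0, K, size=(N, M))
    errors = []
    for _ in range(iters):
        # Step 1: update codes for one codebook at a time, others held fixed.
        for m in range(M):
            # Target that codebook m alone has to explain.
            residual = W - reconstruct(codes, codebooks) + codebooks[m][codes[:, m]]
            dists = ((residual[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)
            codes[:, m] = dists.argmin(axis=1)
        # Step 2: with codes fixed, refit all codebooks jointly by least squares.
        A = np.zeros((N, M * K))
        for m in range(M):
            A[np.arange(N), m * K + codes[:, m]] = 1.0
        codebooks = np.linalg.lstsq(A, W, rcond=None)[0].reshape(M, K, g)
        errors.append(float(((W - reconstruct(codes, codebooks)) ** 2).sum()))
    return codes, codebooks, errors

rng = np.random.default_rng(1)
W = rng.normal(size=(512, 8))  # 512 weight groups of 8 values each
codes, codebooks, errors = fit_additive(W)
print(errors[0], errors[-1])
```

Neither step can increase the squared error, so the recorded error is non-increasing across iterations. The greedy code update can still get stuck in poor local optima, so a practical method needs a stronger search over codes than this sketch uses.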

The authors evaluate AQLM on open LLMs from the Llama 2 family and on Mixtral. They report that AQLM is Pareto optimal in accuracy versus model size when compressing below 3 bits per parameter, and that it significantly improves on prior schemes in the 2-bit regime.
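To put the bit widths mentioned in the abstract in perspective, the short calculation below compares weight storage at different precisions against a 16-bit baseline. The 7-billion-parameter model size is a hypothetical example, and the figures count weights only: codebooks, scaling factors, and any layers kept at higher precision are ignored.

```python
def compressed_size_gb(num_params, bits_per_param):
    """Weight storage in gigabytes (10^9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9

def compression_ratio(bits_per_param, baseline_bits=16):
    """Size reduction relative to a baseline precision (FP16 by default)."""
    return baseline_bits / bits_per_param

num_params = 7e9  # hypothetical 7-billion-parameter model
for bits in (16, 4, 3, 2):
    size = compressed_size_gb(num_params, bits)
    ratio = compression_ratio(bits)
    print(f"{bits:>2} bits: {size:6.2f} GB, {ratio:.2f}x smaller than FP16")
```

Under these assumptions, 2 bits per parameter corresponds to an 8x reduction and 3 bits to roughly 5.3x, which is the range the "extreme" compression regime refers to.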

Critical Analysis

The paper presents a compelling approach to compressing large language models, but there are a few potential limitations and areas for further research:

The authors focus on post-training compression, which may not be as effective as techniques that incorporate quantization into the training process, such as quantization-aware training (QAT), although those come with much higher training costs.
The experiments are limited to a few popular LLMs, and it's unclear how well the AQ technique would generalize to other model architectures or tasks beyond language modeling.
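The reported speed results concern token generation, where the authors' GPU and CPU implementations match or outperform optimized FP16 implementations. The abstract does not discuss the energy consumption of the compressed models, which is also a crucial factor for real-world deployment.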
The authors could have explored the trade-offs between compression rate and model performance in more depth, providing guidelines for practitioners on how to choose the appropriate level of compression for their specific use cases.

Conclusion

The AQLM technique presented in this paper is a promising approach for compressing large language models, allowing them to be deployed on a wider range of devices with limited memory. By representing groups of weights as sums of codewords from learned codebooks, AQLM reaches 2 to 3 bits per parameter while remaining Pareto optimal in accuracy versus model size below 3 bits.

This work has the potential to unlock new applications for LLMs, such as real-time translation or voice assistants on smartphones. However, further research is needed to address the limitations and explore the broader implications of this compression technique, including its energy consumption on end-user devices.

Overall, the paper makes a valuable contribution to the field of model compression, demonstrating the effectiveness of AQLM and paving the way for more compact and efficient large language models in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks

Andrei Tomut, Saeed S. Jahromi, Abhijoy Sarkar, Uygar Kurt, Sukhbinder Singh, Faysal Ishtiaq, Cesar Muñoz, Prabdeep Singh Bajaj, Ali Elborady, Gianni del Bimbo, Mehrazin Alizadeh, David Montero, Pablo Martin-Ramiro, Muhammad Ibrahim, Oussama Tahiri Alaoui, John Malcolm, Samuel Mugel, Roman Orus

Large Language Models (LLMs) such as ChatGPT and LlaMA are advancing rapidly in generative Artificial Intelligence (AI), but their immense size poses significant challenges, such as huge training and inference costs, substantial energy demands, and limitations for on-site deployment. Traditional compression methods such as pruning, distillation, and low-rank approximation focus on reducing the effective number of neurons in the network, while quantization focuses on reducing the numerical precision of individual weights to reduce the model size while keeping the number of neurons fixed. While these compression methods have been relatively successful in practice, there is no compelling reason to believe that truncating the number of neurons is an optimal strategy. In this context, this paper introduces CompactifAI, an innovative LLM compression approach using quantum-inspired Tensor Networks that focuses on the model's correlation space instead, allowing for a more controlled, refined and interpretable model compression. Our method is versatile and can be implemented with - or on top of - other compression techniques. As a benchmark, we demonstrate that a combination of CompactifAI with quantization allows to reduce a 93% the memory size of LlaMA 7B, reducing also 70% the number of parameters, accelerating 50% the training and 25% the inference times of the model, and just with a small accuracy drop of 2% - 3%, going much beyond of what is achievable today by other compression techniques. Our methods also allow to perform a refined layer sensitivity profiling, showing that deeper layers tend to be more suitable for tensor network compression, which is compatible with recent observations on the ineffectiveness of those layers for LLM performance. Our results imply that standard LLMs are, in fact, heavily overparametrized, and do not need to be large at all.

5/14/2024

cs.CL cs.AI cs.LG

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang

Large Language Models (LLMs) excel in NLP, but their demands hinder their widespread deployment. While Quantization-Aware Training (QAT) offers a solution, its extensive training costs make Post-Training Quantization (PTQ) a more practical approach for LLMs. In existing studies, activation outliers in particular channels are identified as the bottleneck to PTQ accuracy. They propose to transform the magnitudes from activations to weights, which however offers limited alleviation or suffers from unstable gradients, resulting in a severe performance drop at low-bitwidth. In this paper, we propose QLLM, an accurate and efficient low-bitwidth PTQ method designed for LLMs. QLLM introduces an adaptive channel reassembly technique that reallocates the magnitude of outliers to other channels, thereby mitigating their impact on the quantization range. This is achieved by channel disassembly and channel assembly, which first breaks down the outlier channels into several sub-channels to ensure a more balanced distribution of activation magnitudes. Then similar channels are merged to maintain the original channel number for efficiency. Additionally, an adaptive strategy is designed to autonomously determine the optimal number of sub-channels for channel disassembly. To further compensate for the performance loss caused by quantization, we propose an efficient tuning method that only learns a small number of low-rank weights while freezing the pre-trained quantized model. After training, these low-rank parameters can be fused into the frozen weights without affecting inference. Extensive experiments on LLaMA-1 and LLaMA-2 show that QLLM can obtain accurate quantized models efficiently. For example, QLLM quantizes the 4-bit LLaMA-2-70B within 10 hours on a single A100-80G GPU, outperforming the previous state-of-the-art method by 7.89% on the average accuracy across five zero-shot tasks.

4/9/2024

cs.CL cs.AI cs.LG

CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs

Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Parameter quantization for Large Language Models (LLMs) has attracted increasing attentions recently in reducing memory costs and improving computational efficiency. Early approaches have been widely adopted. However, the existing methods suffer from poor performance in low-bit (such as 2 to 3 bits) scenarios. In this paper, we present a novel and effective Column-Level Adaptive weight Quantization (CLAQ) framework by introducing three different types of adaptive strategies for LLM quantization. Firstly, a K-Means clustering based algorithm is proposed that allows dynamic generation of quantization centroids for each column of a parameter matrix. Secondly, we design an outlier-guided adaptive precision search strategy which can dynamically assign varying bit-widths to different columns. Finally, a dynamic outlier reservation scheme is developed to retain some parameters in their original float point precision, in trade off of boosted model performance. Experiments on various mainstream open source LLMs including LLaMA-1, LLaMA-2 and Yi demonstrate that our methods achieve the state-of-the-art results across different bit settings, especially in extremely low-bit scenarios. Code is available at https://github.com/fayuge/CLAQ.

6/4/2024

cs.LG

ApiQ: Finetuning of 2-Bit Quantized Large Language Model

Baohao Liao, Christian Herold, Shahram Khadivi, Christof Monz

Memory-efficient finetuning of large language models (LLMs) has recently attracted huge attention with the increasing size of LLMs, primarily due to the constraints posed by GPU memory limitations and the effectiveness of these methods compared to full finetuning. Despite the advancements, current strategies for memory-efficient finetuning, such as QLoRA, exhibit inconsistent performance across diverse bit-width quantizations and multifaceted tasks. This inconsistency largely stems from the detrimental impact of the quantization process on preserved knowledge, leading to catastrophic forgetting and undermining the utilization of pretrained models for finetuning purposes. In this work, we introduce a novel quantization framework, ApiQ, designed to restore the lost information from quantization by concurrently initializing the LoRA components and quantizing the weights of LLMs. This approach ensures the maintenance of the original LLM's activation precision while mitigating the error propagation from shallower into deeper layers. Through comprehensive evaluations conducted on a spectrum of language tasks with various LLMs, ApiQ demonstrably minimizes activation error during quantization. Consequently, it consistently achieves superior finetuning results across various bit-widths.

6/24/2024

cs.LG cs.CL