SqueezeLLM: Dense-and-Sparse Quantization

2306.07629

Published 6/6/2024 by Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

cs.CL cs.LG

SqueezeLLM: Dense-and-Sparse Quantization

Abstract

Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is available at https://github.com/SqueezeAILab/SqueezeLLM.

Create account to get full access

Overview

The paper presents a novel technique called "SqueezeLLM" for compressing large language models (LLMs) using a combination of dense and sparse quantization.
The proposed method aims to significantly reduce the memory footprint and inference latency of LLMs without sacrificing their performance.
The paper demonstrates the effectiveness of SqueezeLLM on several benchmark tasks, showcasing its ability to achieve high compression rates while maintaining model accuracy.

Plain English Explanation

Large language models (LLMs) like BERT and GPT have become increasingly powerful, but they also require a lot of memory and computing power to run. This can make it challenging to deploy them on resource-constrained devices like smartphones or edge devices.

The researchers behind SqueezeLLM have come up with a way to "squeeze" these large models down to a much smaller size, without losing too much of their performance. They do this by using a combination of two techniques: dense quantization and sparse quantization.

Dense quantization involves reducing the precision of the model's numerical parameters, such as the weights and activations, from 32-bit floating-point numbers to lower-precision formats like 8-bit integers. This can significantly reduce the model's memory footprint, but it also has the potential to degrade the model's accuracy.

Sparse quantization, on the other hand, involves identifying the least important parameters in the model and removing them entirely. This can further reduce the model's size and improve its efficiency, while potentially having a smaller impact on accuracy than dense quantization alone.

By combining these two techniques, the researchers were able to create a highly compressed version of the model, called SqueezeLLM, that still performed well on a variety of benchmark tasks. This could make it easier to deploy LLMs on devices with limited resources, opening up new possibilities for real-world applications.

Technical Explanation

The paper presents a novel technique called "SqueezeLLM" for compressing large language models (LLMs) using a combination of dense and sparse quantization. The key elements of the proposed approach are as follows:

Dense Quantization: The researchers leverage SLIM-LLM, a salience-driven mixed-precision quantization method, to reduce the numerical precision of the model's parameters from 32-bit floating-point to lower-bit formats, such as 8-bit integers. This significantly reduces the model's memory footprint without introducing substantial accuracy degradation.

Sparse Quantization: In addition to dense quantization, the researchers apply One-Shot Sensitivity-Aware Mixed Sparsity Pruning to identify and remove the least important parameters in the model. This further reduces the model's size and improves its inference efficiency.

Balanced Combination: The key innovation in SqueezeLLM is the balanced combination of dense and sparse quantization. The researchers carefully tune the trade-off between these two techniques to achieve high compression rates while maintaining the model's accuracy and performance.

The paper evaluates SqueezeLLM on several benchmark tasks, including language modeling, question answering, and natural language inference. The results demonstrate that SqueezeLLM can achieve up to 10x reduction in model size and up to 5x improvement in inference latency, all while preserving the model's performance.

Critical Analysis

The paper presents a well-designed and thorough evaluation of the SqueezeLLM technique. The researchers have carefully considered the trade-offs between model compression and accuracy, and have demonstrated the effectiveness of their approach on a range of benchmark tasks.

One potential limitation of the study is that it focuses mainly on the compression and inference efficiency of the models, without delving into the broader implications or real-world applications of the technology. It would be interesting to see how SqueezeLLM performs in more practical scenarios, such as on-device inference or edge computing applications.

Additionally, while the paper discusses the potential for further improvements in compression rates, it does not provide a clear roadmap for how these could be achieved. It would be valuable for the researchers to outline potential avenues for future work, such as exploring more advanced quantization techniques or investigating the scalability of the approach to larger language models.

Overall, the SqueezeLLM technique represents a significant contribution to the field of LLM compression and optimization, and the paper provides a solid foundation for further research and development in this area.

Conclusion

The SqueezeLLM paper presents a novel technique for compressing large language models using a combination of dense and sparse quantization. By carefully balancing these two approaches, the researchers have demonstrated the ability to achieve high compression rates while maintaining model accuracy and performance.

This work has important implications for the deployment of LLMs in resource-constrained environments, such as on-device inference or edge computing applications. By reducing the memory footprint and inference latency of these powerful models, SqueezeLLM could enable a wider range of real-world applications and expand the reach of advanced language technologies.

As the field of LLM compression and optimization continues to evolve, the insights and techniques presented in this paper will likely serve as a valuable reference for researchers and practitioners working to push the boundaries of model efficiency and deployability.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

SDQ: Sparse Decomposed Quantization for LLM Inference

Geonhwa Jeong, Po-An Tsai, Stephen W. Keckler, Tushar Krishna

Recently, large language models (LLMs) have shown surprising performance in task-specific workloads as well as general tasks with the given prompts. However, to achieve unprecedented performance, recent LLMs use billions to trillions of parameters, which hinder the wide adaptation of those models due to their extremely large compute and memory requirements. To resolve the issue, various model compression methods are being actively investigated. In this work, we propose SDQ (Sparse Decomposed Quantization) to exploit both structured sparsity and quantization to achieve both high compute and memory efficiency. From our evaluations, we observe that SDQ can achieve 4x effective compute throughput with <1% quality drop.

6/21/2024

cs.LG cs.AI

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tuned LLMs and the relationship between perplexity and benchmark performance of quantized LLMs are not well understood. Evaluation of quantized LLMs is often limited to language modeling and a few classification tasks, leaving their performance on other benchmarks unclear. To address these gaps, we propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge & capacity, (2) alignment, and (3) efficiency, and conduct extensive experiments across ten diverse benchmarks. Our experimental results indicate that LLMs with 4-bit quantization can retain performance comparable to their non-quantized counterparts, and perplexity can serve as a proxy metric for quantized LLMs on most benchmarks. Furthermore, quantized LLMs with larger parameter scales can outperform smaller LLMs. Despite the memory savings achieved through quantization, it can also slow down the inference speed of LLMs. Consequently, substantial engineering efforts and hardware support are imperative to achieve a balanced optimization of decoding speed and memory consumption in the context of quantized LLMs.

6/7/2024

cs.CL cs.AI

💬

On the Compressibility of Quantized Large Language Models

Yu Mao, Weilan Wang, Hongchao Du, Nan Guan, Chun Jason Xue

Deploying Large Language Models (LLMs) on edge or mobile devices offers significant benefits, such as enhanced data privacy and real-time processing capabilities. However, it also faces critical challenges due to the substantial memory requirement of LLMs. Quantization is an effective way of reducing the model size while maintaining good performance. However, even after quantization, LLMs may still be too big to fit entirely into the limited memory of edge or mobile devices and have to be partially loaded from the storage to complete the inference. In this case, the I/O latency of model loading becomes the bottleneck of the LLM inference latency. In this work, we take a preliminary step of studying applying data compression techniques to reduce data movement and thus speed up the inference of quantized LLM on memory-constrained devices. In particular, we discussed the compressibility of quantized LLMs, the trade-off between the compressibility and performance of quantized LLMs, and opportunities to optimize both of them jointly.

5/7/2024

cs.LG cs.AI cs.CL

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, Xiaojuan Qi

Large language models (LLMs) achieve remarkable performance in natural language understanding but require substantial computation and memory resources. Post-training quantization (PTQ) is a powerful compression technique extensively investigated in LLMs. However, existing PTQ methods are still not ideal in terms of accuracy and efficiency, especially with below 4 bit-widths. Standard PTQ methods using group-wise quantization suffer difficulties in quantizing LLMs accurately to such low-bit, but advanced methods remaining high-precision weights element-wisely are hard to realize their theoretical hardware efficiency. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM. The scheme exploits the salience distribution of weights to determine optimal bit-width and quantizers for accurate LLM quantization, while aligning bit-width partition to groups for compact memory usage and fast integer inference. Specifically, the proposed SliM-LLM mainly relies on two novel techniques: (1) Salience-Determined Bit Allocation utilizes the clustering characteristics of salience distribution to allocate the bit-widths of each group, increasing the accuracy of quantized LLMs and maintaining the inference efficiency; (2) Salience-Weighted Quantizer Calibration optimizes the parameters of the quantizer by considering the element-wise salience within the group, balancing the maintenance of salient information and minimization of errors. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of LLMs at ultra-low bits, e.g., 2-bit LLaMA-7B achieves a 5.5-times memory-saving than original model on NVIDIA A800 GPUs, and 48% decrease of perplexity compared to the state-of-the-art gradient-free PTQ method. Moreover, SliM-LLM+, which is integrated from the extension of SliM-LLM with gradient-based quantizers, further reduces perplexity by 35.1%.

5/27/2024

cs.LG cs.CL