ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Read original: arXiv:2406.05981 - Published 7/26/2024 by Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Celine Lin

Overview

The paper introduces ShiftAddLLM, a technique that accelerates inference in pretrained large language models (LLMs) by reparameterizing their weight matrices so that costly dense multiplications are replaced with shifts and additions.
ShiftAddLLM uses post-training optimization, with no training from scratch or full fine-tuning, to find a shift-and-add representation of the original weight matrices that closely approximates the original model's behavior.
According to the abstract, ShiftAddLLM achieves lower perplexity than the most competitive 3-bit and 2-bit quantized LLMs at comparable or lower latency, and cuts memory and energy by more than 80% relative to the original LLMs, making it a promising approach for deploying LLMs on resource-constrained devices.

Plain English Explanation

The paper presents a technique called ShiftAddLLM that speeds up inference in large language models (LLMs) without significantly reducing their accuracy. LLMs are powerful AI models that can generate human-like text, but they are computationally expensive to run, especially on devices with limited resources such as smartphones or embedded systems.

The key insight behind ShiftAddLLM is to reparameterize the weight matrices inside the LLM in a way that eliminates the need for costly matrix multiplications during inference. Instead, the weights are represented using a combination of simple shift and add operations, which are much faster to perform on hardware. This is achieved through a post-training optimization process that finds the best shift-and-add approximation of the original weight matrices while preserving the model's overall behavior.
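To make the shift-and-add idea concrete, here is a minimal Python sketch. It assumes integer activations, weights restricted to +1 or -1, and a single scaling factor that is a power of two. The function names are hypothetical, and this illustrates only the arithmetic, not the paper's kernels.

```python
def shift_mul(x: int, exponent: int) -> int:
    """Return x * 2**exponent for exponent >= 0 using a left shift."""
    if exponent < 0:
        raise ValueError("this sketch only handles non-negative exponents")
    return x << exponent


def signed_sum(activations, signs):
    """Dot product with weights in {+1, -1}: additions and subtractions only."""
    total = 0
    for x, s in zip(activations, signs):
        total = total + x if s > 0 else total - x
    return total


def shift_add_dot(activations, signs, exponent):
    """Dot product where every weight equals sign * 2**exponent."""
    return shift_mul(signed_sum(activations, signs), exponent)


activations = [3, 5, 2]
signs = [1, 1, -1]      # binary weight pattern (assumed for illustration)
exponent = 2            # scaling factor 2**2 = 4

# Reference result with ordinary multiplications, for comparison only.
reference = sum(x * s * 2**exponent for x, s in zip(activations, signs))
assert shift_add_dot(activations, signs, exponent) == reference == 24
```

The weights here are effectively `[4, 4, -4]`, yet the shift-and-add path never multiplies: it adds or subtracts the activations according to the sign pattern and then applies one shift for the shared scale.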

By avoiding expensive matrix multiplications, ShiftAddLLM can significantly accelerate the inference of LLMs without compromising their accuracy too much. This makes it a promising approach for deploying high-performance LLMs on a wider range of devices, including those with limited computational resources, such as mobile devices or embedded systems.

Technical Explanation

ShiftAddLLM is a post-training reparameterization: it converts the weight matrices of an already-trained LLM into a form whose inference needs no dense multiplications, without training from scratch or full-parameter fine-tuning.

According to the paper's abstract, each weight matrix is quantized into binary matrices paired with group-wise scaling factors. The multiplications that would normally involve these weights then become (1) shifts between the activations and the scaling factors and (2) queries and additions driven by the binary matrices. To limit accuracy loss, a multi-objective optimization minimizes both the weight reparameterization error and the output activation error. An automated bit allocation strategy then assigns bit widths according to each layer's sensitivity to reparameterization, which further reduces memory usage and latency.
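The sketch below shows one way a weight group could be turned into binary codes with power-of-two scales, and how the two error terms (weight error and output error) could be measured. It is a simplified stand-in under stated assumptions: greedy residual binarization replaces the paper's optimizer, each scale is rounded to the nearest power of two so that scaling becomes an exponent shift, and all names and values are hypothetical. It measures the two errors but does not minimize them jointly.

```python
import math


def greedy_binary_code(weights, num_bits):
    """Greedily approximate weights as sum_i scale_i * code_i, code_i in {+1, -1}."""
    residual = list(weights)
    scales, codes = [], []
    for _ in range(num_bits):
        scale = sum(abs(r) for r in residual) / len(residual)
        if scale == 0.0:  # nothing left to approximate
            break
        code = [1 if r >= 0 else -1 for r in residual]
        scales.append(scale)
        codes.append(code)
        residual = [r - scale * b for r, b in zip(residual, code)]
    return scales, codes


def power_of_two_exponent(scale):
    """Exponent e such that 2**e is the power of two nearest to scale in log2."""
    return round(math.log2(scale))


def reconstruct(scales, codes, length):
    """Rebuild the approximate weights from scales and binary codes."""
    approx = [0.0] * length
    for scale, code in zip(scales, codes):
        for j, b in enumerate(code):
            approx[j] += scale * b
    return approx


def shift_add_output(activations, exponents, codes):
    """Compute one output using signed additions and power-of-two scaling only."""
    out = 0.0
    for e, code in zip(exponents, codes):
        partial = sum(x if b > 0 else -x for x, b in zip(activations, code))
        out += math.ldexp(partial, e)  # partial * 2**e, i.e. an exponent shift
    return out


def squared_error(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


weights = [0.9, -0.4, 0.1, -1.2]       # one weight group (made-up values)
calibration = [1.0, 2.0, -1.0, 0.5]    # one calibration activation vector (made up)

scales, codes = greedy_binary_code(weights, num_bits=2)
exponents = [power_of_two_exponent(s) for s in scales]
pow2_scales = [math.ldexp(1.0, e) for e in exponents]

# The two quantities the abstract says are minimized jointly.
weight_error = squared_error(weights, reconstruct(pow2_scales, codes, len(weights)))
exact_output = sum(w * x for w, x in zip(weights, calibration))
output_error = (exact_output - shift_add_output(calibration, exponents, codes)) ** 2
```

Measuring both errors matters because a reparameterization that looks close in weight space can still shift the layer's output on real activations, which is the motivation the abstract gives for a multi-objective formulation.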

The abstract reports experiments on five LLM families and eight tasks. At 3 and 2 bits, ShiftAddLLM improves average perplexity by 5.6 and 22.7 points, respectively, over the most competitive quantized LLMs at comparable or lower latency, and it reduces memory and energy by more than 80% relative to the original LLMs. This makes ShiftAddLLM a promising approach for deploying high-performance LLMs on resource-constrained devices.

Critical Analysis

The evaluation is broad: the abstract describes experiments across five LLM families and eight tasks. Even so, the reported gains are measured against other quantized LLMs, and any accuracy loss relative to the full-precision model may still be unacceptable for applications that require the highest possible model performance.

On memory, the abstract reports reductions of more than 80% relative to the original LLMs, helped by the automated bit allocation strategy. A question worth checking in the full paper is how that footprint compares, at matched accuracy, with other model compression techniques such as conventional quantization, since memory is often the binding constraint on small devices.
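A rough way to reason about weight memory for a binary-code representation is to count the code bits plus the amortized cost of the scaling factors. The sketch below assumes one stored scale per binary matrix per weight group; the group size, scale width, and 16-bit baseline are illustrative assumptions, not figures from the paper, and the estimate covers weights only.

```python
def bits_per_weight(num_bits: int, group_size: int, scale_bits: int) -> float:
    """Average storage per weight: binary codes plus amortized group scales."""
    code_bits = num_bits                                  # one bit per binary matrix
    scale_overhead = num_bits * scale_bits / group_size   # one scale per matrix per group
    return code_bits + scale_overhead


def memory_reduction(num_bits, group_size, scale_bits, baseline_bits=16):
    """Fractional saving relative to a baseline of baseline_bits per weight."""
    return 1.0 - bits_per_weight(num_bits, group_size, scale_bits) / baseline_bits


# Assumed settings for illustration: groups of 128 weights, 16-bit scales.
for q in (2, 3):
    bpw = bits_per_weight(q, group_size=128, scale_bits=16)
    saving = memory_reduction(q, group_size=128, scale_bits=16)
    print(f"{q}-bit: {bpw:.3f} bits/weight, {saving:.1%} smaller than a 16-bit baseline")
```

Larger groups or narrower scales shrink the overhead term, which is one reason the choice of group size matters when comparing low-bit methods.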

Overall, the ShiftAddLLM approach is a promising step towards enabling the efficient deployment of high-performance LLMs on a wider range of devices, and the paper provides a solid foundation for further research and development in this area.

Conclusion

ShiftAddLLM accelerates inference in pretrained large language models (LLMs) by reparameterizing their weight matrices to eliminate costly dense multiplications. By using post-training optimization to find a shift-and-add representation of the original weights, it reports lower perplexity than competitive 3-bit and 2-bit quantized LLMs at comparable or lower latency, along with more than 80% memory and energy reductions over the original models.

This approach has the potential to enable the deployment of high-performance LLMs on resource-constrained devices, such as mobile phones or embedded systems, where computational resources are limited. As the demand for powerful language AI continues to grow, techniques like ShiftAddLLM will become increasingly important for bridging the gap between model capability and real-world deployment.

The authors have provided a well-designed and thorough evaluation of their technique, and the promising results suggest that ShiftAddLLM is a valuable contribution to the ongoing efforts to efficiently compress and quantize large language models for practical applications.

Related Papers

ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Celine Lin

Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs. Codes and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.

7/26/2024

ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer

Haoran You, Huihong Shi, Yipin Guo, Yingyan (Celine) Lin

Vision Transformers (ViTs) have shown impressive performance and have become a unified backbone for multiple vision tasks. However, both the attention mechanism and multi-layer perceptrons (MLPs) in ViTs are not sufficiently efficient due to dense multiplications, leading to costly training and inference. To this end, we propose to reparameterize pre-trained ViTs with a mixture of multiplication primitives, e.g., bitwise shifts and additions, towards a new type of multiplication-reduced model, dubbed ShiftAddViT, which aims to achieve end-to-end inference speedups on GPUs without requiring training from scratch. Specifically, all MatMuls among queries, keys, and values are reparameterized using additive kernels, after mapping queries and keys to binary codes in Hamming space. The remaining MLPs or linear layers are then reparameterized with shift kernels. We utilize TVM to implement and optimize those customized kernels for practical hardware deployment on GPUs. We find that such a reparameterization on attention maintains model accuracy, while inevitably leading to accuracy drops when being applied to MLPs. To marry the best of both worlds, we further propose a new mixture of experts (MoE) framework to reparameterize MLPs by taking multiplication or its primitives as experts, e.g., multiplication and shift, and designing a new latency-aware load-balancing loss. Such a loss helps to train a generic router for assigning a dynamic amount of input tokens to different experts according to their latency. Extensive experiments on various 2D/3D Transformer-based vision tasks consistently validate the effectiveness of our proposed ShiftAddViT, achieving up to 5.18× latency reductions on GPUs and 42.9% energy savings, while maintaining a comparable accuracy as original or efficient ViTs.

6/12/2024

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi

Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technology, binarization can extremely reduce model weights to a mere 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance under ultra-low bit-widths. In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, BiLLM first identifies and structurally selects salient weights, and minimizes the compression loss through an effective binary residual approximation strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an optimal splitting search to group and binarize them accurately. BiLLM achieving for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families and evaluation metrics, outperforms SOTA quantization methods of LLM by significant margins. Moreover, BiLLM enables the binarization process of the LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency. Our code is available at https://github.com/Aaronhuang-778/BiLLM.

5/16/2024

ShiftAddAug: Augment Multiplication-Free Tiny Neural Network with Hybrid Computation

Yipin Guo, Zihao Li, Yilin Lang, Qinyuan Ren

Operators devoid of multiplication, such as Shift and Add, have gained prominence for their compatibility with hardware. However, neural networks (NNs) employing these operators typically exhibit lower accuracy compared to conventional NNs with identical structures. ShiftAddAug uses costly multiplication to augment efficient but less powerful multiplication-free operators, improving performance without any inference overhead. It puts a ShiftAdd tiny NN into a large multiplicative model and encourages it to be trained as a sub-model to obtain additional supervision. In order to solve the weight discrepancy problem between hybrid operators, a new weight sharing method is proposed. Additionally, a novel two stage neural architecture search is used to obtain better augmentation effects for smaller but stronger multiplication-free tiny neural networks. The superiority of ShiftAddAug is validated through experiments in image classification and semantic segmentation, consistently delivering noteworthy enhancements. Remarkably, it secures up to a 4.95% increase in accuracy on the CIFAR100 compared to its directly trained counterparts, even surpassing the performance of multiplicative NNs.

7/4/2024