Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

2405.03594

Published 5/7/2024 by Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh and 2 others

cs.CL cs.AI

Abstract

Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.

Overview

Large language models (LLMs) have revolutionized natural language processing (NLP), but their large size creates computational bottlenecks.
This paper introduces a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity.
The researchers achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset.
The paper also demonstrates training acceleration due to sparsity on Cerebras CS-3 chips, inference acceleration of up to 3x on CPUs using Neural Magic's DeepSparse engine, and 1.7x on GPUs using Neural Magic's nm-vllm engine.
Additional gains are achieved through quantization, leading to a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
The results are demonstrated across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. However, these models are incredibly complex and require a lot of computing power to train and run. This can create bottlenecks, making it difficult to use LLMs in real-world applications.

The researchers in this paper propose a solution to this problem. They have developed a way to create smaller, more efficient versions of LLMs that perform as well as the original models on fine-tuning tasks. They do this by using a technique called "sparsity": a large fraction of the model's weights are set to zero, chosen so that accuracy can be recovered afterwards.

Specifically, the researchers took the LLaMA-2 7B model, a 7-billion-parameter LLM, and used a method called "SparseGPT" to prune it in one shot into a sparse version. They then continued pretraining this sparse model on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset to help it recover its performance.

The result is a model in which up to 70% of the weights are zero, yet which recovers the accuracy of the original dense model after fine-tuning. The researchers also found that the sparse model trains faster on Cerebras CS-3 chips and runs inference up to 3x faster on CPUs and 1.7x faster on GPUs, thanks to the reduced computational load.

The researchers also showed that they could further improve the speed of these sparse models by using a technique called "quantization," which stores the model's numbers at lower precision and so reduces the memory and compute they need. Combining sparsity and quantization gave a total speedup of up to 8.6x on CPUs.

Overall, this work is really exciting because it shows how we can create powerful AI systems that are much more efficient and practical to use in real-world applications. This could open the door to all kinds of new and exciting uses for LLMs in the future.

Technical Explanation

The researchers in this paper introduced a novel approach to create accurate, sparse foundational versions of performant large language models (LLMs) that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity.

They achieved this for the LLaMA-2 7B model by combining two key techniques:

The SparseGPT one-shot pruning method, which prunes the model's weights in a single pass without retraining while limiting the loss in accuracy.
Sparse pretraining of the pruned models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset.
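
The prune-then-train pattern described above can be sketched in a few lines. This is a simplified illustration, not the paper's method: it uses plain magnitude pruning as a stand-in for SparseGPT (which selects and adjusts weights using second-order information), and the function names, the toy weight vector, and the learning rate are all hypothetical. What it does show is the key invariant of sparse pretraining: the pruning mask is fixed once, and every later update reapplies it so pruned weights stay at zero.

```python
import random

def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights.

    Stand-in for SparseGPT: real SparseGPT uses second-order information
    and also updates the surviving weights.
    """
    n_prune = int(round(len(weights) * sparsity))
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:n_prune]:
        mask[i] = 0
    return mask

def apply_mask(weights, mask):
    """Zero out every weight whose mask entry is 0."""
    return [w * m for w, m in zip(weights, mask)]

def sparse_sgd_step(weights, grads, mask, lr=0.1):
    """One SGD step followed by re-masking, so pruned positions stay zero."""
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return apply_mask(updated, mask)

def measured_sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]  # toy "layer"
mask = magnitude_mask(weights, sparsity=0.7)
pruned = apply_mask(weights, mask)

# Hypothetical gradients standing in for one step of sparse pretraining.
grads = [random.gauss(0.0, 1.0) for _ in range(1000)]
trained = sparse_sgd_step(pruned, grads, mask)

print("sparsity after pruning: ", measured_sparsity(pruned))
print("sparsity after training:", measured_sparsity(trained))
```

Without the re-masking in `sparse_sgd_step`, the gradient update would make the pruned weights non-zero again and the sparsity would be lost after the first step.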

The researchers demonstrated that this approach leads to training acceleration due to sparsity on Cerebras CS-3 chips, closely matching theoretical scaling. They also showed inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine.
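
As a rough reference point for these numbers, one can compute the idealized speedup that sparsity would give if runtime scaled exactly with the number of non-zero weights. That proportionality is an assumption made here for illustration; this summary does not spell out how the paper defines "theoretical scaling", and the function name `ideal_speedup` is hypothetical.

```python
def ideal_speedup(sparsity):
    """Speedup if runtime were exactly proportional to the fraction of non-zero weights."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)

for s in (0.5, 0.7):
    print(f"{s:.0%} sparsity -> ideal speedup {ideal_speedup(s):.2f}x")
```

Under this assumption, 50% sparsity gives a ceiling of 2x and 70% sparsity a ceiling of about 3.33x. The reported inference figures (up to 3x on CPUs, 1.7x on GPUs) can be read against that ceiling, although the summary does not state which sparsity level each figure corresponds to; real kernels also pay overhead for storing and indexing the non-zero weights.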

Additionally, the researchers achieved further speedups through the use of quantization, leading to a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
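
To illustrate why quantization composes with sparsity, here is a minimal sketch of symmetric 8-bit quantization applied to an already-pruned weight vector. The bit-width, the per-tensor scaling scheme, and the toy values are assumptions for illustration; this summary does not describe the quantization scheme the paper actually uses.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate floating-point values."""
    return [v * scale for v in q]

# Toy pruned weights: 6 of the 10 entries are already zero.
weights = [0.0, 0.42, 0.0, -1.27, 0.0, 0.0, 0.05, 0.0, 0.0, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

zeros_before = sum(1 for w in weights if w == 0.0)
zeros_after = sum(1 for v in q if v == 0)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("quantized values:", q)
print("zeros before/after:", zeros_before, zeros_after)
print("max round-trip error:", max_error)
```

Because zero maps to zero under symmetric quantization, the pruned positions stay pruned, so an engine can skip them and also process the remaining weights at lower precision.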

The researchers evaluated the performance of their sparse, quantized LLaMA models on a diverse range of challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization, demonstrating the generality of their approach.

Critical Analysis

The researchers have made a significant contribution to the field of large language models by developing a novel approach to create accurate, sparse versions of performant LLMs. The use of the SparseGPT one-shot pruning method and sparse pretraining on carefully curated datasets is a clever way to achieve high levels of sparsity while maintaining model performance.

However, the paper does not provide much information on the specifics of the sparse pretraining process, such as the exact subset of the SlimPajama and The Stack datasets used, or the hyperparameters and training procedures employed. More details in this regard would be helpful for researchers looking to reproduce or build upon this work.

Additionally, the paper does not explore the potential trade-offs between the level of sparsity achieved and the performance of the models on specific tasks. It would be interesting to see how the performance of the sparse models compares to the original LLaMA-2 7B model across a wider range of tasks and metrics.

Finally, while the researchers have demonstrated impressive inference acceleration on both CPUs and GPUs, it would be valuable to understand the energy and power consumption of these sparse, quantized models compared to the original LLaMA-2 7B model. This information could be crucial for real-world deployments, particularly in resource-constrained environments.

Conclusion

This paper presents a novel approach to creating accurate, sparse versions of large language models that deliver significant speedups without sacrificing accuracy. By combining the SparseGPT one-shot pruning method with sparse pretraining on a subset of SlimPajama mixed with a Python subset of The Stack, the researchers developed versions of the LLaMA-2 7B model with up to 70% of their weights pruned that still fully recover accuracy on fine-tuning tasks.

The researchers further demonstrated the practical benefits of these sparse models, showing training acceleration on Cerebras CS-3 chips and inference acceleration of up to 3x on CPUs and 1.7x on GPUs. Additional gains were achieved through quantization, leading to a total speedup on CPUs of up to 8.6x.

This work paves the way for the development of smaller, faster, and more efficient large language models, which could enable a wide range of new applications and use cases for this transformative technology. By addressing the computational bottlenecks inherent in large, complex language models, this research represents an important step forward in making these powerful AI systems more accessible and practical for real-world deployment.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Sparsity-Accelerated Training for Large Language Models

Da Ma, Lu Chen, Pengyu Wang, Hongshen Xu, Hanqi Li, Liangtai Sun, Su Zhu, Shuai Fan, Kai Yu

Large language models (LLMs) have demonstrated proficiency across various natural language processing (NLP) tasks but often require additional training, such as continual pre-training and supervised fine-tuning. However, the costs associated with this, primarily due to their large parameter count, remain high. This paper proposes leveraging sparsity in pre-trained LLMs to expedite this training process. By observing sparsity in activated neurons during forward iterations, we identify the potential for computational speed-ups by excluding inactive neurons. We address associated challenges by extending existing neuron importance evaluation metrics and introducing a ladder omission rate scheduler. Our experiments on Llama-2 demonstrate that Sparsity-Accelerated Training (SAT) achieves comparable or superior performance to standard training while significantly accelerating the process. Specifically, SAT achieves a 45% throughput improvement in continual pre-training and saves 38% training time in supervised fine-tuning in practice. It offers a simple, hardware-agnostic, and easily deployable framework for additional LLM training. Our code is available at https://github.com/OpenDFM/SAT.

6/7/2024

cs.CL

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models

Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian

Various Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have achieved outstanding performance in a wide range of text generation tasks. However, the enormous model sizes have hindered their practical use in real-world applications due to high inference latency. Therefore, improving the efficiency of LLMs through quantization, pruning, and other means has been a key issue in LLM studies. In this work, we propose a method based on Hessian sensitivity-aware mixed sparsity pruning to prune LLMs to at least 50% sparsity without the need for any retraining. It allocates sparsity adaptively based on sensitivity, allowing us to reduce pruning-induced error while maintaining the overall sparsity level. The advantages of the proposed method are even more pronounced when the sparsity is extremely high. Furthermore, our method is compatible with quantization, enabling further compression of LLMs. We have released the available code.

4/24/2024

cs.CL cs.AI

SparseLLM: Towards Global Pruning for Pre-trained Language Models

Guangji Bai, Yijiang Li, Chen Ling, Kibaek Kim, Liang Zhao

The transformative impact of large language models (LLMs) like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to enhance both memory and computational efficiency. Yet, traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality. SparseLLM's approach, which conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition, not only facilitates a pragmatic application on LLMs but also demonstrates significant performance improvements, particularly in high-sparsity regimes where it surpasses current state-of-the-art methods.

5/27/2024

cs.CL

SqueezeLLM: Dense-and-Sparse Quantization

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer

Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is available at https://github.com/SqueezeAILab/SqueezeLLM.

6/6/2024

cs.CL cs.LG