Compact Language Models via Pruning and Knowledge Distillation

Read original: arXiv:2407.14679 - Published 7/23/2024 by Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov

Overview

Compact Language Models via Pruning and Knowledge Distillation is a research paper that asks whether pruning an existing large language model (LLM) and retraining it on a small fraction (<3%) of the original training data can replace training each model size from scratch.
The key ideas are pruning along several axes (depth, width, attention, and MLP) and knowledge distillation, which transfers knowledge from a larger "teacher" model to a smaller "student" model.
The researchers applied these techniques to the Nemotron-4 family of LLMs, compressing the models by a factor of 2-4x; the resulting Minitron models perform comparably to community models such as Mistral 7B, Gemma 7B, and Llama-3 8B.

Plain English Explanation

Large language models (LLMs) have achieved impressive performance on various natural language tasks. However, these models are very large and require substantial computational resources to train and run. Developers usually release several sizes of a model to suit different deployment targets, and today each size is trained from scratch, which is extremely compute-intensive.

The researchers in this paper explored two main techniques to compress these large models:

Pruning: This involves removing the parts of a model that are deemed less important. In this paper, pruning is applied along several axes: depth (removing layers), width, attention, and MLP. By carefully pruning away parts of the model, it can be made significantly smaller without losing too much accuracy.
Knowledge Distillation: This involves training a smaller "student" model to mimic the behavior of a larger "teacher" model. The student model learns to approximate the outputs of the teacher model, allowing it to achieve similar performance in a more compact form.

By combining these techniques, the researchers derived 8B and 4B models from an already pretrained 15B Nemotron-4 model while preserving a large portion of its capabilities. Because the smaller models are derived from an existing model rather than trained from scratch, they need far fewer training tokens, which lowers the cost of producing a family of model sizes for different hardware.

Technical Explanation

The researchers first explored how to prune an existing LLM, combining depth, width, attention, and MLP pruning. They arrive at their best practices through a detailed empirical exploration of pruning strategies for each axis, methods for combining axes, and search techniques for finding optimal compressed architectures.
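To make the idea of width pruning concrete, here is a minimal sketch that removes the least important hidden neurons from a toy two-layer MLP. The importance score (the L2 norm of each neuron's incoming and outgoing weights) and the names `neuron_importance` and `prune_mlp_width` are assumptions chosen for illustration; this summary does not state which importance criterion the paper uses.

```python
import math

def neuron_importance(w_in, w_out):
    """Score each hidden neuron by the L2 norm of its weights.

    w_in:  one row per hidden neuron (hidden x d_model)
    w_out: one row per output dimension (d_model x hidden)
    The scoring rule is an illustrative assumption, not the paper's criterion.
    """
    scores = []
    for j, row in enumerate(w_in):
        incoming = sum(v * v for v in row)
        outgoing = sum(out_row[j] ** 2 for out_row in w_out)
        scores.append(math.sqrt(incoming + outgoing))
    return scores

def prune_mlp_width(w_in, w_out, keep):
    """Keep the `keep` highest-scoring hidden neurons and drop the rest."""
    scores = neuron_importance(w_in, w_out)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    pruned_in = [w_in[j] for j in kept]
    pruned_out = [[row[j] for j in kept] for row in w_out]
    return pruned_in, pruned_out, kept

# Toy MLP with d_model = 2 and 4 hidden neurons; neurons 1 and 3 are near zero.
w_in = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
w_out = [[1.0, 0.0, -0.6, 0.02], [0.7, 0.01, 0.3, 0.0]]

pruned_in, pruned_out, kept = prune_mlp_width(w_in, w_out, keep=2)
print("kept neurons:", kept)
```

Removing whole neurons shrinks both weight matrices, so the pruned layer is a smaller dense layer rather than a sparse one. Depth pruning follows the same pattern at a coarser granularity, dropping entire layers instead of neurons.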

The pruned model is then retrained using knowledge distillation, in which the smaller "student" model is trained to match the outputs of the larger "teacher" model. This retraining uses only a fraction (<3%) of the original training data. The paper also compares distillation strategies as part of its empirical exploration.
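A minimal sketch of a distillation objective follows, assuming the student is trained to minimize the KL divergence between the teacher's and the student's output distributions over the vocabulary. This is a common formulation; the exact loss and the `temperature` parameter here are assumptions, not details confirmed by this summary.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, subtracting the max for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over one token's output distribution.

    Assumed loss for illustration: zero when the student matches the
    teacher exactly, and positive otherwise.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

# Toy logits over a 3-token vocabulary.
teacher = [2.0, 0.5, -1.0]
close_student = [1.8, 0.6, -0.9]
far_student = [-1.0, 0.5, 2.0]

print("close student loss:", distillation_loss(teacher, close_student))
print("far student loss:  ", distillation_loss(teacher, far_student))
```

In a real training loop this quantity would be averaged over all token positions in a batch and minimized with respect to the student's weights while the teacher stays frozen.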

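The researchers applied their approach to the Nemotron-4 family of LLMs, compressing the models by a factor of 2-4x. Deriving 8B and 4B models from an already pretrained 15B model requires up to 40x fewer training tokens per model than training from scratch, which yields compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). The resulting Minitron models show up to a 16% improvement in MMLU scores compared to training from scratch.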
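As a quick check on the sizes quoted in the paper's abstract (a 15B parent model with 8B and 4B derived models), the following sketch computes the corresponding compression factors. The helper name `compression_factor` is hypothetical.

```python
def compression_factor(parent_params_b, child_params_b):
    """Return how many times smaller the child model is than the parent."""
    return parent_params_b / child_params_b

parent = 15.0  # billions of parameters, as stated in the abstract
for child in (8.0, 4.0):
    factor = compression_factor(parent, child)
    print(f"{parent:.0f}B -> {child:.0f}B: {factor:.2f}x smaller")
```

The two ratios land close to the ends of the 2-4x range that the abstract reports.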

Critical Analysis

The researchers thoroughly explored the trade-offs between model size and performance, providing valuable insights for practitioners who need LLMs at several deployment scales. However, aggressive pruning or distillation could cause problems that the paper does not address, such as the loss of rare but important information or degraded performance on downstream tasks beyond those tested.

Additionally, the evaluation centers on the Nemotron-4 family of models. It would be valuable to see how these methods perform on a wider range of models and applications, including more specialized or domain-specific language models.

Conclusion

This research demonstrates that it is possible to significantly reduce the size of large language models through a combination of pruning and knowledge distillation, without sacrificing too much of their original capabilities and at a fraction of the cost of training each size from scratch. These techniques could make it cheaper to produce models for a wider range of hardware, from large servers to edge devices. As AI systems become more ubiquitous, efficient model compression will be an increasingly important area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Compact Language Models via Pruning and Knowledge Distillation

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov

Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare their performance to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced Minitron model weights on Huggingface, with corresponding supplementary material including example code available on GitHub.

7/23/2024

LLM Pruning and Distillation in Practice: The Minitron Approach

Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov

We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.

8/27/2024

Just CHOP: Embarrassingly Simple LLM Compression

Ananya Harsh Jha, Tom Sherborne, Evan Pete Walsh, Dirk Groeneveld, Emma Strubell, Iz Beltagy

Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint. A growing assortment of methods for compression promises to reduce the computational burden of LLMs in deployment, but so far, only quantization approaches have been demonstrated to be effective for LLM compression while maintaining zero-shot performance. A critical step in the compression process, the pretrain-then-finetune paradigm, has largely been overlooked when adapting existing pruning strategies to LLMs or proposing new ones. In this work, we show that embarrassingly simple layer pruning coupled with an extended language model pretraining as the finetuning phase produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale while being more inference efficient. We call this method LayerChop, where we deterministically remove layers from a model followed by task-agnostic finetuning of the remaining weights by continued self-supervised pretraining. At this scale, we also show how distillation, which has been super effective in task-agnostic compression of smaller BERT-style models, becomes inefficient against our simple pruning technique.

7/11/2024

Contemporary Model Compression on Large Language Models Inference

Dong Liu

Large Language Models (LLMs) have revolutionized natural language processing by achieving state-of-the-art results across a variety of tasks. However, the computational demands of LLM inference, including high memory consumption and slow processing speeds, pose significant challenges for real-world applications, particularly on resource-constrained devices. Efficient inference is crucial for scaling the deployment of LLMs to a broader range of platforms, including mobile and edge devices. This survey explores contemporary techniques in model compression that address these challenges by reducing the size and computational requirements of LLMs while maintaining their performance. We focus on model-level compression methods, including quantization, knowledge distillation, and pruning, as well as system-level optimizations like KV cache efficient design. Each of these methodologies offers a unique approach to optimizing LLMs, from reducing numerical precision to transferring knowledge between models and structurally simplifying neural networks. Additionally, we discuss emerging trends in system-level design that further enhance the efficiency of LLM inference. This survey aims to provide a comprehensive overview of current advancements in model compression and their potential to make LLMs more accessible and practical for diverse applications.

9/4/2024