LLM Pruning and Distillation in Practice: The Minitron Approach

Read original: arXiv:2408.11796 - Published 8/27/2024 by Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov


Overview

This paper reports on applying the Minitron approach, which combines structured pruning with knowledge distillation, to compress Llama 3.1 8B and Mistral NeMo 12B to 4B and 8B parameters, respectively.
The authors explore two pruning strategies, depth pruning and joint hidden/attention/MLP (width) pruning, and retrain each pruned model by distilling from a teacher model.
The reported benefits are substantially smaller models that remain competitive on common benchmarks: the authors describe the 4B model as compelling and the Mistral-NeMo-Minitron-8B model as state-of-the-art.

Plain English Explanation

The researchers developed a new way to make large language models (LLMs) smaller and faster, while still maintaining their performance. LLMs are powerful AI models that can understand and generate human-like text, but they are often very large and computationally intensive, making them difficult to use in real-world applications.

The Minitron approach works by taking a large LLM, pruning away its less important parts, and then "distilling" the original model's knowledge into the smaller model that remains. The pruned model (the student) is trained to mimic the larger model (the teacher), so it requires less computing power and memory to run while recovering much of the accuracy lost to pruning.

The key idea is that a smaller model derived from an already trained LLM can retain much of the original's capability while substantially reducing model size and the compute needed to run it. This makes the LLM more practical to use in things like mobile apps, edge devices, or other applications where computational resources are limited.

The paper evaluates the compressed models on common benchmarks from the LM Evaluation Harness, both as base models and as instruct-tuned versions aligned with NeMo Aligner. The authors describe the 4B model derived from Llama 3.1 8B as compelling and the 8B model derived from Mistral NeMo 12B as state-of-the-art, which suggests that the Minitron approach could be a valuable tool for making powerful LLMs more accessible and usable in real-world applications.

Technical Explanation

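The Minitron approach begins by taking a large, pre-trained LLM and using structured pruning to remove its least important components. The paper explores two strategies: depth pruning, which removes entire layers, and width pruning, which jointly shrinks the hidden, attention, and MLP dimensions while keeping the number of layers. The retained weights initialize a single smaller model.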
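The two pruning strategies named in the paper's abstract, depth pruning and width pruning, can be illustrated with a minimal pure-Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the importance score (mean absolute activation over a small calibration batch) is one simple choice rather than the authors' exact criterion, and the "model" is a toy two-matrix MLP stored as nested lists.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def neuron_importance(activations: Matrix) -> List[float]:
    """Mean absolute activation per neuron (rows are tokens, columns are neurons).

    Assumption: this scoring rule is illustrative, not the paper's exact metric.
    """
    n_tokens = len(activations)
    n_neurons = len(activations[0])
    return [
        sum(abs(row[j]) for row in activations) / n_tokens
        for j in range(n_neurons)
    ]


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest scores, returned in their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def width_prune(w_in: Matrix, w_out: Matrix, keep: List[int]) -> Tuple[Matrix, Matrix]:
    """Width pruning of a toy MLP: keep only the selected neurons.

    w_in has shape [hidden][neurons]; w_out has shape [neurons][hidden].
    """
    pruned_in = [[row[j] for j in keep] for row in w_in]
    pruned_out = [w_out[j] for j in keep]
    return pruned_in, pruned_out


def depth_prune(layers: List[str], layer_scores: List[float], n_keep: int) -> List[str]:
    """Depth pruning: drop the lowest-scoring layers, keep the rest in order."""
    return [layers[i] for i in top_k_indices(layer_scores, n_keep)]


# Toy calibration batch: 2 tokens, 4 MLP neurons.
acts = [[0.5, -2.0, 0.0, 1.0],
        [1.5, 1.0, 0.0, -3.0]]
scores = neuron_importance(acts)   # [1.0, 1.5, 0.0, 2.0]
keep = top_k_indices(scores, 2)    # [1, 3]

w_in = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0]]
w_out = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
small_in, small_out = width_prune(w_in, w_out, keep)

kept_layers = depth_prune(["L0", "L1", "L2", "L3"], [0.9, 0.1, 0.4, 0.8], 2)
```

Width pruning leaves the layer count unchanged and narrows each weight matrix, while depth pruning leaves the matrices untouched and shortens the stack. In both cases the surviving weights are copied from the original model, which is what gives the smaller model its starting point.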

The pruned model is then retrained using knowledge distillation, where it acts as a student that learns to mimic the behavior of the larger teacher model. This recovers much of the accuracy lost to pruning, in a more compact and efficient form.
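Knowledge distillation of this kind is often implemented as a divergence between the teacher's and the student's next-token distributions. The sketch below shows one common formulation, the forward KL divergence KL(teacher || student) averaged over token positions. It is an assumption that this matches the authors' exact loss; the function names and the `temperature` parameter are hypothetical and chosen for illustration.

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def forward_kl(teacher_logits: List[float], student_logits: List[float],
               temperature: float = 1.0) -> float:
    """KL(teacher || student) for a single token position."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def distillation_loss(teacher_batch: List[List[float]],
                      student_batch: List[List[float]],
                      temperature: float = 1.0) -> float:
    """Mean forward KL across all token positions in a batch."""
    losses = [
        forward_kl(t, s, temperature)
        for t, s in zip(teacher_batch, student_batch)
    ]
    return sum(losses) / len(losses)


# Two token positions over a 3-word toy vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
student = [[1.5, 0.7, -0.5], [0.2, 0.8, 0.1]]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it pulls the pruned model's predictions toward those of the teacher. In a real training loop the logits would come from forward passes of the two models and the loss would be minimized by gradient descent on the student's weights only.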

The paper highlights several key elements of the Minitron approach:

Depth and Width Pruning: The researchers explore two distinct pruning strategies, depth pruning and joint hidden/attention/MLP (width) pruning, and evaluate the results of each.
Teacher Fine-Tuning: Because the original training data is not available, the teacher models are slightly fine-tuned on the distillation dataset, which the authors found to be beneficial.
Alignment: The compressed models are aligned with NeMo Aligner and tested in instruct-tuned versions.

The experiments compress Llama 3.1 8B to 4B parameters and Mistral NeMo 12B to 8B parameters, reductions of 2x and 1.5x in parameter count, respectively. The compressed models are evaluated on common benchmarks from the LM Evaluation Harness.

Critical Analysis

The Minitron approach presents a promising solution for making large language models more practical and accessible. By pruning a large LLM and distilling its knowledge into the resulting smaller model, the researchers have addressed a key challenge in the deployment of these powerful AI systems.

However, the trade-offs involved in the Minitron approach deserve closer analysis. For example, it is not clear from aggregate benchmark scores alone how the pruned models compare to the original LLM on specific tasks, or how to choose between depth pruning and width pruning for a given deployment.

Additionally, potential limitations of the Minitron approach merit discussion, such as the compute needed for teacher fine-tuning and distillation, or the impact of pruning and distillation on the interpretability and explainability of the model.

Further research and experimentation may be needed to fully understand the strengths, weaknesses, and practical applications of the Minitron approach, and to explore potential improvements or extensions to the method.

Conclusion

The Minitron approach described in this paper is a practical recipe for large language model pruning and distillation. By pruning a larger LLM and distilling its knowledge into the smaller model that results, the researchers have demonstrated a workable way to make these powerful AI systems more accessible and usable in real-world applications.

The key benefits of the Minitron approach, reduced model size with competitive benchmark performance and base model weights released on Hugging Face under a permissive license, suggest that it could ease the deployment and adoption of large language models across a wide range of industries and use cases. As the field of AI continues to evolve, the Minitron approach may serve as a valuable tool for building compact yet capable models from existing ones.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

LLM Pruning and Distillation in Practice: The Minitron Approach

Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov

We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B for brevity) model from Mistral NeMo 12B. We found that with no access to the original data, it is beneficial to slightly fine-tune teacher models on the distillation dataset. We open-source our base model weights on Hugging Face with a permissive license.

8/27/2024

Compact Language Models via Pruning and Knowledge Distillation

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov

Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare their performance to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced Minitron model weights on Huggingface, with corresponding supplementary material including example code available on GitHub.

7/23/2024

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song

Structured pruning of modern large language models (LLMs) has emerged as a way of decreasing their high computational needs. Width pruning reduces the size of projection weight matrices (e.g., by removing attention heads) while maintaining the number of layers. Depth pruning, in contrast, removes entire layers or blocks, while keeping the size of the remaining weights unchanged. Most current research focuses on either width-only or a blend of width and depth pruning, with little comparative analysis between the two units (width vs. depth) concerning their impact on LLM inference efficiency. In this work, we show that simple depth pruning can effectively compress LLMs while achieving comparable or superior performance to recent width pruning studies. Our pruning method boosts inference speeds, especially under memory-constrained conditions that require limited batch sizes for running LLMs, where width pruning is ineffective. In retraining pruned models for quality recovery, continued pretraining on a large corpus markedly outperforms LoRA-based tuning, particularly at severe pruning ratios. We hope this work can help build compact yet capable LLMs. Code and models can be found at: https://github.com/Nota-NetsPresso/shortened-llm

6/26/2024

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs

4/12/2024